Activation Functions & Optimization
1. Which activation function is zero-centred with output range (-1, 1)?
2. What is a potential drawback of using ReLU activation?
3. Why can Sigmoid activation slow down training?
4. What happens to a neuron affected by the 'dying ReLU' problem?
5. Which optimizer combines momentum and adaptive learning rates?
6. How does the Adam optimizer differ from standard SGD?
7. What is the main advantage of using adaptive optimizers like Adam?
8. What is the parameter update rule in Stochastic Gradient Descent?
9. What does the term 'moment' refer to in the Adam optimizer?
10. Which activation function is non-differentiable at zero?
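The concepts behind these questions can be sketched in plain Python. This is a minimal illustrative reference, assuming the standard textbook definitions; the function names and the hyperparameter defaults (`lr`, `beta1`, `beta2`, `eps`) are illustrative choices, not taken from the source.

```python
import math

# tanh: zero-centred, output range (-1, 1)  (Q1)
def tanh(x):
    return math.tanh(x)

# ReLU: non-differentiable at x = 0 (Q10); outputs zero for all
# negative inputs, so a unit stuck in the negative regime receives
# zero gradient and stops learning -- the "dying ReLU" problem (Q2, Q4)
def relu(x):
    return max(0.0, x)

# Sigmoid: output in (0, 1), not zero-centred; saturates for large |x|,
# where its gradient approaches zero and training slows down (Q3)
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# SGD update rule (Q8): theta <- theta - lr * gradient
def sgd_step(theta, grad, lr=0.1):
    return theta - lr * grad

# Adam step (Q5-Q9): tracks a first moment m (running mean of
# gradients, i.e. momentum) and a second moment v (running mean of
# squared gradients), giving a per-parameter adaptive learning rate.
def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

For self-checking answers: `tanh(0.0)` is `0.0` (zero-centred), `relu(-2.0)` is `0.0` (the zero-gradient region behind dying ReLU), and a single `adam_step` with a positive gradient moves `theta` downward by roughly `lr`, regardless of the gradient's raw magnitude.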