Activation Functions & Optimization
1. Which activation function is zero-centred with output range (-1, 1)?
2. What is a potential drawback of using ReLU activation?
3. Why can Sigmoid activation slow down training?
4. What happens to a neuron affected by the 'dying ReLU' problem?
5. Which optimizer combines momentum and adaptive learning rates?
6. How does the Adam optimizer differ from standard SGD?
7. What is the main advantage of using adaptive optimizers like Adam?
8. What is the parameter update rule in Stochastic Gradient Descent?
9. What does the term 'moment' refer to in the Adam optimizer?
10. Which activation function is non-differentiable at zero?
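The concepts behind these questions can be sketched in plain Python. This is a minimal illustrative reference, assuming the standard textbook definitions; the function names and the hyperparameter defaults (`lr`, `beta1`, `beta2`, `eps`) are illustrative choices, not taken from the source.

```python
import math

# tanh: zero-centred, output range (-1, 1)  (Q1)
def tanh(x):
    return math.tanh(x)

# ReLU: non-differentiable at x = 0 (Q10); outputs zero for all
# negative inputs, so a unit stuck in the negative regime receives
# zero gradient and stops learning -- the "dying ReLU" problem (Q2, Q4)
def relu(x):
    return max(0.0, x)

# Sigmoid: output in (0, 1), not zero-centred; saturates for large |x|,
# where its gradient approaches zero and training slows down (Q3)
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# SGD update rule (Q8): theta <- theta - lr * gradient
def sgd_step(theta, grad, lr=0.1):
    return theta - lr * grad

# Adam step (Q5-Q9): tracks a first moment m (running mean of
# gradients, i.e. momentum) and a second moment v (running mean of
# squared gradients), giving a per-parameter adaptive learning rate.
def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

For self-checking answers: `tanh(0.0)` is `0.0` (zero-centred), `relu(-2.0)` is `0.0` (the zero-gradient region behind dying ReLU), and a single `adam_step` with a positive gradient moves `theta` downward by roughly `lr`, regardless of the gradient's raw magnitude.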