Mastering Gradient Descent: A Deep Dive into RMSprop and Adam Optimizers
Harnessing the Power of Adaptive Learning for Superior Deep Learning Performance
In the rapidly evolving field of deep learning, the choice of optimization algorithm plays a critical role in determining the efficiency and effectiveness of training neural networks.
RMSprop and Adam optimizers stand out due to their adaptive learning rate capabilities, which address the limitations of traditional gradient descent methods. Understanding these algorithms is essential for researchers and practitioners aiming to enhance model performance, reduce training time, and achieve more accurate results. By studying RMSprop and Adam, we gain valuable insights into the mechanics of gradient optimization, enabling us to build more robust and scalable AI systems.
RMSprop, which stands for Root Mean Square Propagation, is an adaptive learning rate optimization algorithm designed specifically for training neural networks. Geoffrey Hinton proposed it in a lecture of his Coursera course "Neural Networks for Machine Learning" to address the problem of diminishing learning rates seen in algorithms like AdaGrad.
Discovery and Purpose
RMSprop was developed as a solution to improve upon the limitations of AdaGrad, particularly its rapid decrease in learning rates. While AdaGrad adjusts the learning rate based on the gradient history, it tends to make the learning rate too small, causing slow convergence or stalling when dealing with deep neural networks. RMSprop introduces a mechanism to maintain a more balanced and adaptive learning rate over time, preventing it from becoming excessively small.
How RMSprop Works
RMSprop modifies the learning rate for each parameter based on the average of recent magnitudes of the gradients for that parameter. The main components of RMSprop are as follows:
1. Exponentially Decaying Average of Squared Gradients:
RMSprop maintains a moving average of the squared gradients:
\( E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2 \)
where \(g_t\) is the gradient at step \(t\) and \(\gamma\) is the decay rate (typically around 0.9).
2. Adaptive Learning Rate:
The parameter update rule in RMSprop is:
\( \theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t \)
where \(\eta\) is the learning rate and \(\epsilon\) is a small constant added for numerical stability.
By using a moving average of squared gradients, RMSprop adapts the learning rate for each parameter, helping the optimization process to be more stable and effective.
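To make this concrete, here is a minimal NumPy sketch of a single RMSprop step; the function name, argument names, and hyperparameter values are illustrative rather than taken from any particular library.

import numpy as np

def rmsprop_step(theta, grad, sq_avg, lr=0.001, gamma=0.9, eps=1e-8):
    # Decay the running average of squared gradients: E[g^2]_t.
    sq_avg = gamma * sq_avg + (1 - gamma) * grad**2
    # Scale each parameter's step by the root mean square of its recent gradients.
    theta = theta - lr * grad / np.sqrt(sq_avg + eps)
    return theta, sq_avg

Calling this repeatedly with the current gradient keeps a per-parameter estimate of gradient magnitude, which is what prevents the step size from collapsing the way AdaGrad's accumulated sum does.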
Improvements Over Other Optimizers
1. Momentum:
- Momentum adds a fraction of the previous update to the current update, helping to smooth the gradient descent trajectory.
- RMSprop improves upon Momentum by adapting the learning rate for each parameter based on the moving average of squared gradients, handling gradients of varying scale more effectively.
2. Nesterov Accelerated Gradient (NAG):
- NAG is a variant of Momentum that anticipates the future position of the parameters, leading to more informed updates.
- RMSprop focuses on adjusting the learning rate dynamically for each parameter, whereas NAG emphasizes velocity and gradient anticipation.
3. AdaGrad:
- AdaGrad scales the learning rate by the inverse of the sum of squared gradients, which can lead to excessively small learning rates.
- RMSprop overcomes this by using a moving average of squared gradients instead of their sum, preventing the rapid decay of learning rates.
4. Adadelta:
- Adadelta is an extension of AdaGrad that seeks to reduce the aggressive, monotonically decreasing learning rate by restricting the window of accumulated past gradients to a fixed size.
- RMSprop similarly adjusts the learning rate based on recent gradient magnitudes but does so more straightforwardly with a simple exponential decay average.
Advantages of RMSProp
- Adaptive Learning Rates: RMSprop adjusts the learning rate for each parameter based on recent gradient magnitudes, making it more adaptive to different scales and improving convergence.
- Stabilizes Training: Using an exponentially decaying average of squared gradients, RMSprop prevents the learning rate from becoming too small too quickly, a common issue in AdaGrad.
- Simplicity and Effectiveness: RMSprop is relatively simple to implement and often works well in practice, making it a popular choice for training deep neural networks.
ADAM Optimizer
ADAM (Adaptive Moment Estimation) is an optimization algorithm widely used in training deep learning models. It combines the advantages of two other popular optimization techniques: AdaGrad and RMSProp. Adam computes an individual adaptive learning rate for each parameter from estimates of the first and second moments of its gradients. Let’s delve into the details, from its discovery to its improvements over other optimizers.
Discovery of ADAM
Adam was introduced by Diederik P. Kingma and Jimmy Ba in their 2014 paper titled “Adam: A Method for Stochastic Optimization”. The algorithm is designed to efficiently handle sparse gradients on noisy problems, making it particularly suitable for deep learning and large-scale machine learning tasks.
How does ADAM work?
Adam combines the benefits of two other optimization algorithms:
1. AdaGrad: This algorithm adapts the learning rate based on the frequency of updates for each parameter. Parameters with infrequent updates get a relatively larger learning rate, while those with frequent updates get a smaller learning rate. However, AdaGrad can suffer from rapid decay in the learning rate, making it inefficient in the later stages of training.
2. RMSProp: RMSProp also adjusts the learning rate for each parameter, but it uses an exponentially decaying average of squared past gradients to scale the learning rate. This addresses the rapid decay issue of AdaGrad but doesn’t utilize momentum, which can help in faster convergence.
Adam integrates the ideas from both AdaGrad and RMSProp and also incorporates momentum:
- Momentum: It helps accelerate gradient vectors in the right direction, leading to faster convergence.
Mathematical Formulation
The algorithm maintains two moving averages for each parameter: the first moment (mean) and the second moment (uncentered variance). The update rule for each parameter \(\theta\) at step \(t\) is as follows:
1. Gradient Calculation:
\( g_t = \nabla_\theta f_t(\theta_{t-1}) \)
2. First Moment Estimate (mean of the gradients):
\( m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \)
3. Second Moment Estimate (uncentered variance of the gradients):
\( v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \)
4. Bias Correction:
\( \hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \dfrac{v_t}{1 - \beta_2^t} \)
5. Parameter Update:
\( \theta_t = \theta_{t-1} - \dfrac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \)
Here \(\eta\) is the learning rate, \(\beta_1\) and \(\beta_2\) are the decay rates of the moment estimates (typically 0.9 and 0.999), and \(\epsilon\) is a small constant for numerical stability.
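Translating these five steps into a minimal NumPy sketch (the names and default-style hyperparameters below are illustrative, not tied to any specific library):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update the biased first and second moment estimates.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Correct the bias introduced by initializing m and v at zero (t starts at 1).
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Take a per-parameter step scaled by the corrected second moment.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v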
Improvements Over Other Optimizers
1. Adam vs. Momentum:
- Momentum uses only the first moment (mean of the gradients) to update the parameters. While it helps in accelerating the convergence, it can overshoot minima.
- Adam uses both the first moment and the second moment (variance of the gradients), which provides a more robust convergence by considering both the magnitude and variance of the gradients.
2. Adam vs. NAG (Nesterov Accelerated Gradient):
- NAG improves upon momentum by looking ahead to where the parameters will be and computing the gradient there, which can lead to more informed updates.
- Adam incorporates adaptive learning rates and momentum, offering a more nuanced approach to adjusting the learning rates and accounting for gradient variance, which can lead to better performance on a variety of tasks.
3. Adam vs. Adadelta:
- Adadelta is an extension of AdaGrad that seeks to reduce its aggressive, monotonically decreasing learning rate by using a moving window of gradient updates to adjust the learning rates.
- Adam incorporates both the adaptive learning rate approach of Adadelta and the momentum, leading to a more stable and efficient optimization process.
4. Adam vs. AdaGrad:
- AdaGrad adapts learning rates but can have diminishing learning rates over time, making it less effective for long training sessions.
- Adam overcomes this by using moving averages of the gradients and their squares, which stabilizes the learning rates and improves convergence over longer periods.
Advantages of Adam
- Efficient: Works well with large datasets and high-dimensional parameter spaces.
- Adaptive: Adjusts learning rates individually for each parameter, leading to better performance.
- Robust: Combines the advantages of AdaGrad and RMSProp with momentum, leading to faster convergence and more robust performance.
Disadvantages of Adam
- Memory Usage: Requires additional memory to store the first and second moments.
- Hyperparameters: More hyperparameters to tune (learning rate, β₁, β₂, ϵ), which can add complexity; the sketch below shows how these map onto a common implementation.
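As a concrete reference point, here is how these hyperparameters appear in the Keras Adam constructor; the values shown are the Keras defaults, not settings tuned for any particular problem.

from tensorflow.keras.optimizers import Adam

optimizer = Adam(
    learning_rate=0.001,  # step size
    beta_1=0.9,           # decay rate for the first moment estimate
    beta_2=0.999,         # decay rate for the second moment estimate
    epsilon=1e-7,         # small constant for numerical stability
)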
Practical Implementation
Now let’s see how these algorithms affect training in practice.
We use the wine quality dataset from Kaggle to compare the optimizers.
We predict the wine quality score by framing the problem as a regression task with a linear output activation.
The reason for this is that a linear activation allows the model to output any real number, which is suitable for regression tasks where the target variable can take on a wide range of values. Since we want to predict wine quality on a scale of 1–10, a linear activation lets the model produce values in this range without any constraints.
Please check out the entire code on Kaggle
Stochastic Gradient Descent (SGD)
We start with SGD to act as the control for this comparison.
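The exact code is in the Kaggle notebook; the sketch below is a minimal reconstruction consistent with the model summary that follows. The 11-feature input, the ReLU hidden activations, the batch size, and the X_train/y_train/X_val/y_val names are assumptions for illustration, not taken from the notebook.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import SGD

# A 16-8-4-1 dense network with a linear output unit for the quality score.
model = Sequential([
    Input(shape=(11,)),             # 11 physicochemical features (assumption)
    Dense(16, activation="relu"),
    Dense(8, activation="relu"),
    Dense(4, activation="relu"),
    Dense(1, activation="linear"),
])
model.compile(optimizer=SGD(), loss="mse", metrics=["mse"])
history = model.fit(X_train, y_train,                  # placeholder data splits
                    validation_data=(X_val, y_val),
                    epochs=50, batch_size=32)

The RMSprop and Adam runs further below reuse the same architecture; only the optimizer argument changes (for example optimizer=RMSprop() or optimizer=Adam()).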
Model: "sequential_26"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_104 (Dense) │ (None, 16) │ 192 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_105 (Dense) │ (None, 8) │ 136 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_106 (Dense) │ (None, 4) │ 36 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_107 (Dense) │ (None, 1) │ 5 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 369 (1.44 KB)
Trainable params: 369 (1.44 KB)
Non-trainable params: 0 (0.00 B)
Epoch 1/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 19.2480 - mse: 19.2480 - val_loss: 0.6383 - val_mse: 0.6343
Epoch 2/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.5786 - mse: 0.5786 - val_loss: 0.5619 - val_mse: 0.5552
Epoch 3/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.5670 - mse: 0.5670 - val_loss: 0.6795 - val_mse: 0.6693
...
...
...
Epoch 47/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4955 - mse: 0.4955 - val_loss: 0.4814 - val_mse: 0.4745
Epoch 48/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4357 - mse: 0.4357 - val_loss: 0.4451 - val_mse: 0.4378
Epoch 49/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4621 - mse: 0.4621 - val_loss: 0.4496 - val_mse: 0.4425
Epoch 50/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4534 - mse: 0.4534 - val_loss: 0.4442 - val_mse: 0.4370
Let’s see how the model optimizes the losses while training over the data.
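The loss curve from the original notebook is not reproduced here; a minimal sketch of how such a curve can be drawn from the History object returned by fit() (the history variable in the sketch above) is:

import matplotlib.pyplot as plt

# Training vs. validation loss across the 50 epochs.
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.legend()
plt.show()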
RMSprop Optimizer
Model: "sequential_27"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_108 (Dense) │ (None, 16) │ 192 │
├─────────────────────────────────┼──────────────────────┼───────────────┤
│ dense_109 (Dense) │ (None, 8) │ 136 │
├─────────────────────────────────┼──────────────────────┼───────────────┤
│ dense_110 (Dense) │ (None, 4) │ 36 │
├─────────────────────────────────┼──────────────────────┼───────────────┤
│ dense_111 (Dense) │ (None, 1) │ 5 │
└─────────────────────────────────┴──────────────────────┴───────────────┘
Total params: 369 (1.44 KB)
Trainable params: 369 (1.44 KB)
Non-trainable params: 0 (0.00 B)
Epoch 1/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 32.2893 - mse: 32.2893 - val_loss: 30.6567 - val_mse: 30.6527
Epoch 2/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 29.6324 - mse: 29.6324 - val_loss: 27.4650 - val_mse: 27.4626
Epoch 3/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 26.4368 - mse: 26.4368 - val_loss: 23.0276 - val_mse: 23.0236
...
...
...
Epoch 47/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4268 - mse: 0.4268 - val_loss: 0.4348 - val_mse: 0.4295
Epoch 48/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4380 - mse: 0.4380 - val_loss: 0.4360 - val_mse: 0.4306
Epoch 49/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4102 - mse: 0.4102 - val_loss: 0.4379 - val_mse: 0.4329
Epoch 50/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4349 - mse: 0.4349 - val_loss: 0.4307 - val_mse: 0.4252
Let’s see how the model optimizes the losses over the training period.
Finally, we turn to the Adam optimizer.
Adam Optimizer
Model: "sequential_28"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_112 (Dense) │ (None, 16) │ 192 │
├──────────────────────────────┼──────────────────────┼──────────────┤
│ dense_113 (Dense) │ (None, 8) │ 136 │
├──────────────────────────────┼──────────────────────┼──────────────┤
│ dense_114 (Dense) │ (None, 4) │ 36 │
├──────────────────────────────┼──────────────────────┼──────────────┤
│ dense_115 (Dense) │ (None, 1) │ 5 │
└──────────────────────────────┴──────────────────────┴──────────────┘
Total params: 369 (1.44 KB)
Trainable params: 369 (1.44 KB)
Non-trainable params: 0 (0.00 B)
Epoch 1/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - loss: 29.2911 - mse: 29.2911 - val_loss: 27.9455 - val_mse: 27.9505
Epoch 2/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 27.2706 - mse: 27.2706 - val_loss: 24.2859 - val_mse: 24.2967
Epoch 3/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 23.0945 - mse: 23.0945 - val_loss: 19.2644 - val_mse: 19.2776
...
...
...
Epoch 47/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4043 - mse: 0.4043 - val_loss: 0.4262 - val_mse: 0.4207
Epoch 48/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4306 - mse: 0.4306 - val_loss: 0.4214 - val_mse: 0.4162
Epoch 49/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4206 - mse: 0.4206 - val_loss: 0.4209 - val_mse: 0.4157
Epoch 50/50
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.4258 - mse: 0.4258 - val_loss: 0.4184 - val_mse: 0.4130
Let's look at the trend of loss reduction over the training period.
Observations:
- The SGD model takes longer to optimize, judging by its loss curve, but reaches a good result in the end. The gap between the training and validation results fluctuates considerably, indicating overfitting tendencies.
- The RMSprop model achieves a comparable result with less effort and reduces the loss even further than the SGD model.
- The Adam optimizer goes a step further, improving on RMSprop and giving slightly better results, although the curves show the difference is small.
- MAE (Mean Absolute Error) is preferred here because it directly measures the average magnitude of prediction errors, treating all errors equally, which suits an ordinal target such as a 1–10 quality score; a short sketch of the computation follows this list. In contrast, R² (coefficient of determination) measures the variance explained by the model and is better suited to purely continuous regression targets.
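As a small illustration of the metric (the arrays below are hypothetical values, not taken from the notebook):

import numpy as np

y_true = np.array([5, 6, 7, 5, 6])            # hypothetical quality scores
y_pred = np.array([5.2, 5.8, 6.5, 5.4, 6.1])  # hypothetical model predictions

mae = np.mean(np.abs(y_true - y_pred))        # average absolute error
print(f"MAE: {mae:.3f}")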
Conclusion
Adam has become one of the most popular optimization algorithms for deep learning due to its adaptability and robustness. By incorporating elements of both AdaGrad and RMSProp along with momentum, it addresses many of the limitations of previous optimizers and provides a more efficient and effective way to train deep learning models.
Overall, RMSprop provides a balanced approach to learning rate adaptation: like AdaGrad and Adadelta it adapts the learning rate for each parameter, but it avoids their aggressive decay, and it pairs well with momentum-based methods, making it a robust optimizer for a variety of neural network architectures.