SGD with momentum: How is it different from SGD?

SGD with momentum is an optimizer that reduces the impact of noise on convergence to the optimal weights. The momentum term mainly helps to reduce the convergence time. Adding momentum to SGD overcomes the major shortcoming of SGD relative to Batch Gradient Descent (noisy updates) without losing its advantages.

SGD without momentum –

To understand SGD with momentum, we will first look at SGD without momentum.

In ordinary SGD, the weights are updated with the formula below, where $\eta$ is the learning rate and $L$ is the loss.

$w(new)= w(old)- \eta \frac{\partial L}{\partial w(old)}$

In this update, the gradient is estimated from a small batch or a single data point, so it is noisy and fluctuates randomly from step to step. This increases the convergence time.
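To make the update rule concrete, here is a minimal sketch of plain SGD in Python. The loss L(w) = 0.5 * w**2, the Gaussian noise standing in for small-batch sampling, and the names `noisy_grad` and `sgd_step` are all assumptions chosen for illustration, not part of the article.

```python
import random


def noisy_grad(w, rng, noise_std=1.0):
    """Gradient of the assumed loss L(w) = 0.5 * w**2, plus Gaussian noise.

    The noise is a stand-in for the randomness of a small batch.
    """
    return w + rng.gauss(0.0, noise_std)


def sgd_step(w_old, grad, lr):
    """Plain SGD update: w(new) = w(old) - eta * dL/dw(old)."""
    return w_old - lr * grad


rng = random.Random(0)  # fixed seed so the run is repeatable
w = 5.0                 # arbitrary starting weight
history = [w]
for _ in range(200):
    w = sgd_step(w, noisy_grad(w, rng), lr=0.1)
    history.append(w)

print(f"final w = {w:.3f}")
```

Printing `history` should show w drifting toward 0 and then jittering around it rather than settling, which is the fluctuation described above.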

SGD with momentum –

The objective of momentum is to give the optimizer a more stable direction toward convergence. To do this, we add an exponential moving average to the SGD weight update formula.

[Figure: weight update with momentum]
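One common way to write this update, with $V_t$ standing for the momentum term (the symbol is an assumption used here for illustration), is:

$w(new)= w(old)- \eta V_t$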

Here we have added the momentum factor. Now let’s see how this momentum component is calculated.
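A common form of this calculation, assuming $V_t$ denotes the momentum term and $\beta$ is a decay factor between 0 and 1 (both symbols are assumptions for illustration; $\beta$ is often set near 0.9), is:

$V_t = \beta V_{t-1} + (1-\beta) \frac{\partial L}{\partial w(old)}$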

This is just an exponential moving average of the derivative of the loss with respect to the weights. It stabilizes the path to convergence. Any moving average blends the previous data points into the current one, so a single noisy gradient has much less influence on the update. Now let’s see the convergence diagram.
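The smoothing effect of an exponential moving average can be sketched in a few lines of Python. The function name `ema`, the decay factor `beta=0.9`, and the gradient sequence below are hypothetical values picked for illustration.

```python
def ema(values, beta, v0=0.0):
    """Return the running exponential moving average of `values`.

    Each step computes v = beta * v + (1 - beta) * value, starting from v0.
    """
    v = v0
    out = []
    for x in values:
        v = beta * v + (1.0 - beta) * x
        out.append(v)
    return out


# Hypothetical gradient sequence: steady at 1.0 with one noisy spike of 10.0.
grads = [1.0] * 5 + [10.0] + [1.0] * 5
smoothed = ema(grads, beta=0.9, v0=1.0)

for g, v in zip(grads, smoothed):
    print(f"gradient = {g:5.1f}   moving average = {v:.3f}")
```

With these values the spike of 10.0 lifts the average only to about 1.9, and its effect then decays over the following steps.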

Here we can see that the momentum factor reduces the fluctuations in the weight updates.
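The reduction in fluctuations can also be checked numerically. The sketch below is an illustration under stated assumptions, not the article's experiment: it uses a toy loss L(w) = 0.5 * w**2, Gaussian gradient noise, an exponential-moving-average momentum term with an assumed decay factor `beta`, and a hypothetical helper named `run`.

```python
import random
import statistics


def noisy_grad(w, rng, noise_std=1.0):
    """Gradient of the assumed loss L(w) = 0.5 * w**2, plus Gaussian noise."""
    return w + rng.gauss(0.0, noise_std)


def run(beta, steps=2000, lr=0.1, seed=0):
    """Run SGD with an EMA momentum term and return the per-step weight changes.

    With beta = 0.0 the momentum term equals the raw gradient (plain SGD).
    """
    rng = random.Random(seed)
    w, v = 5.0, 0.0
    deltas = []
    for _ in range(steps):
        g = noisy_grad(w, rng)
        v = beta * v + (1.0 - beta) * g  # exponential moving average of gradients
        w_new = w - lr * v               # weight update using the momentum term
        deltas.append(w_new - w)
        w = w_new
    return deltas


plain = run(beta=0.0)
momentum = run(beta=0.9)

# Compare the spread of the updates over the last 1000 steps,
# after both runs have reached the neighborhood of the minimum.
spread_plain = statistics.pstdev(plain[-1000:])
spread_momentum = statistics.pstdev(momentum[-1000:])

print(f"plain SGD update spread:    {spread_plain:.4f}")
print(f"momentum SGD update spread: {spread_momentum:.4f}")
```

Under these assumptions the momentum run should show a noticeably smaller spread in its per-step updates, which matches the behavior described above.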

Conclusion –

Stochastic Gradient Descent and mini-batch gradient descent are more suitable than Batch Gradient Descent in real scenarios. However, because of noise and the local minima problem, they take longer to converge. With momentum, these optimizers become more efficient. I have tried to keep this article lean and informative. If you have any doubts about any of the points we discussed, please comment below. We will surely get back to you.

Thanks

Data Science Learner Team