Vanishing Gradient Problem in RNN: Brief Overview


The vanishing gradient problem in RNN occurs when the derivative of the loss function with respect to the weight parameters becomes very small. This leads to a weight change of almost zero in the initial layers of the network. Once the weights of those layers stop updating, the loss function stops improving and the optimizer fails to converge to a global minimum. This overall situation is the vanishing gradient problem in RNN.
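To see this concretely in an RNN setting, here is a minimal NumPy sketch (the sequence length, hidden size, and weight scale are illustrative assumptions) that backpropagates through a toy sigmoid recurrence and prints how the gradient norm shrinks as we move back toward the earlier time steps:

```python
import numpy as np

rng = np.random.default_rng(0)

T, hidden = 30, 8                           # sequence length and hidden size (illustrative)
W = rng.normal(0, 0.5, (hidden, hidden))    # recurrent weights (illustrative scale)
x = rng.normal(size=(T, hidden))            # inputs already projected to hidden size

# forward pass with a sigmoid recurrence: h[t+1] = sigmoid(W h[t] + x[t])
h = np.zeros((T + 1, hidden))
for t in range(T):
    h[t + 1] = 1.0 / (1.0 + np.exp(-(W @ h[t] + x[t])))

# backpropagation through time: the gradient of the last hidden state
# with respect to earlier hidden states shrinks as we go further back
grad = np.ones(hidden)
for t in reversed(range(T)):
    s = h[t + 1]
    grad = W.T @ (grad * s * (1.0 - s))     # chain rule through sigmoid and W
    if t % 5 == 0:
        print(f"step {t:2d}  ||grad|| = {np.linalg.norm(grad):.2e}")
```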

 

Why is the Activation Function the Main Factor in the Vanishing Gradient Problem in RNN?

As we all know, the derivative of the sigmoid function ranges between (0, 0.25). Backpropagation follows the chain rule, where we multiply the derivatives of successive layers with each other. The deeper this multiplication goes toward the initial layers, the more the product tends to zero. This causes the vanishing gradient problem in a Recurrent Neural Network.

The derivative of the sigmoid function: σ′(x) = σ(x)(1 − σ(x)), which reaches its maximum value of 0.25 at x = 0.
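As a quick check, here is a minimal NumPy sketch (the function names are just illustrative) confirming that this derivative never exceeds 0.25:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # derivative of sigmoid: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-10, 10, 1001)
print(sigmoid_prime(x).max())   # ~0.25, reached at x = 0
```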

 

Simplification (Numerical Approach) –

Suppose we are at the 5th layer of backpropagation. As per the chain rule, we multiply the derivative 5 times (each any value between 0 and 0.25):

0.21 × 0.11 × 0.06 × 0.11 × 0.15 = 0.000022869 (assuming arbitrary values between 0 and 0.25 for each derivative)

This is already almost negligible at just 5 layers. If we have many more deep layers, the product gets even closer to zero.
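The same idea as a tiny sketch: repeatedly multiplying derivatives drawn from (0, 0.25) drives the product toward zero as the number of layers grows (the layer counts below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

for n_layers in (5, 10, 20, 50):
    # sample one derivative per layer from (0, 0.25), as in the example above
    derivatives = rng.uniform(0.0, 0.25, size=n_layers)
    print(n_layers, np.prod(derivatives))
# the product shrinks rapidly; by ~50 layers it is effectively zero
```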

 

Is ReLU a Solution for the Vanishing Gradient Problem?

Technically, ReLU solves the vanishing gradient problem because its derivative is 1 for positive inputs, so the weight changes in the initial layers no longer shrink to something negligible. However, it can create the exploding gradient problem.

The derivative of the ReLU activation function: f′(x) = 1 for x > 0 and 0 for x ≤ 0.

Actually, the exploding gradient occurs because when the derivative is always 1, the backpropagation step can produce large weight changes in every layer at each epoch. Hence the optimizer will not converge easily to the global minimum.
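A minimal sketch of the exploding side, under the assumption that the ReLU derivative is 1 on the active path, so the backpropagated gradient is essentially a product of recurrent weights; with weights larger than 1 that product blows up:

```python
import numpy as np

def relu_prime(x):
    # derivative of ReLU: 1 for x > 0, else 0
    return (x > 0).astype(float)

n_layers = 50
weight = 1.5                        # illustrative recurrent weight > 1
activation = np.ones(n_layers)      # assume all units are active, so relu_prime = 1

gradient = np.prod(relu_prime(activation) * weight)
print(gradient)                     # ~ 1.5 ** 50, an exploding gradient
```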

 

Vanishing Gradient Problem in RNN: Solutions –

  1. Weight initialization done in such a way that it reduces the chances of a vanishing gradient.
  2. Using the LSTM variant of RNN (Recurrent Neural Network). A short sketch of both ideas follows this list.
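A minimal sketch of both ideas, assuming PyTorch is available (the layer sizes and the Xavier initializer choice are illustrative): Glorot/Xavier initialization for a plain RNN's weights, and an nn.LSTM layer as the drop-in variant:

```python
import torch
import torch.nn as nn

# 1. Careful weight initialization (Xavier/Glorot) for a vanilla RNN
rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
for name, param in rnn.named_parameters():
    if "weight" in name:
        nn.init.xavier_uniform_(param)

# 2. LSTM variant of RNN: gating lets gradients flow over long sequences
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(8, 20, 16)          # (batch, sequence length, features)
out, (h, c) = lstm(x)
print(out.shape)                    # torch.Size([8, 20, 32])
```

The LSTM's gates allow gradients to flow through the cell state across many time steps, which is why it is the standard fix in practice.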

 

Conclusion –

Well, LSTM exists to overcome the vanishing gradient problem. I hope this article has cleared up the topic for you. Please comment below!

 

Thanks 

Data Science Learner Team


 