Vanishing Gradient Problem

The vanishing gradient problem occurs during the training of deep neural networks when gradients computed via backpropagation become exponentially smaller as they propagate backward through many layers. Since weight updates are proportional to these gradients, layers near the input receive such minute gradient signals that their weights update negligibly, effectively halting learning in those layers.

Root Cause

The problem arises from the chain rule of calculus used in backpropagation. When computing gradients through many layers, partial derivatives are multiplied together. If individual layer derivatives are small (typically between 0 and 1, as with sigmoid or tanh activation functions), successive multiplications produce progressively smaller values. In networks with dozens or hundreds of layers, these products can become so small that they approach machine precision limits, rendering gradient updates ineffective.

Impact and Solutions

Early layers in deep networks become difficult to train because their weight updates stagnate. This particularly affects recurrent neural networks processing long sequences, where gradients must backpropagate through many time steps. Common solutions include using activation functions with gentler gradients like ReLU, employing batch normalization to stabilize gradient flow, initializing weights carefully, and using LSTM or GRU architectures that include mechanisms to preserve gradients across long sequences.

Source Notes