Deep Transformer Networks

Deep Transformer Networks are Transformer architectures with substantially more layers than standard configurations, built to increase representational capacity. They stack many self-attention and feed-forward sub-layers to model complex dependencies. The main obstacles to scaling depth are vanishing gradients and information dilution, both of which call for specialized architectural changes.

Core Challenges in Depth Scaling

  • Gradient Vanishing/Exploding: Even with standard residual connections, gradients can shrink or grow as they pass through very deep stacks, which destabilizes training.
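  • Pre-Norm Dilution: In Pre-Layer Normalization architectures, the residual stream accumulates the output of every earlier layer, so each new layer contributes a smaller share of the signal as depth increases. The signal-to-noise ratio can degrade as a result, leading to “over-smoothing” or loss of gradient magnitude.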
  • Computational Complexity: Attention cost grows quadratically with sequence length and total cost grows linearly with depth, so deep models demand efficient inference strategies.
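
To make the complexity point concrete, the sketch below estimates multiply-accumulate counts for a stack of standard Transformer layers. It is a rough approximation, not a profiler: the function names `layer_macs` and `stack_macs` are hypothetical, the feed-forward width of 4 × d_model is an assumed default, and softmax, biases, and normalization costs are ignored.

```python
def layer_macs(seq_len, d_model, ffn_mult=4):
    """Approximate multiply-accumulates for one Transformer layer.

    Assumes a feed-forward hidden width of ffn_mult * d_model (an
    illustrative default) and ignores softmax, biases, and layer norm.
    """
    return {
        # Q, K, V and output projections: four (d_model x d_model) maps per token.
        "projections": 4 * seq_len * d_model ** 2,
        # QK^T scores plus the attention-weighted sum of V: quadratic in seq_len.
        "attention": 2 * seq_len ** 2 * d_model,
        # Two linear maps: d_model -> ffn_mult * d_model -> d_model.
        "feed_forward": 2 * ffn_mult * seq_len * d_model ** 2,
    }


def stack_macs(num_layers, seq_len, d_model, ffn_mult=4):
    """Total cost of the stack: per-layer cost times depth (linear in depth)."""
    return num_layers * sum(layer_macs(seq_len, d_model, ffn_mult).values())
```

Under these assumptions, doubling `num_layers` doubles the total, while doubling `seq_len` quadruples the `attention` term and only doubles the other two, which is why long sequences and deep stacks compound each other's cost.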

Architectural Innovations

Standard Deepening Techniques

  • Pre-Layer Normalization: Stabilizes training by normalizing the input to each sub-layer rather than its output.
  • Residual Connections: Allow gradients to flow directly through layers, mitigating degradation.
  • Learning Rate Warmup: Gradually increases the learning rate to stabilize early training in deep networks.
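
The three techniques above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions, not a reference implementation: `layer_norm` omits the learnable gain and bias, `sublayer` stands in for an attention or feed-forward module, and the names `pre_ln_block` and `linear_warmup` are hypothetical. The warmup shown is a simple linear ramp followed by a constant rate; other schedules are common.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_block(x, sublayer):
    """Pre-LN residual step: x + sublayer(layer_norm(x)).

    The sub-layer sees a normalized input, while the residual path
    carries x forward unchanged.
    """
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]


def linear_warmup(step, warmup_steps, peak_lr):
    """Learning rate that rises linearly to peak_lr, then stays constant."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)
```

If `sublayer` returns all zeros, `pre_ln_block` returns its input unchanged, which shows the identity path that the residual connection provides. Calling `linear_warmup(step, 4000, 3e-4)` inside a training loop would ramp the rate over the first 4000 steps; both numbers are placeholders rather than recommended values.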

Recent Breakthroughs: Attention Residuals

References