Attention Residuals
Attention Residuals (often abbreviated as AttnRes) are an architectural modification to Transformer-based large language models, designed to mitigate a problem that pre-normalization (pre-norm) causes in deep networks: pre-norm dilution.
Core Problem: Pre-Norm Dilution
In standard pre-norm Transformer architectures, each sublayer (attention or feed-forward) normalizes its input, and the sublayer's output is then added back to the unnormalized residual stream: x_{l+1} = x_l + F(Norm(x_l)). Every sublayer output is added with the same fixed weight, so the magnitude of the residual stream tends to grow with depth, while each sublayer, seeing a normalized input, produces an output of roughly constant scale. In very deep networks, this can lead to “dilution”: the signal from the attention mechanism becomes negligible compared to the residual path, so attention blocks in deeper layers have less and less influence. This limits the effective depth of the model and the complexity of patterns it can learn in later stages.
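The dilution effect can be illustrated with a small simulation. The sketch below is a toy model, not a Transformer: it applies the pre-norm update x + F(Norm(x)) repeatedly, where `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward block (a random signed cyclic shift, chosen only because it keeps the output scale constant). The function names, dimensions, and depth are all assumptions made for illustration.

```python
import math
import random


def l2_norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in x))


def rms_normalize(x):
    """RMSNorm without a learned gain: rescale x to unit root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]


def toy_sublayer(h, rng):
    """Hypothetical stand-in for attention/FFN: a random signed cyclic shift.

    It preserves the norm of h, so the output scale is the same at every depth.
    """
    shift = rng.randrange(len(h))
    return [rng.choice((-1.0, 1.0)) * h[(i + shift) % len(h)]
            for i in range(len(h))]


def simulate_pre_norm(depth=64, dim=64, seed=0):
    """Return (stream_norm, update_norm / stream_norm) after each layer."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    history = []
    for _ in range(depth):
        update = toy_sublayer(rms_normalize(x), rng)  # F(Norm(x))
        x = [a + b for a, b in zip(x, update)]         # x + F(Norm(x))
        history.append((l2_norm(x), l2_norm(update) / l2_norm(x)))
    return history


if __name__ == "__main__":
    for layer, (norm, share) in enumerate(simulate_pre_norm(), start=1):
        if layer in (1, 4, 16, 64):
            print(f"layer {layer:2d}: |x| = {norm:6.2f}, "
                  f"|update|/|x| = {share:.3f}")
```

In this toy setting the update norm is the same at every layer while the stream norm keeps growing, so the ratio of update to stream shrinks with depth. Real sublayers are learned and can partly compensate by producing larger outputs, so the numbers here only indicate the trend.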
Solution: Attention Residuals
Proposed by the Kimi Team (Moonshot AI), Attention Residuals introduce a mechanism that preserves the magnitude and influence of the attention output in deeper layers. By restructuring how residuals are applied, or by scaling the attention outputs relative to the residual stream, the architecture keeps attention a significant contributor to the final representation, even in extremely deep large language models.
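The description above does not spell out the mechanism, so the following is only a hypothetical sketch of one way "restructuring how residuals are applied" could look. Instead of summing all earlier layer outputs with fixed unit weights, a layer takes a softmax-weighted average of them. The names `unit_weight_residual` and `weighted_residual` are illustrative, and the scores are hand-picked constants, whereas a real model would have to compute them. This is not presented as the Kimi Team's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in x))


def unit_weight_residual(layer_outputs):
    """Standard residual accumulation: every earlier output gets weight 1."""
    dim = len(layer_outputs[0])
    return [sum(out[i] for out in layer_outputs) for i in range(dim)]


def weighted_residual(layer_outputs, scores):
    """Hypothetical alternative: softmax-weighted average of earlier outputs.

    The weights are non-negative and sum to 1, so the result's norm never
    exceeds the largest norm among the inputs, however many layers there are.
    """
    weights = softmax(scores)
    dim = len(layer_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, layer_outputs))
            for i in range(dim)]


if __name__ == "__main__":
    # Eight earlier layers that all emit the same unit vector (assumed data).
    outputs = [[1.0, 0.0] for _ in range(8)]
    scores = [0.0] * 8  # hand-picked; equal scores give a plain average
    print("unit-weight sum norm:", l2_norm(unit_weight_residual(outputs)))
    print("weighted average norm:", l2_norm(weighted_residual(outputs, scores)))
```

The design point of the sketch is that a normalized weighting keeps the combined representation at a bounded scale, so no single layer's contribution is swamped merely because many layers came before it.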
Key Developments
- Kimi Team Proposal: The Kimi Team introduced Attention Residuals to address the scaling limits of deep pre-norm architectures.
- Impact: Enables training of deeper models without degrading the attention signal, potentially improving performance on tasks that require complex reasoning or long-range dependency handling.
Related Concepts
- Pre-Normalization
- Residual Connections
- Transformer Architecture
- Large Language Model