Attention Residuals

Attention Residuals (often abbreviated as AttnRes) are an architectural modification to Transformer-based large language models (LLMs), designed to mitigate a problem that pre-normalization (pre-norm) causes in deep networks: the phenomenon known as pre-norm dilution.

Core Problem: Pre-Norm Dilution

In a standard pre-norm Transformer, each attention or feed-forward sublayer normalizes its input, and the sublayer's output is then added back to the unnormalized residual stream: x + f(Norm(x)). Because the stream accumulates every sublayer's output without being rescaled, its magnitude tends to grow with depth, while each new sublayer output stays at roughly the same scale. In very deep networks this leads to “dilution”: the signal from the attention mechanism becomes negligible compared to the residual path, so attention blocks in deeper layers have progressively less influence on the representation. This limits the effective depth of the model and the complexity of the patterns it can learn in later layers.
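
The dilution effect can be illustrated with a small numerical sketch. This is a toy model, not the setup used by any particular paper: the function names (layer_norm, prenorm_update_ratios) are hypothetical, and each attention or feed-forward sublayer is replaced by a fixed-scale random linear map, which is an assumption made purely to keep the example short. The code tracks how large each sublayer's update is relative to the residual stream it is added to.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (no learned gain or bias, for simplicity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def prenorm_update_ratios(depth=64, d_model=256, seed=0):
    """Return ||f(Norm(x))|| / ||x|| at each layer of a toy pre-norm stack."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d_model)
    ratios = []
    for _ in range(depth):
        # Assumption: a random linear map stands in for an attention or
        # feed-forward sublayer; scaling by 1/sqrt(d_model) keeps its
        # output at roughly the same norm as its (normalized) input.
        w = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        update = w @ layer_norm(x)
        ratios.append(np.linalg.norm(update) / np.linalg.norm(x))
        # Pre-norm residual: the stream itself is never normalized.
        x = x + update
    return np.array(ratios)

ratios = prenorm_update_ratios()
print(f"relative update size at layer 1:  {ratios[0]:.3f}")
print(f"relative update size at layer 64: {ratios[-1]:.3f}")
```

Under these assumptions the updates are close to mutually uncorrelated, so the squared norm of the stream grows roughly linearly with depth and the relative size of each update shrinks roughly like one over the square root of the layer index. Trained networks will not follow this idealized curve exactly, but the trend is the dilution described above.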

Solution: Attention Residuals

Proposed by the Kimi Team (Moonshot AI), Attention Residuals introduce a mechanism that preserves the magnitude and influence of the attention output in deeper layers. By restructuring how residuals are applied, or by scaling the attention outputs relative to the residual stream, the architecture ensures that attention continues to contribute meaningfully to the final representation, even in extremely deep large language models.

Key Developments