Multi-Head Attention
Multi-Head Attention (MHA) is a foundational mechanism in Transformer architectures that allows the model to jointly attend to information from different representation subspaces at different positions. It extends Scaled Dot-Product Attention by projecting inputs into distinct heads, computing attention in parallel, and concatenating the outputs before a final linear transformation.
Formulation
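For an input matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$, learnable weight matrices $W_i^Q$, $W_i^K$, and $W_i^V$ for each head $i = 1, \dots, h$ generate Queries ($Q_i$), Keys ($K_i$), and Values ($V_i$):
$$Q_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V$$
Attention is computed per head using the scaled dot-product attention function, where $d_k$ is the per-head key dimension; the heads are then concatenated and projected by an output matrix $W^O$:
$$\text{head}_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$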
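The formulation above can be sketched in a few lines of NumPy: per-head projections, scaled dot-product attention, concatenation, and a final output projection. This is a minimal illustration rather than a reference implementation. The function name `multi_head_attention`, the random weights, the sizes, and the choice of equal key and value dimensions are all assumptions made for the example; masking, batching, and biases are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Hypothetical helper.

    X:             (n, d_model) input sequence
    W_q, W_k, W_v: (h, d_model, d_k) per-head projection weights
    W_o:           (h * d_k, d_model) output projection
    Returns the output (n, d_model) and the attention weights (h, n, n).
    """
    d_k = W_q.shape[-1]
    heads, weights = [], []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i   # each (n, d_k)
        A = softmax(Q @ K.T / np.sqrt(d_k))      # (n, n), rows sum to 1
        heads.append(A @ V)                      # (n, d_k)
        weights.append(A)
    concat = np.concatenate(heads, axis=-1)      # (n, h * d_k)
    return concat @ W_o, np.stack(weights)

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))

out, attn = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape, attn.shape)  # (5, 16) (4, 5, 5)
```

The output has the same shape as the input, so attention layers of this form can be stacked.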
Properties & Function
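- Subspace Representation: Each head can learn distinct features (e.g., syntactic structure, semantic role, positional relations), increasing model expressivity at little extra cost: with per-head dimension $d_k = d_{\text{model}} / h$, the total computation is similar to that of a single head at full dimensionality.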
- Dynamic Contextualization: Attention weights are computed from Query-Key interactions and then applied to the Values, allowing the model to adaptively weight each token's contribution based on the full sequence context.
- Parallel Execution: Heads operate independently, enabling efficient computation across hardware accelerators.
- Output Aggregation: Concatenated heads are linearly projected via the output matrix $W^O$ to blend learned features, mitigating interference between specialized heads.
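Because the heads do not depend on one another, a head-by-head loop and a single batched computation should give the same result. The sketch below checks this with NumPy's `einsum`. The sizes and random weights are assumptions for illustration, and the key and value dimensions are taken to be equal.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(1)
n, d_model, h = 6, 8, 2
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))

# Sequential version: one head at a time.
loop_heads = []
for i in range(h):
    Q_i, K_i, V_i = X @ W_q[i], X @ W_k[i], X @ W_v[i]
    loop_heads.append(softmax(Q_i @ K_i.T / np.sqrt(d_k)) @ V_i)
loop_out = np.concatenate(loop_heads, axis=-1)            # (n, h * d_k)

# Batched version: all heads share one set of tensor contractions.
Q = np.einsum("nd,hdk->hnk", X, W_q)                      # (h, n, d_k)
K = np.einsum("nd,hdk->hnk", X, W_k)
V = np.einsum("nd,hdk->hnk", X, W_v)
scores = np.einsum("hik,hjk->hij", Q, K) / np.sqrt(d_k)   # (h, n, n)
batched = np.einsum("hij,hjk->hik", softmax(scores), V)   # (h, n, d_k)
# Move the head axis next to the feature axis, then flatten to concatenate.
batched_out = batched.transpose(1, 0, 2).reshape(n, h * d_k)

print(np.allclose(loop_out, batched_out))
```

The batched form is the one that maps well onto accelerators, since the head axis becomes just another batch dimension of the matrix multiplications.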
Integrated Analysis
- "Transformer Attention Mechanism Explained: Contextual Embeddings and QKV System" integrates 3Blue1Brown’s visual intuition for how the QKV system constructs contextual embedding vectors by measuring geometric relevance between token representations.
- The QKV mechanism functions as a relational query interface: Queries probe the sequence via Keys to generate attention scores, which weight Values to produce output vectors enriched with position-dependent context.
- Visual analysis indicates that individual heads can specialize in capturing specific patterns, such as character-level n-grams, syntactic dependencies, or long-range semantic links, which together contribute to the coherence of large language model outputs.
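The view of QKV as a relational lookup can be illustrated with a toy single-head example: one query is scored against three keys, and the resulting weights mix the corresponding values. All vectors below are hand-picked, hypothetical numbers chosen so that the query aligns with the second key; they are not taken from any trained model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical keys and values for three tokens (2-dimensional for readability).
K = np.array([[ 1.0, 0.0],
              [ 0.0, 1.0],
              [-1.0, 0.0]])
V = np.array([[10.0,  0.0],
              [ 0.0, 10.0],
              [ 5.0,  5.0]])

# A query pointing in the same direction as the second key.
q = np.array([0.0, 4.0])

scores = K @ q / np.sqrt(K.shape[1])   # scaled dot products, one per key
weights = softmax(scores)              # attention weights, sum to 1
output = weights @ V                   # weighted mix of the values

print("weights:", np.round(weights, 3))
print("output: ", np.round(output, 3))
```

The largest weight falls on the second token, so the output is dominated by that token's value while still blending in a small contribution from the others. This is the sense in which attention acts as a soft, differentiable lookup rather than a hard one.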
Related Concepts
- Scaled Dot-Product Attention
- Self-Attention
- Cross-Attention
- Encoder-Decoder Architecture
- Positional Encoding