Multi-Head Attention

Multi-Head Attention (MHA) is a foundational mechanism in Transformer architectures that allows the model to jointly attend to information from different representation subspaces at different positions. It extends Scaled Dot-Product Attention by projecting inputs into distinct heads, computing attention in parallel, and concatenating the outputs before a final linear transformation.

Formulation

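For an input matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$, learnable weight matrices $W_i^Q$, $W_i^K$, and $W_i^V$ generate Queries ($Q_i$), Keys ($K_i$), and Values ($V_i$) for each head $i = 1, \dots, h$:

$$Q_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V$$

Attention is computed per head using the scaled dot-product attention function, then concatenated and projected:

$$\text{head}_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$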
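
The steps in this section (per-head projections, scaled dot-product attention, concatenation, and an output projection) can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function and parameter names (`multi_head_attention`, `W_q`, `W_k`, `W_v`, `W_o`), the random weights, and the choice of per-head width `d_model // h` are assumptions made for the example, and masking, batching, and biases are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i       # each (n, d_k)
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))    # (n, n); each row sums to 1
        heads.append(weights @ V)                    # (n, d_k)
    # Concatenate heads along the feature axis, then apply the output projection.
    return np.concatenate(heads, axis=-1) @ W_o      # (n, d_model)

# Hypothetical sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 16)
```

The output has the same shape as the input, which is what lets attention layers be stacked.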

Properties & Function

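  • Subspace Representation: Each head can learn to capture different features (e.g., syntactic structure, semantic role, positional relations). Because each head typically operates on a reduced dimension (the model dimension divided by the number of heads), expressivity increases while the total computational cost stays close to that of single-head attention at full dimensionality.
  • Dynamic Contextualization: Attention weights are computed from Query-Key interactions and then applied to the Values, allowing the model to adaptively weight token contributions based on global sequence context.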
  • Parallel Execution: Heads operate independently, enabling efficient computation across hardware accelerators.
  • Output Aggregation: Concatenated heads are linearly projected via the output matrix $W^O$ to blend learned features, mitigating interference between specialized heads.
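
The parallel-execution and cost properties above can be checked numerically. The sketch below, with hypothetical sizes and random weights, compares a head-by-head loop against a single batched computation that carries the head index as a leading array axis. It also confirms that `h` per-head projections of width `d_model // h` hold as many parameters as one full-width projection. The variable names are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(1)
n, d_model, h = 6, 32, 8
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))

# Loop version: compute one head at a time.
loop_heads = []
for i in range(h):
    Q_i, K_i, V_i = X @ W_q[i], X @ W_k[i], X @ W_v[i]
    loop_heads.append(softmax(Q_i @ K_i.T / np.sqrt(d_k)) @ V_i)
loop_out = np.stack(loop_heads)                       # (h, n, d_k)

# Batched version: matmul broadcasts X across the head axis,
# so every head is computed in the same call.
Q, K, V = X @ W_q, X @ W_k, X @ W_v                   # each (h, n, d_k)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, n, n)
batched_out = softmax(scores) @ V                     # (h, n, d_k)

print(np.allclose(loop_out, batched_out))             # True

# h projections of shape (d_model, d_k) hold as many parameters
# as a single (d_model, d_model) projection.
print(W_q.size == d_model * d_model)                  # True
```

Because no head reads another head's intermediate results, the batched form is a pure reshaping of the loop, which is why the work maps well onto accelerators.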

Integrated Analysis

  • "Transformer Attention Mechanism Explained: Contextual Embeddings and QKV System" draws on 3Blue1Brown’s visual intuition for how the QKV system constructs contextual embedding vectors by measuring the geometric relevance (dot-product similarity) between token representations.
  • The QKV mechanism functions as a relational query interface: Queries probe the sequence via Keys to generate attention scores, which weight Values to produce output vectors enriched with position-dependent context.
  • Visual analysis indicates that individual heads can specialize in capturing specific patterns, such as character-level n-grams, syntactic dependencies, or long-range semantic links, which together support the coherence of large language model output.
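
The "relational query interface" reading of QKV can be illustrated with a toy soft lookup in plain Python. The keys, values, and query below are made-up vectors, and the helper names `attention_weights` and `attend` are hypothetical. The point is only that a query aligned with one key places nearly all of its attention weight on that key's value.

```python
import math

def attention_weights(query, keys):
    # Scaled dot-product scores between one query and every key.
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Softmax with the maximum subtracted for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Output is the attention-weighted average of the value vectors.
    w = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w_j * v[c] for w_j, v in zip(w, values)) for c in range(dim)]

# Made-up data: three tokens with 2-d keys and one-hot 3-d values.
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [4.0, 0.0]  # points in the same direction as the first key

weights = attention_weights(query, keys)
output = attend(query, keys, values)
print(weights)  # largest weight falls on the first key
print(output)   # close to the first value vector
```

Unlike a dictionary lookup, the result is a blend: every value contributes in proportion to how well its key matches the query.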