
Multi-Head Attention

Multi-Head Attention (MHA) is a foundational mechanism in Transformer architectures that allows the model to jointly attend to information from different representation subspaces at different positions. It extends Scaled Dot-Product Attention by projecting inputs into $h$ distinct heads, computing attention in parallel, and concatenating the outputs before a final linear transformation.

Formulation

For an input matrix $X$, learnable weight matrices generate Queries ($Q$), Keys ($K$), and Values ($V$): $Q = X W^{Q}$, $K = X W^{K}$, $V = X W^{V}$

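Attention is computed per head $i$, and the head outputs are then concatenated and projected: $MultiHead (Q, K, V) = Concat (head_{1}, \dots, head_{h}) W^{O}$, where $head_{i} = Attention (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})$. The attention function is Scaled Dot-Product Attention, $Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V$, where $d_{k}$ is the key dimension. Inside head $i$ it is applied to the projected matrices $Q W_{i}^{Q}$, $K W_{i}^{K}$, and $V W_{i}^{V}$, and dividing by $\sqrt{d_{k}}$ keeps the dot products from growing with the dimension and saturating the softmax.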
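The formulation above can be sketched in a few lines of plain Python. This is a minimal, dependency-free illustration rather than a reference implementation: the helper names (`matmul`, `attention`, `columns`, `multi_head_attention`) are hypothetical, the weights are random, and each per-head matrix $W_{i}^{Q}$, $W_{i}^{K}$, $W_{i}^{V}$ is assumed to be a contiguous column block of one stacked `d_model x d_model` matrix, which is a common layout but not something the text specifies.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def columns(m, start, stop):
    """Slice a column block out of a matrix."""
    return [row[start:stop] for row in m]

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # Assumption: head i owns columns [i*d_k, (i+1)*d_k) of the stacked projections.
    d_model = len(w_q[0])
    d_k = d_model // num_heads
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    heads = []
    for i in range(num_heads):
        lo, hi = i * d_k, (i + 1) * d_k
        head, _ = attention(columns(q, lo, hi), columns(k, lo, hi), columns(v, lo, hi))
        heads.append(head)
    # Concat(head_1, ..., head_h): join the per-head rows side by side.
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    return matmul(concat, w_o)

def rand_matrix(rows, cols):
    """Random weights for illustration only."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d_model, h = 4, 8, 2  # 4 tokens, model width 8, 2 heads
x = rand_matrix(n, d_model)
w_q, w_k, w_v, w_o = (rand_matrix(d_model, d_model) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, h)
print(len(out), len(out[0]))  # 4 8
```

The output keeps the input shape (one `d_model`-wide vector per token), which is what lets attention layers be stacked. A practical implementation would use a tensor library and batch the heads instead of looping over them.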

Properties & Function

Subspace Representation: Each head can learn distinct features (e.g., syntactic structure, semantic role, positional relations), increasing model expressivity. Because each head typically operates on a reduced dimension, the total cost stays close to that of a single full-width attention head.
Dynamic Contextualization: Attention weights are computed from Query-Key interactions and then applied to the Values, allowing the model to adaptively weight each token's contribution based on the context of the whole sequence.
Parallel Execution: Heads operate independently, enabling efficient computation across hardware accelerators.
Output Aggregation: Concatenated heads are linearly projected via $W^{O}$, which blends the features learned by the specialized heads into a single output representation.
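The cost claim in the Subspace Representation point can be checked with a small count. The sketch below assumes the common convention $d_{k} = d_{v} = d_{model} / h$, which the text does not state explicitly, and ignores bias terms. The function name `mha_projection_params` is hypothetical.

```python
def mha_projection_params(d_model, num_heads):
    """Count weights in W_i^Q, W_i^K, W_i^V (per head) plus W^O, ignoring biases.

    Assumes d_k = d_v = d_model / num_heads (a common convention, not stated in the text).
    """
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    per_head = 3 * d_model * d_k           # three d_model x d_k projections
    output = (num_heads * d_k) * d_model   # W^O maps the concatenation back to d_model
    return num_heads * per_head + output

for h in (1, 4, 8):
    print(h, mha_projection_params(512, h))
# 1 1048576
# 4 1048576
# 8 1048576
```

Under that assumption the parameter count is $4 d_{model}^{2}$ regardless of the number of heads, so adding heads redistributes capacity across subspaces instead of adding weights.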

Integrated Analysis

The note Transformer Attention Mechanism Explained: Contextual Embeddings and QKV System draws on 3Blue1Brown's visual intuition for how the QKV system constructs Contextual Embedding vectors by measuring geometric relevance between token representations.
The QKV mechanism functions as a relational query interface: Queries probe the sequence via Keys to generate attention scores, which weight Values to produce output vectors enriched with position-dependent context.
Visual analysis indicates that individual heads can specialize in capturing specific patterns, such as character-level n-grams, syntactic dependencies, or long-range semantic links, which together support the coherence of large language model output.
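The "relational query interface" reading can be made concrete with a toy example: one Query scored against three Keys, with the resulting weights mixing three Values. The vectors below are invented for illustration and the function name `attend` is hypothetical. The point is only that Keys more aligned with the Query receive larger weights.

```python
import math

def attend(query, keys, values):
    """Score one query against every key, then return (weights, weighted sum of values)."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Made-up 2-dimensional vectors: keys 0 and 2 align with the query, key 1 is orthogonal.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```

Keys 0 and 2 have the same dot product with the query, so they receive equal weight, while the orthogonal key 1 receives less. The output vector is therefore pulled toward the first coordinate, mirroring how a token's representation is enriched by the tokens it attends to.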

Related Concepts

Scaled Dot-Product Attention
self-attention
cross-attention
Encoder-Decoder Architecture
Positional Encoding

NemoClaw Knowledge Wiki

