Attention Heads
Definition
Sub-components of the Multi-Head Attention mechanism within Transformer architectures. They enable the model to simultaneously attend to information from different representation subspaces at different positions.
Mechanism
- Each head performs independent Scaled Dot-Product Attention using partitioned Query (Q), Key (K), and Value (V) projections.
- The outputs of all heads are concatenated and linearly transformed to produce the final output of the attention layer.
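The two steps above can be sketched in NumPy; dimensions, weight initialization, and function names here are illustrative rather than taken from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Toy multi-head self-attention over one sequence.

    x: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model).
    Each head attends within its own d_model // num_heads slice.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project, then partition into heads: (num_heads, seq_len, d_head)
    def project_and_split(W):
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = (project_and_split(W) for W in (Wq, Wk, Wv))

    # Scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ v  # (num_heads, seq_len, d_head)

    # Concatenate heads, then apply the final linear transform
    concat = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 64, 10, 8
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads)
print(y.shape)  # (10, 64)
```

Note that the output shape matches the input shape: partitioning into heads and re-concatenating preserves d_model, which is what lets attention layers stack.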
Inference and Performance Context
- Inference Complexity: The attention computation across all heads scales quadratically with sequence length, making head processing a primary driver of the complexity found in modern LLM Inference engines.
- Memory and Throughput: Efficient management of the Key and Value projections each head produces (the KV cache) is critical to Performance Optimization, particularly regarding the memory bandwidth bottlenecks addressed by advanced Memory Mapping techniques.
- Optimization Strategies: Scaling the number of heads and their dimensions raises arithmetic and memory-traffic demands, so specialized Inference Engines batch the independent per-head computations to sustain high throughput.
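To give a rough sense of the memory pressure these points describe, here is a back-of-the-envelope KV-cache size estimate; the layer and head counts below are hypothetical, chosen only to resemble a 7B-class model with fp16 weights:

```python
def kv_cache_bytes(num_layers, num_heads, d_head, seq_len, batch, bytes_per_elem=2):
    # Per layer, K and V are each cached: batch * seq_len * num_heads * d_head
    # elements; the leading 2 accounts for both K and V.
    return 2 * num_layers * batch * seq_len * num_heads * d_head * bytes_per_elem

# Hypothetical configuration: 32 layers, 32 heads of dimension 128,
# 4096-token context, batch of 8, fp16 (2 bytes per element).
size_gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8) / 2**30
print(f"{size_gib:.1f} GiB")  # 16.0 GiB
```

At this scale the cache alone rivals the model weights, which is why KV-cache management dominates memory-oriented inference optimization.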
Backlink: 2026 04 22 LLM Inference Engines Memory Mapping and Performance Optimization
Source Notes
- 2026-04-22: LLM [[concepts/inference|Inference: Engines, Memory Mapping, and Performance Optimization]]