Attention Heads

Definition

Sub-components of the Multi-Head Attention mechanism within Transformer architectures. They enable the model to simultaneously attend to information from different representation subspaces at different positions.
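
In the standard formulation (Vaswani et al., 2017), each head attends within its own learned subspace, and the head outputs are recombined by a final projection:

$$
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V), \qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O
$$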

Mechanism

  • Each head performs independent Scaled Dot-Product Attention using partitioned Query (Q), Key (K), and Value (V) projections.
  • The outputs of all heads are concatenated and linearly transformed to produce the final output of the attention layer (see the sketch after this list).
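
A minimal NumPy sketch of this mechanism (names, shapes, and the random weights are illustrative assumptions, not a production implementation; the fused (d_model, d_model) projection matrices are the standard trick equivalent to stacking per-head matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model); each W: (d_model, d_model) fused projection."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # each head works in a smaller subspace

    # Project, then partition the model dimension into per-head slices:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def project(W):
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = project(W_q), project(W_k), project(W_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    out = softmax(scores) @ v                            # (heads, seq, d_head)

    # Concatenate the heads and apply the final linear transform.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ W_o

# Usage with random weights (hypothetical sizes):
rng = np.random.default_rng(0)
seq_len, d_model, heads = 10, 64, 8
x = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, heads).shape)  # (10, 64)
```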

Inference and Performance Context

  • Inference Complexity: Per-head attention scores grow quadratically with sequence length, making the multi-head computation a primary driver of compute cost in modern LLM Inference engines.
  • Memory and Throughput: During autoregressive decoding, each head's Key and Value projections are cached (the KV cache); streaming this cache through memory is a central bandwidth bottleneck addressed by advanced Memory Mapping techniques (see the sizing sketch after this list).
  • Optimization Strategies: Scaling the number of heads and their dimensions raises the throughput demands of the concurrent per-head computations, which specialized Inference Engines are built to handle.
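
The memory pressure above is dominated by the KV cache: during autoregressive decoding, every layer stores each head's Key and Value projections for all previous tokens. A back-of-the-envelope sizing sketch (the configuration values are assumptions, loosely modeled on a 7B-scale model):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Each layer caches one Key and one Value vector per head per token;
    # bytes_per_elem=2 assumes fp16/bf16 storage.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 7B-scale configuration: 32 layers, 32 KV heads, head_dim 128.
per_seq = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{per_seq / 2**30:.1f} GiB per 4096-token sequence")  # ~2.0 GiB
```

At batch size 8 this is roughly 16 GiB of cache that must be read on every decode step, which is why memory bandwidth, rather than FLOPs, typically bounds decode throughput.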

Backlink: 2026-04-22 LLM Inference: Engines, Memory Mapping, and Performance Optimization

Source Notes

  • 2026-04-22: LLM [[concepts/inference|Inference: Engines, Memory Mapping, and Performance Optimization]]