Attention Heads
Definition
Sub-components of the Multi-Head Attention mechanism within Transformer architectures. They enable the model to simultaneously attend to information from different representation subspaces at different positions.
Mechanism
- Each head performs independent Scaled Dot-Product Attention using partitioned Query (Q), Key (K), and Value (V) projections.
- The outputs of all heads are concatenated and linearly transformed to produce the final output of the attention layer.
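The two steps above can be sketched in NumPy; dimensions, weight initialization, and function names here are illustrative rather than taken from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Toy multi-head self-attention over one sequence.

    x: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model).
    Each head attends within its own d_model // num_heads slice.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project, then partition into heads: (num_heads, seq_len, d_head)
    def project_and_split(W):
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = (project_and_split(W) for W in (Wq, Wk, Wv))

    # Scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ v  # (num_heads, seq_len, d_head)

    # Concatenate heads, then apply the final linear transform
    concat = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 64, 10, 8
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads)
print(y.shape)  # (10, 64)
```

Note that the output shape matches the input shape: partitioning into heads and re-concatenating preserves d_model, which is what lets attention layers stack.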
Inference and Performance Context
- Inference Complexity: The attention computation across all heads scales quadratically with sequence length, making head processing a primary driver of the complexity found in modern LLM Inference engines.
- Memory and Throughput: Efficient management of the Key and Value projections each head produces (the KV cache) is critical to Performance Optimization, particularly regarding the memory bandwidth bottlenecks addressed by advanced Memory Mapping techniques.
- Optimization Strategies: Scaling the number of heads and their dimensions raises arithmetic and memory-traffic demands, so specialized Inference Engines batch the independent per-head computations to sustain high throughput.
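To give a rough sense of the memory pressure these points describe, here is a back-of-the-envelope KV-cache size estimate; the layer and head counts below are hypothetical, chosen only to resemble a 7B-class model with fp16 weights:

```python
def kv_cache_bytes(num_layers, num_heads, d_head, seq_len, batch, bytes_per_elem=2):
    # Per layer, K and V are each cached: batch * seq_len * num_heads * d_head
    # elements; the leading 2 accounts for both K and V.
    return 2 * num_layers * batch * seq_len * num_heads * d_head * bytes_per_elem

# Hypothetical configuration: 32 layers, 32 heads of dimension 128,
# 4096-token context, batch of 8, fp16 (2 bytes per element).
size_gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8) / 2**30
print(f"{size_gib:.1f} GiB")  # 16.0 GiB
```

At this scale the cache alone rivals the model weights, which is why KV-cache management dominates memory-oriented inference optimization.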
Backlink: 2026 04 22 LLM Inference Engines Memory Mapping and Performance Optimization
Source Notes
- 2026-04-22: LLM [[concepts/inference|Inference: Engines, Memory Mapping, and Performance Optimization]]