Transformer Attention Mechanism

The core computation layer of Transformer architectures: it represents each input token as a weighted combination of all tokens in the sequence, which enables parallel processing across positions and direct capture of long-range dependencies.

Key Components

  • QKV Projections: Input embeddings are linearly projected into Query (Q), Key (K), and Value (V) matrices. Queries and keys are used to compute attention scores; values carry the content that is aggregated.
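  • Attention Weights: Derived from the dot-product similarity between queries and keys, scaled by 1/sqrt(d_k) (d_k is the key dimension) and normalized via softmax so that each query's weights sum to 1. They determine how much each value contributes to the output.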
  • Contextual Embeddings: Each output vector is a sum of the values weighted by the attention weights, producing a representation conditioned on the entire sequence.
  • Multi-Head Attention: Parallel attention computations over distinct subspaces capture diverse relational patterns.
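
The components above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes a single unbatched sequence, no masking, no bias terms, and random weights. The function names and the output projection `w_o` are hypothetical choices for this sketch and are not taken from the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k: (..., seq_len, d_k); v: (..., seq_len, d_v)
    d_k = q.shape[-1]
    # Query-key similarity, scaled by sqrt(d_k): (..., seq_len, seq_len)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ v, weights         # weighted sum of values

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # x: (seq_len, d_model); every weight matrix: (d_model, d_model)
    # Assumes d_model is divisible by num_heads.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # QKV projections, then one subspace per head.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    heads, weights = scaled_dot_product_attention(q, k, v)

    # Concatenate heads: (num_heads, seq_len, d_head) -> (seq_len, d_model)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    # Hypothetical output projection mixing the heads (an assumption of this sketch).
    return concat @ w_o, weights

# Toy usage with random inputs and weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Here `out` has one contextual embedding per input token, with shape `(seq_len, d_model)`, and `weights` holds one `seq_len × seq_len` attention matrix per head. Splitting `d_model` evenly across heads keeps the total cost close to that of a single full-width head, which is one common design choice among several.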