Cross-Attention

Definition

Cross-attention is an attention mechanism in Transformer architectures in which the query ($Q$) tensors are derived from a different sequence than the key ($K$) and value ($V$) tensors. It enables information exchange between modalities or across the encoder-decoder boundary.
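To make the "distinct sequence" part concrete, the projections can be written out as below. This is a sketch under assumed notation: $X_{\text{tgt}}$ (the sequence issuing queries), $X_{\text{src}}$ (the sequence being attended to), and the learned matrices $W_Q$, $W_K$, $W_V$ are symbols introduced here for illustration and do not appear in the original text.

```latex
\[
Q = X_{\text{tgt}} W_Q, \qquad
K = X_{\text{src}} W_K, \qquad
V = X_{\text{src}} W_V
\]
% X_tgt has one row per query position and X_src has one row per source position.
% Setting X_tgt = X_src recovers ordinary self-attention.
```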

Mechanism

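  • Formula: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$, where $d_k$ is the key dimension.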
  • Complexity: $O(n \cdot m)$, where $n$ is the query length and $m$ is the source sequence length.
  • Contrast: Differs from self-attention, where $Q$, $K$, and $V$ all originate from the same input sequence.
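A minimal single-head sketch of the mechanism above, written in plain Python so that it runs without dependencies. It computes softmax(Q K^T / sqrt(d_k)) V with Q taken from one sequence and K, V taken from another. The function names, the toy inputs, and the omission of learned projections, masking, and multiple heads are assumptions made for illustration; a practical implementation would use a tensor library.

```python
import math


def softmax(row):
    # Subtract the row maximum before exponentiating for numerical stability.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(Q, K, V):
    """Single-head scaled dot-product cross-attention.

    Q: n x d_k list of query vectors (from the target sequence).
    K: m x d_k list of key vectors (from the source sequence).
    V: m x d_v list of value vectors (from the source sequence).
    Returns an n x d_v list of output vectors.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    d_v = len(V[0])
    out = []
    for q in Q:
        # One row of the n x m score matrix: q against every source key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the source value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out


# Toy inputs (illustrative): n = 2 queries, m = 3 source positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = cross_attention(Q, K, V)
```

The intermediate score matrix has n x m entries (2 x 3 in this toy case), which is where the cost proportional to both sequence lengths comes from.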

Applications

  • Encoder-Decoder models: Decoder layers query encoder representations (e.g., T5, BART).
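  • Multimodal alignment: Text queries attend to image/video patches (e.g., Flamingo). CLIP and LLaVA are related multimodal models but do not use cross-attention: CLIP trains separate encoders with a contrastive objective, and LLaVA feeds projected image features to the language model as input tokens.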
  • Diffusion models: UNet blocks condition generation via cross-attention with text embeddings.
  • ControlNet/Adapter: Inject auxiliary signals through cross-attention pathways.

Optimizations

  • Sparse attention: Prunes or subsets attention interactions to reduce memory/compute.
  • Linear attention: Kernelized approximations for near-linear scaling.
  • FlashAttention: I/O-aware tiling for efficient dense attention computation.
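The tiling idea behind FlashAttention-style kernels rests on an online softmax: keys and values are processed block by block while a running maximum and normalizer are maintained, so the full score row never has to be stored at once. The sketch below shows only that arithmetic in plain Python and checks it against a dense reference. It is not the FlashAttention kernel itself, and the function names, block size, and toy data are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_attention(Q, K, V):
    # Reference: materializes the full row of scores for each query.
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * dot(q, k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e * v[j] for e, v in zip(exps, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    # Processes keys/values in blocks, keeping only running statistics per query.
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        running_max = -math.inf
        denom = 0.0
        acc = [0.0] * len(V[0])
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [scale * dot(q, k) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial sums to the new maximum
            # (exp(-inf) is 0.0, so the first block starts from zero).
            correction = math.exp(running_max - new_max)
            denom *= correction
            acc = [a * correction for a in acc]
            for s, v in zip(scores, v_blk):
                p = math.exp(s - new_max)
                denom += p
                acc = [a + p * vj for a, vj in zip(acc, v)]
            running_max = new_max
        out.append([a / denom for a in acc])
    return out


# Toy data (illustrative): 2 queries, 5 source positions, d_k = 2, d_v = 3.
Q = [[0.5, -1.0], [2.0, 0.3]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.5, 0.5, 0.0],
     [3.0, -2.0, 1.0], [0.0, 0.0, 4.0]]
reference = dense_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block=2)
```

Both functions perform the same number of multiply-adds; the saving in the tiled form is in how much intermediate state must be held at once, which is the quantity an I/O-aware kernel is designed around.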

Recent Developments