Cross-Attention
Definition
Cross-attention is an attention mechanism in Transformer architectures in which the query ($Q$) tensors are derived from a different sequence than the key ($K$) and value ($V$) tensors. It enables information exchange between modalities or across encoder-decoder boundaries.
Mechanism
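- Formula: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$, where $d_k$ is the key dimension.
- Complexity: $O(n \cdot m)$, where $n$ is the query length and $m$ is the source sequence length.
- Contrast: Differs from self-attention, where $Q$, $K$, and $V$ all originate from the same input sequence.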
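The mechanism can be sketched in a few lines. The following is a minimal, dependency-free illustration rather than a reference implementation: it assumes single-head attention, omits the learned projections that normally produce the queries, keys, and values, and uses hypothetical toy inputs with 2 target positions and 3 source positions. Plain Python lists stand in for tensors; a practical implementation would use an array library.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one sequence
    and keys/values from another.

    queries: n vectors of size d_k (target side)
    keys:    m vectors of size d_k (source side)
    values:  m vectors of size d_v (source side)
    Returns n output vectors of size d_v.
    """
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # One score per source position; n such rows give the n-by-m cost.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted sum of the source-side values.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs

# Hypothetical toy inputs (illustrative only).
Q = [[1.0, 0.0], [0.0, 1.0]]                              # n = 2 target positions
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                  # m = 3 source positions
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # one-hot values
out = cross_attention(Q, K, V)
```

Because the toy values are one-hot, each row of `out` equals that query's attention weights over the three source positions, which makes the result easy to inspect. Passing the same sequence for all three arguments would turn the same function into self-attention.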
Applications
- Encoder-Decoder models: Decoder layers query encoder representations (e.g., T5, BART).
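- Multimodal alignment: Text queries attend to image/video patches (e.g., Flamingo). CLIP and LLaVA are related but do not rely on dedicated cross-attention layers: CLIP aligns separately encoded modalities with a contrastive loss, and LLaVA projects image features into the language model's input token sequence.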
- Diffusion models: UNet blocks condition generation via cross-attention with text embeddings.
- ControlNet/Adapter: Inject auxiliary signals through cross-attention pathways.
Optimizations
- Sparse attention: Prunes or subsets attention interactions to reduce memory/compute.
- Linear attention: Kernelized approximations for near-linear scaling.
- FlashAttention: I/O-aware tiling for efficient dense attention computation.
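The tiling idea can be illustrated with the online-softmax recurrence, which lets attention be computed one block of source positions at a time without materializing the full row of scores. This is only a sketch of that numerical trick under simplifying assumptions (one query vector, plain Python lists, a hypothetical `block_size` parameter and toy data); it is not the FlashAttention kernel, which additionally tiles over queries and manages GPU memory explicitly.

```python
import math

def dense_attention_row(q, keys, values):
    """Reference: softmax(q K^T / sqrt(d_k)) V for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    d_v = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / total for j in range(d_v)]

def tiled_attention_row(q, keys, values, block_size=2):
    """Same result, visiting the source in blocks and keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    d_v = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator relative to running_max
    acc = [0.0] * d_v             # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[j] for e, v in zip(exps, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Hypothetical toy data: one query against five source positions.
q = [0.5, -1.0]
keys = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
dense = dense_attention_row(q, keys, values)
tiled = tiled_attention_row(q, keys, values, block_size=2)
```

Up to floating-point rounding, `tiled` matches `dense` for any block size, which is why this kind of tiling counts as an exact optimization, in contrast to the sparse and linear approximations listed alongside it.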
Recent Developments
- SubQ (Subquadratic, 2026) implements a sparse attention architecture supporting a 12M token context window.
- Reports claim 52x efficiency improvements over standard dense attention baselines.
- Sparse cross-attention strategies may resolve quadratic bottlenecks in ultra-long context or high-resolution multimodal fusion.
- Verification concerns have been raised regarding context fidelity and the reproducibility of the reported efficiency metrics.
- Reference: SubQ AI: 12M Token Context, Sparse Attention Architecture, and Verification Concerns.