Cross-Attention

Definition

Cross-attention is an attention mechanism in Transformer architectures in which the query ($Q$) tensors are derived from one sequence while the key ($K$) and value ($V$) tensors are derived from a different one. It enables information exchange between modalities or across encoder-decoder boundaries.

Mechanism

Formula: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$, where $d_{k}$ is the key dimension.
Complexity: $O(N \cdot M \cdot d)$, where $N$ is the query length, $M$ is the source sequence length, and $d$ is the feature dimension.
Contrast: Differs from self-attention, where $Q$, $K$, and $V$ all originate from the same input sequence.
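
The formula and its cost can be made concrete with a minimal sketch in plain Python. It assumes single-head attention with no learned projections, masking, or batching, and uses the standard $\sqrt{d_{k}}$ scaling; the function names `softmax` and `cross_attention` are illustrative, not taken from any library. The outer loop runs over the $N$ queries and the inner comprehension over the $M$ source positions, which is where the $O(N \cdot M \cdot d)$ cost comes from.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q, K, V):
    """Q: N x d_k (from one sequence); K: M x d_k and V: M x d_v (from another)."""
    scale = 1.0 / math.sqrt(len(K[0]))
    output = []
    for q in Q:  # N queries
        # One score per source position: M dot products of length d_k.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the M value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# N = 2 target positions attend over M = 3 source positions.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = cross_attention(Q, K, V)
```

The output has one row per query ($N$ rows), regardless of the source length $M$. Passing the same sequence for all three arguments would reduce this to self-attention.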

Applications

Encoder-Decoder models: Decoder layers query encoder representations (e.g., T5, BART).
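Multimodal alignment: Text queries attend to image/video patches (e.g., Flamingo). Related models such as CLIP and LLaVA align modalities without cross-attention layers, via contrastive dual encoders and projected visual tokens respectively.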
Diffusion models: UNet blocks condition generation via cross-attention with text embeddings.
ControlNet/Adapter: Inject auxiliary signals through cross-attention pathways.

Optimizations

Sparse attention: Prunes or subsets attention interactions to reduce memory/compute.
Linear attention: Kernelized approximations for near-linear scaling.
FlashAttention: I/O-aware tiling for efficient dense attention computation.
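
The tiling idea can be illustrated with the online-softmax accumulation that block-wise attention relies on. The sketch below is a plain-Python illustration under stated assumptions: `tiled_attention_row` and `block_size` are hypothetical names, a single query vector is processed, and only the arithmetic is modeled, not the GPU memory hierarchy or the actual FlashAttention kernel. Keys and values are visited one block at a time, so the full row of $M$ scores is never held at once, yet the result matches dense softmax attention.

```python
import math

def tiled_attention_row(q, K, V, block_size=2):
    """Attention output for one query vector, visiting K and V in blocks."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator relative to running_max
    acc = [0.0] * len(V[0])      # unnormalized weighted sum of value vectors
    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
tiled = tiled_attention_row(q, K, V, block_size=2)
```

The result does not depend on `block_size`. In a real kernel the block size would be chosen to fit fast on-chip memory, which is an assumption outside what this sketch models.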

Recent Developments

SubQ (Subquadratic, 2026) implements a sparse attention architecture supporting a 12M-token context window.
Reports claim 52x efficiency improvements over standard dense attention baselines.
Sparse cross-attention strategies may resolve quadratic bottlenecks in ultra-long context or high-resolution multimodal fusion.
Verification concerns have been raised regarding context fidelity and the reproducibility of the reported efficiency metrics.
Reference: SubQ AI: 12M Token Context, Sparse Attention Architecture, and Verification Concerns.
