Contextual Embeddings

Dynamic vector representations of tokens in which each embedding depends on the surrounding sequence context, helping to resolve polysemy and capture long-range dependencies. Unlike static embeddings, which assign one fixed vector per vocabulary entry, these are computed on the fly by the model for each input sequence.

Mechanism & Properties

  • Transformer Architecture: Contextual embeddings emerge from [[concepts/self-attention]] layers within Transformer models. Each layer refines token representations by aggregating information from other positions in the sequence.
  • QKV Computation: The attention mechanism projects each input into Query, Key, and Value vectors through learned linear maps. Attention scores are the dot products of Q and K, scaled by the square root of the key dimension and normalized via softmax; the resulting weights are applied to V to produce a context-weighted sum for each position.
  • Dynamic Representation: A single token yields distinct vectors depending on neighbors, capturing semantic nuance absent in fixed-lookup embeddings.
  • Layer-wise Evolution: Contextualization deepens through stacked layers; early layers tend to capture local syntax and adjacency, while deeper layers tend to model global semantics and more abstract relationships.
  • Visual Intuition: 3Blue1Brown’s breakdown clarifies how attention weights function as dynamic focus mechanisms to construct rich, position-aware embeddings.
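
The QKV computation and the "same token, different vector" property described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not how any particular model is implemented: it uses a single attention head and a single layer, omits positional encodings, and fills the embedding table and projection matrices with random values. The toy vocabulary and the names `E`, `W_q`, `W_k`, `W_v`, `self_attention`, and `contextual` are hypothetical, chosen only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence.

    X has shape (seq_len, d_model); the result has the same shape.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

# Hypothetical toy setup: random values stand in for trained parameters.
rng = np.random.default_rng(0)
d_model = 8
vocab = {"river": 0, "bank": 1, "money": 2}
E = rng.normal(size=(len(vocab), d_model))  # static lookup table
W_q, W_k, W_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

def contextual(tokens):
    # Look up static vectors, then mix them with self-attention.
    X = E[[vocab[t] for t in tokens]]
    return self_attention(X, W_q, W_k, W_v)

out_a, w_a = contextual(["river", "bank"])
out_b, w_b = contextual(["money", "bank"])

static_bank = E[vocab["bank"]]        # identical in both sentences
bank_a, bank_b = out_a[1], out_b[1]   # contextual vectors for "bank"

print("contextual vectors equal:", np.allclose(bank_a, bank_b))
print("distance between them:", np.linalg.norm(bank_a - bank_b))
```

The static row for "bank" is the same in both sequences, while the attention output at that position depends on which neighbor it is mixed with. In a trained Transformer this mixing is repeated across many heads and layers, with learned rather than random projections.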

Sources & Notes