Self-Supervised Learning & Joint Embedding Predictive Architecture

Self-Supervised Learning (SSL) is a paradigm in which models learn representations from unlabeled data by deriving supervisory signals from the structure of the data itself. A prominent SSL framework is the Joint Embedding Predictive Architecture (JEPA), proposed by Yann LeCun for training world models.

Joint Embedding Predictive Architecture (JEPA)

JEPA predicts future states or missing context within an abstract embedding space, rather than reconstructing raw data or predicting sequential tokens.

Core Mechanics

  • Latent Prediction: A context encoder processes the observed inputs to produce representations; a predictor network then forecasts the representations of the target inputs (future or masked) in the latent space.
  • No Reconstruction: Loss functions operate solely on embeddings, avoiding the high-dimensional noise and computational waste associated with pixel or token-level reconstruction.
  • Modularity: Supports diverse modalities (vision, text, sensor data) by mapping inputs to a unified representation space before prediction.
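
The mechanics above can be sketched as a single forward pass. This is a minimal illustration under stated assumptions, not a reference implementation: plain Python lists stand in for tensors, each network is reduced to one random linear map, and the names (context_encoder, target_encoder, predictor, latent_loss) and dimensions are hypothetical. The separate target encoder and the mean-squared-error loss are also assumptions; the text only requires that the loss compare embeddings.

```python
import random


def random_matrix(rows, cols, rng):
    """Build a rows x cols weight matrix with small random entries."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(weights, vec):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def latent_loss(predicted, target):
    """Mean squared error between two embeddings (never raw inputs)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


rng = random.Random(0)
INPUT_DIM, EMBED_DIM = 8, 3  # hypothetical sizes, chosen for illustration

# Each "network" is a single linear map here (an assumption for brevity).
context_encoder = random_matrix(EMBED_DIM, INPUT_DIM, rng)
target_encoder = random_matrix(EMBED_DIM, INPUT_DIM, rng)
predictor = random_matrix(EMBED_DIM, EMBED_DIM, rng)

observed = [rng.random() for _ in range(INPUT_DIM)]  # visible context
target = [rng.random() for _ in range(INPUT_DIM)]    # masked or future input

# Latent prediction: encode the context, then forecast the target embedding.
context_embedding = linear(context_encoder, observed)
predicted_embedding = linear(predictor, context_embedding)

# The target is encoded too, so the comparison happens in embedding space.
target_embedding = linear(target_encoder, target)

# No reconstruction: the loss never touches the raw target values.
loss = latent_loss(predicted_embedding, target_embedding)
print(f"embedding size: {len(predicted_embedding)}, latent loss: {loss:.4f}")
```

The loss here is computed over EMBED_DIM values rather than INPUT_DIM values, which illustrates in miniature why operating on embeddings can be cheaper than reconstructing the input. A real system would train these networks by gradient descent and would need some mechanism to keep the embeddings from collapsing to a constant; both are omitted from this sketch.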

Comparison to LLMs

  • JEPA aims to address inefficiencies in large language models (LLMs) by modeling state transitions and causal structure directly, rather than relying on the statistical co-occurrence of tokens.