Yann LeCun’s JEPA

Joint Embedding Predictive Architecture (JEPA) is a self-supervised learning framework proposed by Yann LeCun as an alternative to generative approaches such as next-token prediction. It aims to learn rich, abstract representations of the world by predicting the representations of missing or future observations in a latent space, rather than reconstructing raw pixels or predicting the next word.

Core Principles

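Latent Space Prediction: Unlike Reconstruction-based Models, JEPA does not predict raw data (pixels/tokens). A context encoder embeds the observed input, and a predictor estimates the embedding that a separate target encoder assigns to the unobserved or future input. Unlike Contrastive Learning, it does not rely on negative pairs.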
Abstraction Gap: Because the prediction target is an embedding rather than the raw input, the model is free to discard low-level details that are hard to predict (noise, texture) and is pushed toward high-level semantic features.
Contextual Prediction: The model takes a context (e.g., past video frames, visible image patches) and predicts the embedding of the target (e.g., a future frame, a masked patch). Training minimizes a distance, such as the squared L2 distance, between the predicted and actual latent embeddings.
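
The principles above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not LeCun's reference implementation: the encoders and predictor are plain linear maps, the "future" input is a shifted, noisy copy of the context, and all names (latent_loss, train_step, ema_update, the sizes, the learning rate) are hypothetical. Updating the target encoder as an exponential moving average (EMA) of the context encoder is one common choice, used for example in I-JEPA; the text above does not prescribe it.

```python
import numpy as np


def latent_loss(x_context, x_target, w_context, w_pred, w_target):
    """Mean squared L2 distance between predicted and actual target embeddings."""
    s_pred = (x_context @ w_context) @ w_pred   # encode context, then predict
    s_target = x_target @ w_target              # encode target (treated as a constant)
    return float(np.mean(np.sum((s_pred - s_target) ** 2, axis=-1)))


def train_step(x_context, x_target, w_context, w_pred, w_target, lr=0.01):
    """One gradient step on the context encoder and predictor only.

    No gradient flows into the target encoder (stop-gradient).
    """
    batch = x_context.shape[0]
    s_context = x_context @ w_context
    err = s_context @ w_pred - x_target @ w_target   # (batch, d_latent)
    grad_out = 2.0 * err / batch                     # dLoss / d(s_pred)
    grad_pred = s_context.T @ grad_out               # dLoss / d(w_pred)
    grad_context = x_context.T @ (grad_out @ w_pred.T)  # dLoss / d(w_context)
    return w_context - lr * grad_context, w_pred - lr * grad_pred


def ema_update(w_target, w_context, momentum=0.99):
    """Move the target encoder slowly toward the context encoder."""
    return momentum * w_target + (1.0 - momentum) * w_context


# Toy data (assumption): the "future" input is the context shifted by one
# feature position plus a little noise.
rng = np.random.default_rng(0)
d_in, d_latent = 16, 4
x_context = rng.normal(size=(32, d_in))
x_target = np.roll(x_context, 1, axis=1) + 0.1 * rng.normal(size=(32, d_in))

w_context = rng.normal(scale=0.1, size=(d_in, d_latent))  # context encoder
w_target = w_context.copy()                               # target encoder
w_pred = np.eye(d_latent)                                 # predictor

loss_before = latent_loss(x_context, x_target, w_context, w_pred, w_target)
w_context, w_pred = train_step(x_context, x_target, w_context, w_pred, w_target)
loss_after = latent_loss(x_context, x_target, w_context, w_pred, w_target)

# Only after the gradient step does the target encoder follow via EMA.
w_target = ema_update(w_target, w_context)
```

Note that the loss never touches x_target directly: the error is measured between two embeddings, so nothing forces the model to reproduce raw input values. A real implementation would use deep encoders and needs some mechanism (here, the stop-gradient plus EMA target) to discourage the trivial solution where every input maps to the same embedding.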

Potential Advantages

Reduced Overfitting: Because it does not reconstruct the input, JEPA is not pushed to model every low-level detail of the training data, which may lead to better generalization.
Semantic Richness: The objective favors structural and predictable relationships in the data over superficial correlations.
Efficiency: When the latent space is lower-dimensional than the raw input, predicting in it can be computationally cheaper than reconstructing the high-dimensional raw data.