Yann LeCun’s JEPA
Joint Embedding Predictive Architecture (JEPA) is a self-supervised learning framework proposed by Yann LeCun as an alternative to generative objectives such as next-token prediction. It aims to learn abstract representations of the world by predicting the representations of missing or future observations in a latent space, rather than reconstructing raw pixels or predicting the next word.
Core Principles
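- Latent Space Prediction: Unlike Reconstruction-based Models, JEPA does not predict raw data (pixels/tokens), and unlike Contrastive Learning it does not rely on negative pairs. A predictor network maps the context encoder's output to the embedding of the target input, and that target embedding is produced by a separate target encoder.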
- Abstraction Gap: Because the prediction target is an embedding rather than the raw input, the model is not penalized for failing to reproduce unpredictable low-level details (noise, texture), which encourages it to focus on high-level semantic features.
- Contextual Prediction: The model takes a context (e.g., past video frames, visible image patches) and predicts the embedding of the future or masked target. Training minimizes a distance between the predicted and the actual latent embeddings.
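The three principles above can be sketched in a few lines. This is a minimal illustration in plain Python, not a reference implementation: the linear "encoders", the mean-squared-error distance, the learning rate, and the exponential-moving-average (EMA) rule for the target encoder are all assumptions chosen for brevity, and every name (`linear`, `latent_loss`, `predictor_step`, `ema_update`) is hypothetical.

```python
import random

def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def latent_loss(predicted, target):
    """Mean squared distance between predicted and target embeddings."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def predictor_step(predictor, s_context, s_target, lr=0.1):
    """One gradient-descent step on the predictor weights for the MSE loss."""
    predicted = linear(predictor, s_context)
    d = len(s_target)
    return [[w - lr * (2.0 / d) * (p - t) * s for w, s in zip(row, s_context)]
            for row, p, t in zip(predictor, predicted, s_target)]

def ema_update(target_w, context_w, momentum=0.99):
    """Move target-encoder weights toward the context encoder (assumed EMA rule)."""
    return [[momentum * t + (1.0 - momentum) * c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(target_w, context_w)]

rng = random.Random(0)
input_dim, embed_dim = 8, 4

context_encoder = random_matrix(embed_dim, input_dim, rng)
target_encoder = [row[:] for row in context_encoder]  # starts as a copy
predictor = random_matrix(embed_dim, embed_dim, rng)

context = [rng.gauss(0.0, 1.0) for _ in range(input_dim)]  # e.g. past frames
future = [rng.gauss(0.0, 1.0) for _ in range(input_dim)]   # e.g. the next frame

s_context = linear(context_encoder, context)
s_target = linear(target_encoder, future)   # treated as a constant target
s_predicted = linear(predictor, s_context)  # the prediction lives in latent space

loss_before = latent_loss(s_predicted, s_target)
predictor = predictor_step(predictor, s_context, s_target)
loss_after = latent_loss(linear(predictor, s_context), s_target)

# Only the predictor is trained in this sketch, so this update leaves the
# target encoder unchanged; with a trained context encoder it would lag behind it.
target_encoder = ema_update(target_encoder, context_encoder)
```

Note that the loss is computed only between embeddings: no decoder back to pixels or tokens appears anywhere in the sketch. A real system would also train the context encoder and needs some mechanism to keep the embeddings from collapsing to a constant; the slowly updated target encoder shown here is one commonly used choice, assumed rather than taken from this note.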
Key Advantages
- Reduced Overfitting: Because it does not have to reconstruct every detail of the input, JEPA has less incentive to memorize low-level data statistics, which may lead to better generalization.
- Semantic Richness: The objective is intended to favor causal and structural relationships over superficial correlations.
- Efficiency: Predicting in a compact latent space can be computationally cheaper than reconstructing high-dimensional raw data, since no pixel-level or token-level decoder is needed.
Related Concepts
- Self-Supervised Learning
- Representation Learning
- World Models
- Large Language Models (comparison to next-token prediction)