Vision-based AI
A subset of Artificial Intelligence concerned with systems that perceive, interpret, and understand visual data (images, video, and 3D environments) in order to build world models.
Core Architectures & Approaches
- Transformer-based Models: The current standard for vision; images are processed as sequences of visual tokens (see the patch-tokenization sketch after this list).
- JEPA (Joint-Embedding Predictive Architecture):
  - A non-generative approach that predicts missing parts of a signal in latent (representation) space rather than in pixel space (a minimal training-step sketch follows this list).
  - V-JEPA: A recent development from Meta's FAIR lab, representing a shift toward vision-centric intelligence.
- generative-ai (for contrast): While generative approaches dominate text (via llm), critics argue they lack the reasoning depth of predictive-coding architectures.
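To make "visual tokens" concrete, here is a minimal sketch of the patch-tokenization step a Vision Transformer uses: the image is cut into fixed-size patches, and each patch is linearly projected into an embedding vector. The sizes (224-pixel images, 16-pixel patches, 768-dim embeddings) are conventional ViT defaults assumed for illustration, not taken from the sources above.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Turns an image into a sequence of patch embeddings ("visual tokens")."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # One strided convolution: kernel and stride both equal the patch size,
        # so each output position is the linear projection of one image patch.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):               # (batch, 3, 224, 224)
        x = self.proj(images)                # (batch, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (batch, 196, 768): 196 tokens

tokens = PatchTokenizer()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```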
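And a minimal sketch of the JEPA training step itself, loosely in the spirit of Meta's I-JEPA: a context encoder sees only the visible tokens, a predictor fills in latent embeddings for the masked positions, and the loss compares those predictions with the output of a frozen exponential-moving-average (EMA) target encoder. Everything here (sizes, the random mask, the tiny transformer stacks, the 0.996 momentum) is an illustrative assumption, not the published architecture.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, num_tokens, batch = 128, 196, 8

def small_transformer():
    layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

context_encoder = small_transformer()
predictor = small_transformer()                   # stand-in for the predictor net
target_encoder = copy.deepcopy(context_encoder)   # EMA copy, never backpropagated
for p in target_encoder.parameters():
    p.requires_grad_(False)
mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

tokens = torch.randn(batch, num_tokens, embed_dim)  # visual tokens (see above)
mask = torch.rand(num_tokens) < 0.5                 # positions to predict

# 1. Encode only the visible context tokens.
ctx = context_encoder(tokens[:, ~mask])

# 2. Splice a learnable mask token into the masked positions and let the
#    predictor fill in latent embeddings for them.
pred_in = torch.zeros(batch, num_tokens, embed_dim)
pred_in[:, ~mask] = ctx
pred_in[:, mask] = mask_token
pred = predictor(pred_in)[:, mask]

# 3. Targets are the EMA encoder's embeddings of the full input: the loss
#    lives in latent space, so nothing is ever reconstructed in pixels.
with torch.no_grad():
    target = target_encoder(tokens)[:, mask]

loss = F.smooth_l1_loss(pred, target)
loss.backward()  # gradients flow to context encoder, predictor, mask token

# After each optimizer step, the target encoder tracks the context encoder
# as an exponential moving average (momentum value assumed):
with torch.no_grad():
    for p_t, p in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(0.996).add_(p, alpha=0.004)
```

The contrast with generative modeling is step 3: the network never reconstructs pixels; it only has to agree with the target encoder about what the missing region means in representation space.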
Key Research & Philosophical Shifts
- Yann LeCun & Meta Research:
  - Advocacy for moving away from purely text-based llm architectures.
  - Central Thesis: “Language is not intelligence”; true AGI requires vision-based reasoning and world-modeling.
  - Emphasis on architectures that prioritize understanding physical reality over next-token prediction.
- Non-LLM Reasoning: Emerging focus on architectures that use visual perception, rather than linguistic patterns, as the primary driver of cognitive development.
Related Concepts
- Self-Supervised Learning
- computer-vision
- Predictive Coding
- neural-networks
References & Logs
- 2026-04-14: New paper on a vision-based approach to AGI, not LLMs.