Vision-based AI

A subset of Artificial Intelligence focused on the development of systems capable of perceiving, interpreting, and understanding visual data (images, video, and 3D environments) to build world-models.

Core Architectures & Approaches

  • Transformer-based Models: The current standard for vision, processing images as sequences of visual tokens (e.g., Vision Transformers).
  • JEPA (Joint-Embedding Predictive Architecture):
    • A non-generative approach focused on predicting missing parts of a signal in latent space.
    • V-JEPA (Video JEPA): A recent development from Meta's FAIR lab representing a shift toward vision-centric intelligence.
  • generative-ai (for contrast): While dominant in text (via llm), critics argue that generative approaches lack the grounded reasoning found in predictive, latent-space architectures.
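The JEPA idea above — predict the *latent representation* of a masked part of the signal rather than its raw pixels — can be sketched in a few lines. This is a minimal, illustrative toy; all names, dimensions, and the linear encoder/predictor are assumptions, not Meta's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (illustrative only).
DIM_IN, DIM_LATENT = 16, 8

def encoder(x, W):
    """Map a raw patch into latent space."""
    return np.tanh(x @ W)

def predictor(z_ctx, P):
    """Predict the masked patch's latent from the context latent."""
    return z_ctx @ P

# Context encoder, target encoder (in practice an EMA copy), predictor.
W_ctx = rng.normal(0, 0.1, (DIM_IN, DIM_LATENT))
W_tgt = W_ctx.copy()
P = rng.normal(0, 0.1, (DIM_LATENT, DIM_LATENT))

def jepa_loss(x_context, x_target):
    """JEPA objective: compare latents, never reconstruct pixels."""
    z_ctx = encoder(x_context, W_ctx)
    z_tgt = encoder(x_target, W_tgt)   # treated as a fixed target (no gradient)
    z_pred = predictor(z_ctx, P)
    return float(np.mean((z_pred - z_tgt) ** 2))

# Two "patches" of one signal: the model sees x[0] and must
# predict the latent of the masked x[1].
x = rng.normal(size=(2, DIM_IN))
loss = jepa_loss(x[0], x[1])
print(loss)
```

The key design point, and the contrast with generative models, is that the loss lives entirely in latent space: the model is never asked to reproduce the high-dimensional input, only its abstract representation.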

Key Research & Philosophical Shifts

  • Yann LeCun & Meta Research:
    • Advocacy for moving away from purely text-based llm architectures.
    • Central Thesis: “Language is not intelligence”; true AGI requires visual-based reasoning and world-modeling.
    • Emphasis on architectures that prioritize understanding physical reality over next-token prediction.
  • Non-LLM Reasoning: Emerging focus on architectures that utilize visual perception as the primary driver for cognitive development rather than linguistic patterns.

References & Logs

  • 2026-04-14: New paper proposing a vision-based (non-LLM) approach to AGI
