Yann LeCun’s Argument: World Models for True, Adaptive AI Beyond LLMs

Generated: 2026-06-14 · API: Gemini 2.5 Flash · Modes: Summary


Clip title: Yann LeCun: World Models: Enabling the next AI revolution
Author / channel: Computer Vision and Geometry Group, ETH Zurich
URL: https://www.youtube.com/watch?v=72Xj8k5WQX4

Summary

Yann LeCun’s presentation, “World Models: Enabling the next AI revolution,” critiques the current state of artificial intelligence, particularly Large Language Models (LLMs), arguing that they lack the common sense and adaptability of humans and animals. He highlights Moravec’s Paradox: tasks that are hard for humans (such as complex math) are comparatively easy for AI, while tasks that seem simple to humans (such as navigating a messy room or anticipating how objects behave physically) remain very hard for machines. LeCun defines intelligence not as an accumulation of declarative knowledge or skills but as the ability to learn quickly and adapt to new situations with little or no prior training, a capability that current AI systems lack.

The core challenge, according to LeCun, lies in how AI systems learn from real-world data. LLMs are trained on trillions of tokens of human-produced text, a volume that would take a human hundreds of thousands of years to read. Yet a four-year-old child’s sensory experience (vision, touch, etc.) amounts to an equivalent, if not greater, volume of data, and that data is richer, continuous, and highly redundant. This redundancy, he argues, is a feature rather than a bug: it is what makes self-supervised learning possible. Current generative models, however, struggle with continuous, high-dimensional data. Any given input admits infinitely many plausible futures, so a model forced to generate every pixel or detail produces blurry or unrealistic predictions. Humans build abstract mental models of the world, whereas LLMs rely largely on token-level prediction without a grounded understanding of physical reality.
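The data-volume comparison above can be checked with back-of-envelope arithmetic. The sketch below is only illustrative: every constant in it (corpus size, bytes per token, reading speed, waking hours, visual data rate) is an assumption chosen for the example and is not a figure stated in this summary.

```python
# All constants are illustrative assumptions, not numbers taken from the summary.
TOKENS = 30e12                 # assumed LLM training-set size, in tokens
BYTES_PER_TOKEN = 3            # assumed average bytes per token
WORDS_PER_TOKEN = 0.75         # assumed words per token
WORDS_PER_MINUTE = 250         # assumed adult reading speed
READING_HOURS_PER_DAY = 8      # assumed daily reading time
WAKING_HOURS = 16_000          # assumed waking hours in a child's first four years
VISUAL_BYTES_PER_SECOND = 2e6  # assumed visual data rate reaching the brain


def text_bytes(tokens=TOKENS, bytes_per_token=BYTES_PER_TOKEN):
    """Total size of the text corpus in bytes."""
    return tokens * bytes_per_token


def reading_years(tokens=TOKENS):
    """Years needed to read the corpus at the assumed pace."""
    minutes = tokens * WORDS_PER_TOKEN / WORDS_PER_MINUTE
    return minutes / 60 / READING_HOURS_PER_DAY / 365


def visual_bytes(hours=WAKING_HOURS, rate=VISUAL_BYTES_PER_SECOND):
    """Total visual input in bytes over the assumed waking hours."""
    return hours * 3600 * rate


if __name__ == "__main__":
    print(f"text corpus:   {text_bytes():.2e} bytes")
    print(f"reading time:  {reading_years():,.0f} years")
    print(f"child, vision: {visual_bytes():.2e} bytes")
```

Under these assumptions both quantities land on the order of 10^14 bytes, and the reading time comes out at roughly half a million years, which is consistent with the orders of magnitude described above. Changing any constant shifts the result, so the comparison should be read as a rough scale argument rather than a measurement.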

LeCun proposes a paradigm shift toward objective-driven agents built around “World Models”: causal models of the environment learned in abstract representation spaces. Instead of producing an output in a single feed-forward pass by predicting the next pixel or token, such an agent performs inference through search and optimization. It predicts the consequences of imagined actions and optimizes the action sequence to satisfy an objective. His favored architecture is the Joint-Embedding Predictive Architecture (JEPA), which learns abstract representations by predicting only the relevant and predictable aspects of future states and discarding irrelevant details. A crucial part of training JEPA is preventing “collapse,” in which the model learns trivial representations, for example mapping every input to the same output so that prediction becomes easy but uninformative. LeCun advocates “information maximization” (using techniques such as SIGReg) to encourage representations that are maximally informative and disentangled. This approach connects to Energy-Based Models (EBMs), which frame learning as shaping an energy landscape in which compatible data points have low energy and incompatible ones have high energy.
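Inference through search and optimization can be illustrated with a toy planner. This is a minimal sketch under stated assumptions, not LeCun's implementation: `predict` is a hypothetical hand-written stand-in for a learned predictor, the 2-D state stands in for an abstract representation, `energy` is an assumed squared-distance objective, and random shooting is just one simple search method chosen for brevity.

```python
import random


def predict(state, action):
    """Hypothetical stand-in for a learned predictor: next state given an action."""
    return (state[0] + action[0], state[1] + action[1])


def energy(state, goal):
    """Assumed objective: squared distance between a predicted state and the goal."""
    return (state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2


def rollout(state, actions):
    """Imagine the consequences of an action sequence without executing it."""
    for action in actions:
        state = predict(state, action)
    return state


def plan(state, goal, horizon=5, candidates=500, max_step=1.0, seed=0):
    """Random-shooting search: sample action sequences, keep the lowest-energy one."""
    rng = random.Random(seed)
    best_actions = [(0.0, 0.0)] * horizon  # baseline: do nothing
    best_cost = energy(rollout(state, best_actions), goal)
    for _ in range(candidates):
        actions = [
            (rng.uniform(-max_step, max_step), rng.uniform(-max_step, max_step))
            for _ in range(horizon)
        ]
        cost = energy(rollout(state, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


if __name__ == "__main__":
    actions, cost = plan(state=(0.0, 0.0), goal=(2.0, -1.0))
    print(f"best imagined final energy: {cost:.4f}")
```

The point of the sketch is the control flow: the agent never outputs an action directly, but evaluates imagined futures with the predictor and selects the action sequence that minimizes the objective. A practical system would replace random shooting with a gradient-based or more sample-efficient optimizer.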

LeCun presents evidence from his lab’s work on LeWorldModel and V-JEPA. The action-conditioned world models allow robots to plan in simulated environments. V-JEPA, trained entirely self-supervised on unlabeled video, learns intuitive physics (object permanence, stability, gravity) and a degree of common sense, which shows up as a spike in its prediction error when physically impossible events occur. It also performs strongly on downstream tasks such as depth estimation and semantic segmentation. LeCun concludes with a set of provocative recommendations for fellow AI scientists: abandon generative models, probabilistic models, and (for certain tasks) contrastive methods in favor of joint-embedding architectures, energy-based models, and regularized methods, respectively. He also suggests minimizing reliance on reinforcement learning because of its sample inefficiency. Most controversially, he states that researchers interested in advancing human-level, grounded AI should not work on LLMs. Instead, the focus should be on building universal causal models of complex physical systems, including humans themselves, using hierarchical JEPA World Models.
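The idea of prediction error spiking on unphysical events can be sketched with a toy example. The code below is an illustration only: a hand-written constant-velocity predictor stands in for a learned model such as V-JEPA, the 1-D `track` stands in for a sequence of learned representations, and the threshold is an arbitrary assumed value.

```python
def prediction_errors(track):
    """Absolute error of a constant-velocity predictor at each step from index 2 on."""
    errors = []
    for t in range(2, len(track)):
        velocity = track[t - 1] - track[t - 2]
        predicted = track[t - 1] + velocity
        errors.append(abs(track[t] - predicted))
    return errors


def surprising_steps(track, threshold=1.0):
    """Indices of the steps whose prediction error exceeds the threshold."""
    return [
        t
        for t, error in enumerate(prediction_errors(track), start=2)
        if error > threshold
    ]


if __name__ == "__main__":
    plausible = [0, 1, 2, 3, 4, 5]    # object moves smoothly
    impossible = [0, 1, 2, 3, 9, 10]  # object "teleports" at step 4
    print("plausible: ", surprising_steps(plausible))
    print("impossible:", surprising_steps(impossible))
```

In this toy setup the smooth track raises no flags, while the teleporting track is flagged at the jump and at the following step, because the predictor's velocity estimate is briefly corrupted by the jump. A learned model would measure the error in its representation space rather than on raw positions.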

Description

Talk given by Yann LeCun at ETH Zürich during “Frontiers of Embodied AI”.