Generated: 2026-05-22 · API: Gemini 2.5 Flash · Modes: Summary
DeepSeek’s AI: Thinking with Visual Primitives for Precise Multimodal Reasoning
Clip title: DeepSeek’s New AI Is A Game Changer
Author / channel: Two Minute Papers
URL: https://www.youtube.com/watch?v=LpXhy2iiaQE
Summary
The video presents DeepSeek’s “Thinking with Visual Primitives,” an approach to multimodal AI in which the model “points” at specific coordinates within an image (using bounding boxes or points) as part of its reasoning, instead of only describing what it sees in words. This mimics how humans interact with visual data and, according to the researchers, bridges what they term the “Perception Gap,” enabling a more precise and less ambiguous understanding of visual layouts and relationships.
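To make the idea of “pointing” during reasoning concrete, here is a minimal sketch of how coordinates embedded in a reasoning trace could be pulled out and used. The `<box>` and `<point>` tag syntax, the `Box` and `Point` classes, and the `extract_primitives` function are all hypothetical and chosen for illustration; the summary does not specify the format the model actually emits.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical layout)."""
    x1: int
    y1: int
    x2: int
    y2: int


@dataclass(frozen=True)
class Point:
    """Single pixel coordinate (hypothetical layout)."""
    x: int
    y: int


# Hypothetical tag syntax: <box>x1,y1,x2,y2</box> and <point>x,y</point>.
_PRIMITIVE = re.compile(r"<(box|point)>(.*?)</\1>")


def extract_primitives(trace: str) -> list:
    """Return the boxes and points referenced in a reasoning trace, in order."""
    found = []
    for kind, body in _PRIMITIVE.findall(trace):
        nums = [int(n) for n in re.findall(r"\d+", body)]
        if kind == "box" and len(nums) == 4:
            found.append(Box(*nums))
        elif kind == "point" and len(nums) == 2:
            found.append(Point(*nums))
        # Malformed primitives are skipped rather than guessed at.
    return found


if __name__ == "__main__":
    trace = (
        "The first apple is at <box>10,20,50,60</box>, "
        "the second at <box>70,20,110,60</box>."
    )
    boxes = [p for p in extract_primitives(trace) if isinstance(p, Box)]
    print(len(boxes), "apples referenced by location")
```

Because every referenced object carries explicit coordinates, a count can be checked against the list of boxes rather than against a verbal description alone.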
The video highlights several advantages of this framework. By referencing spatial locations directly, it improves accuracy on tasks that require fine-grained visual understanding, such as counting objects in a complex image. It is also reported to use approximately 90% fewer visual tokens than other advanced models, which reduces processing time and computational cost. Finally, the system handles topological reasoning tasks: it is shown navigating a maze while visually tracing its entire thought process, which makes its decision-making more interpretable and transparent.
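As an illustration of what a point-by-point maze trace looks like, the sketch below solves a small grid maze and returns the route as an ordered list of coordinates. Breadth-first search is used here purely as a stand-in: it is a classical algorithm, not the model's method, and the `trace_maze` function and the `S`/`E`/`#` grid encoding are assumptions made for this example.

```python
from collections import deque


def trace_maze(grid):
    """Return the shortest S-to-E route as a list of (row, col) points.

    grid is a list of equal-length strings: 'S' start, 'E' exit,
    '#' wall, anything else open floor. Returns [] if no route exists.
    """
    rows, cols = len(grid), len(grid[0])
    start = goal = None
    for r, line in enumerate(grid):
        for c, ch in enumerate(line):
            if ch == "S":
                start = (r, c)
            elif ch == "E":
                goal = (r, c)
    if start is None or goal is None:
        raise ValueError("grid needs both an 'S' and an 'E' cell")

    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))

    if goal not in parent:
        return []

    # Walk back from the exit to the start, then reverse.
    path = []
    cell = goal
    while cell is not None:
        path.append(cell)
        cell = parent[cell]
    return path[::-1]


if __name__ == "__main__":
    for point in trace_maze(["S.#", "#.#", "#.E"]):
        print(point)
```

Each entry in the returned list is a location that can be drawn on the image, which is the property that makes a traced route easy to inspect.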
The technical core of “Thinking with Visual Primitives” is an “On-Policy Distillation” strategy: a compact “student” model is trained by consolidating knowledge from multiple specialized “expert” AI models, each of which excels at a particular visual reasoning skill. According to the video, the resulting model matches or exceeds many “billion-dollar frontier models” on various public benchmarks. The researchers deliberately excluded custom “in-house benchmarks” from their primary evaluation, which lends their results greater credibility because they are tested against widely accepted standards.
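The summary gives no training details, so the following is only a toy sketch of one common formulation of on-policy distillation: the student is scored on its own output distribution and pulled toward an expert by minimizing the reverse KL divergence KL(student || expert). Everything here is an assumption for illustration, including the three-token vocabulary, the task names, the per-task routing to a single expert, and the function names. With such a tiny vocabulary the expectation under the student's distribution is computed exactly; a real system would sample sequences from the student instead.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student, expert):
    """KL(student || expert): the expectation is under the student's own outputs."""
    return sum(s * (math.log(s) - math.log(t))
               for s, t in zip(student, expert) if s > 0)


def distill_step(logits, expert, lr=0.5):
    """One gradient-descent step on KL(softmax(logits) || expert)."""
    s = softmax(logits)
    kl = reverse_kl(s, expert)
    # Analytic gradient of the reverse KL with respect to each logit.
    grad = [si * (math.log(si) - math.log(ti) - kl)
            for si, ti in zip(s, expert)]
    return [z - lr * g for z, g in zip(logits, grad)]


def distill(student_logits, experts, steps=200, lr=0.5):
    """Train one student (a dict of per-task logits) against several experts."""
    for _ in range(steps):
        for task, expert in experts.items():
            student_logits[task] = distill_step(student_logits[task], expert, lr)
    return student_logits


if __name__ == "__main__":
    # Hypothetical experts: fixed output distributions, one per skill.
    experts = {"counting": [0.7, 0.2, 0.1], "maze": [0.1, 0.1, 0.8]}
    student = {task: [0.0, 0.0, 0.0] for task in experts}
    distill(student, experts)
    for task in experts:
        print(task, [round(p, 3) for p in softmax(student[task])])
```

The design point the toy preserves is that a single student absorbs several specialists: each expert only supervises the tasks it is good at, while the student ends up covering all of them.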
The approach also has limitations. The AI currently needs explicit “trigger words” to activate its pointing mechanism and does not decide on its own when to use it. Bounding boxes improve precision, but they may still be insufficient for extremely fine-grained tasks such as counting individual strands of hair. The generalization of its topological reasoning to entirely novel situations also has room for improvement. Nevertheless, the video frames this research as a crucial step toward more precise and understandable multimodal intelligence: it advocates a “less is more” approach to visual processing through refined referential mechanisms, and highlights the importance of free, open-source AI advancements.
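The trigger-word limitation can be pictured as a simple gate in front of the pointing mechanism. The sketch below is an assumption-laden illustration: the phrases in `TRIGGER_PHRASES` and the `pointing_enabled` function are hypothetical, since the summary does not list the actual trigger words or say how they are detected.

```python
import re

# Hypothetical trigger phrases; the real ones are not given in the source.
TRIGGER_PHRASES = ("point to", "locate", "draw a box")


def pointing_enabled(prompt: str, triggers=TRIGGER_PHRASES) -> bool:
    """Return True only if the prompt contains an explicit trigger phrase."""
    text = prompt.lower()
    return any(
        re.search(r"\b" + re.escape(phrase) + r"\b", text)
        for phrase in triggers
    )


if __name__ == "__main__":
    for prompt in ("Please point to every apple.", "How many apples are there?"):
        mode = "pointing" if pointing_enabled(prompt) else "text-only"
        print(f"{prompt!r} -> {mode}")
```

The second prompt would benefit from pointing just as much as the first, yet it stays in text-only mode because no trigger phrase appears; that gap is what an autonomous decision about when to point would close.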
Video Description & Links
Description
❤️ Check out Lambda here and sign up for their GPU Cloud: https://lambda.ai/papers
📝 The paper is available here: https://github.com/ailuntx/Thinking-with-Visual-Primitives https://huggingface.co/datasets/NodeLinker/deepseek-ai-Thinking-with-Visual-Primitives-deleted-repo/blob/main/Thinking_with_Visual_Primitives.pdf
Our Patreon if you wish to support us: https://www.patreon.com/TwoMinutePapers
🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible: Adam Bridges, Benji Rabhan, B Shang, Cameron Navor, Charles Ian Norman Venn, Christian Ahlin, Eric T, Fred R, Gordon Child, Juan Benet, Michael Tedder, Owen Skarpness, Richard Sundvall, Ryan Stankye, Shawn Becker, Steef, Taras Bobrovytsky, Tazaur Sagenclaw, Tybie Fitzhugh, Ueli Gallizzi
My research: https://cg.tuwien.ac.at/~zsolnai/ Thumbnail design: https://felicia.hu
Tags
ai, deepseek, deepseek ai
URLs
- https://lambda.ai/papers
- https://github.com/ailuntx/Thinking-with-Visual-Primitives
- https://huggingface.co/datasets/NodeLinker/deepseek-ai-Thinking-with-Visual-Primitives-deleted-repo/blob/main/Thinking_with_Visual_Primitives.pdf
- https://www.patreon.com/TwoMinutePapers
- https://cg.tuwien.ac.at/~zsolnai/
- https://felicia.hu
Related Concepts
- Visual Primitives — Wikipedia
- Multimodal Reasoning — Wikipedia
- DeepSeek AI — Wikipedia
- Bounding Boxes — Wikipedia
- Thinking with Visual Primitives — Wikipedia
- Perception Gap — Wikipedia
- On-Policy Distillation — Wikipedia
- Student-Expert Model — Wikipedia
- Visual Tokens — Wikipedia
- Topological Reasoning — Wikipedia
- Spatial Awareness — Wikipedia
- Model Interpretability — Wikipedia
- Frontier Models — Wikipedia
- Open-Source AI — Wikipedia
- Fine-Grained Visual Understanding — Wikipedia
- Trigger Words — Wikipedia
- Computational Efficiency — Wikipedia
- Benchmarking — Wikipedia