DeepSeek's AI: Thinking with Visual Primitives for Precise Multimodal Reasoning

Generated: 2026-05-22 · API: Gemini 2.5 Flash · Modes: Summary

Clip title: DeepSeek’s New AI Is A Game Changer
Author / channel: Two Minute Papers
URL: https://www.youtube.com/watch?v=LpXhy2iiaQE

Summary

The video presents DeepSeek’s “Thinking with Visual Primitives,” an approach to multimodal reasoning. Where conventional systems typically reason about an image by describing it in text, DeepSeek’s method lets the model “point” at specific coordinates within the image, using bounding boxes or points, as part of its reasoning. This mirrors how people refer to visual data and addresses what the researchers term the “Perception Gap,” giving a more precise and less ambiguous account of visual layouts and relationships.
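To make the idea of pointing during reasoning concrete, here is a minimal sketch of how coordinate references could be pulled out of a reasoning trace and checked. The `<point>` and `<box>` tag syntax, the pixel coordinates, and every name in the code are assumptions made for illustration; the summary does not state how the model actually encodes its references.

```python
import re
from dataclasses import dataclass

# Hypothetical tag syntax, chosen only for this illustration.
POINT_RE = re.compile(r"<point>\s*(\d+)\s*,\s*(\d+)\s*</point>")
BOX_RE = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


@dataclass(frozen=True)
class Point:
    x: int
    y: int


@dataclass(frozen=True)
class Box:
    x1: int
    y1: int
    x2: int
    y2: int

    def contains(self, p: Point) -> bool:
        """True if the point lies inside the box (edges included)."""
        return self.x1 <= p.x <= self.x2 and self.y1 <= p.y <= self.y2


def extract_primitives(trace: str):
    """Return every point and box referenced in a reasoning trace."""
    points = [Point(int(x), int(y)) for x, y in POINT_RE.findall(trace)]
    boxes = [Box(*map(int, groups)) for groups in BOX_RE.findall(trace)]
    return points, boxes


# Made-up trace: the model counts apples by pointing at each one.
trace = (
    "The first apple is at <point>120,340</point>, "
    "the second at <point>410,295</point>. "
    "Both sit inside the bowl <box>80,250,500,420</box>."
)

points, boxes = extract_primitives(trace)
# Counting distinct points avoids double counting a repeated reference.
count = len(set(points))
all_in_bowl = all(boxes[0].contains(p) for p in points)
```

Because each counted object is tied to a coordinate, the count can be checked against the image instead of being taken on trust from a text description.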

The framework offers three main advantages. First, it improves accuracy on tasks that require fine-grained visual understanding, such as counting objects in a complex image, because the model references each object’s spatial location directly. Second, it is efficient: it uses approximately 90% fewer visual tokens than other advanced models, which leads to faster processing and lower computational cost. Third, it handles complex tasks such as topological reasoning: in the video’s demonstration it navigates a maze and visually traces its entire thought process, which makes its decision-making more interpretable and transparent.
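One practical consequence of a coordinate-based trace is that it can be verified mechanically. The sketch below checks a maze path given as a list of grid cells. The grid encoding (`#` for walls), the `(row, column)` cell format, and the function name are assumptions for illustration, not the representation used in the paper.

```python
def is_valid_trace(grid, path, start, goal):
    """Check that a traced path is a legal walk from start to goal.

    grid  : list of equal-length strings, '#' marks a wall
    path  : list of (row, col) cells in the order they were visited
    """
    if not path or path[0] != start or path[-1] != goal:
        return False

    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        inside = 0 <= r < rows and 0 <= c < cols
        if not inside or grid[r][c] == "#":
            return False

    # Consecutive cells must be direct neighbours (no jumps, no diagonals).
    return all(
        abs(r1 - r2) + abs(c1 - c2) == 1
        for (r1, c1), (r2, c2) in zip(path, path[1:])
    )


maze = [
    "S.#",
    ".##",
    "..G",
]
start, goal = (0, 0), (2, 2)

good_path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
through_wall = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
teleport = [(0, 0), (2, 2)]

good_ok = is_valid_trace(maze, good_path, start, goal)
wall_ok = is_valid_trace(maze, through_wall, start, goal)
jump_ok = is_valid_trace(maze, teleport, start, goal)
```

A purely textual answer such as "go down, then right" leaves room for ambiguity, whereas a list of cells either passes this check or fails it.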

The technical core of “Thinking with Visual Primitives” is an “On-Policy Distillation” strategy: a compact “student” model is trained by consolidating knowledge from multiple specialized “expert” models, each of which excels at a particular visual reasoning skill. The resulting model matches or exceeds many “billion-dollar frontier models” on various public benchmarks. The researchers deliberately excluded custom “in-house benchmarks” from their primary evaluation, which lends their results more credibility because they are measured against widely accepted standards.
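The summary does not give the training objective, so the following is only a generic sketch of on-policy distillation, not DeepSeek’s implementation. It assumes a per-token reverse KL divergence, which is one common choice, and it uses toy logit functions in place of real models. The `experts` dictionary keyed by task name is likewise a hypothetical way to model "one expert per skill."

```python
import math
import random


def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for a single token position."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
    )


def on_policy_distill_loss(student_fn, experts, task, prompt, steps, rng):
    """Average reverse KL along a sequence sampled from the student.

    student_fn and each expert map a token prefix to a list of logits.
    """
    teacher_fn = experts[task]  # pick the expert for this skill
    prefix = list(prompt)
    total = 0.0
    for _ in range(steps):
        s_probs = softmax(student_fn(prefix))
        t_probs = softmax(teacher_fn(prefix))
        total += reverse_kl(s_probs, t_probs)
        # On-policy: the next token is sampled from the student, so the
        # teacher is always scoring states the student actually reaches.
        token = rng.choices(range(len(s_probs)), weights=s_probs)[0]
        prefix.append(token)
    return total / steps


# Toy stand-ins for real models: fixed logits over a 3-token vocabulary.
def student(prefix):
    return [0.0, 1.0, 2.0]


experts = {
    "counting": lambda prefix: [0.0, 1.0, 2.0],  # agrees with the student
    "maze": lambda prefix: [2.0, 1.0, 0.0],      # disagrees with the student
}

rng = random.Random(0)
loss_counting = on_policy_distill_loss(student, experts, "counting", [0], 5, rng)
loss_maze = on_policy_distill_loss(student, experts, "maze", [0], 5, rng)
```

In a real training loop this loss would be minimized with respect to the student’s parameters. The point of the sketch is the sampling step: unlike distillation on a fixed dataset, the student is corrected on its own outputs.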

The approach has limitations. The model currently requires explicit “trigger words” to activate its pointing mechanism, rather than deciding autonomously when to use it. Bounding boxes improve precision, but they may still be insufficient for extremely fine-grained tasks such as counting individual strands of hair. The generalization of its topological reasoning to entirely novel situations also has room for improvement. Nevertheless, the research is a step toward more precise and understandable multimodal intelligence: it argues that “less is more” in visual processing when the referential mechanism is refined, and it highlights the importance of free, open-source AI advancements.

Video Description & Links

Description

❤️ Check out Lambda here and sign up for their GPU Cloud: https://lambda.ai/papers

📝 The paper is available here:
https://github.com/ailuntx/Thinking-with-Visual-Primitives
https://huggingface.co/datasets/NodeLinker/deepseek-ai-Thinking-with-Visual-Primitives-deleted-repo/blob/main/Thinking_with_Visual_Primitives.pdf

Our Patreon if you wish to support us: https://www.patreon.com/TwoMinutePapers

🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible: Adam Bridges, Benji Rabhan, B Shang, Cameron Navor, Charles Ian Norman Venn, Christian Ahlin, Eric T, Fred R, Gordon Child, Juan Benet, Michael Tedder, Owen Skarpness, Richard Sundvall, Ryan Stankye, Shawn Becker, Steef, Taras Bobrovytsky, Tazaur Sagenclaw, Tybie Fitzhugh, Ueli Gallizzi

My research: https://cg.tuwien.ac.at/~zsolnai/
Thumbnail design: https://felicia.hu

Tags

deepseek

URLs

Visual Primitives — Wikipedia
Multimodal Reasoning — Wikipedia
DeepSeek AI — Wikipedia
Bounding Boxes — Wikipedia
Thinking with Visual Primitives — Wikipedia
Perception Gap — Wikipedia
On-Policy Distillation — Wikipedia
Student-Expert Model — Wikipedia
Visual Tokens — Wikipedia
Topological Reasoning — Wikipedia
Spatial Awareness — Wikipedia
Model Interpretability — Wikipedia
Frontier Models — Wikipedia
Open-Source AI — Wikipedia
Fine-Grained Visual Understanding — Wikipedia
Trigger Words — Wikipedia
Computational Efficiency — Wikipedia
Benchmarking — Wikipedia

Two Minute Papers — Wikipedia
DeepSeek — Wikipedia
Gemini 2.5 Flash — Wikipedia
Lambda — Wikipedia
Hugging Face — Wikipedia
GitHub — Wikipedia
Patreon — Wikipedia
Adam Bridges — Wikipedia
Benji Rabhan — Wikipedia
Charles Ian Norman Venn — Wikipedia
Juan Benet — Wikipedia
Felicia — Wikipedia
