Agentic Visual Reasoning: Enhancing VLMs for Precise Object Counting and Spatial Understanding

Clip title: Vision Models Can’t Count. Here’s the Fix.
Author / channel: Prompt Engineering
URL: https://www.youtube.com/watch?v=VFYnD1WREdU

Summary

This video introduces an agentic visual reasoning pipeline that enhances Vision Language Models (VLMs) by pairing them with an image segmentation model. It addresses the limitations of standalone VLMs, such as Google’s recently released Gemma 4, in tasks requiring precise object detection, counting, and spatial understanding. The proposed solution combines Gemma 4’s strong reasoning abilities with the precise segmentation of Falcon Perception, an efficient image segmentation model.

Google’s Gemma 4, available in various sizes and released under an Apache 2.0 license, is highlighted for its efficiency, allowing it to run locally on diverse hardware like mobile devices and personal computers. However, the video demonstrates that while Gemma 4 excels at general scene understanding and speed, it struggles with accurate object counting, providing precise spatial coordinates, and distinguishing individual instances, especially when reasoning about comparative quantities (e.g., “Are there more oranges than apples?”). To address these shortcomings, the project incorporates Falcon Perception, a compact (0.3 billion parameters) image segmentation model from the Technology Innovation Institute, which is noted for its ability to generate high-resolution binary masks and bounding boxes for detected objects.

The core of the solution is an “Agentic Pipeline Architecture” where Gemma 4 acts as a “Plan Router.” Upon receiving a user query and an image, Gemma 4 determines whether specific segmentation tasks are needed. If so, it dispatches tasks to Falcon Perception, which segments and identifies individual instances of objects. The annotated images and detected object data (bounding boxes, masks) are then fed back to Gemma 4 for more accurate visual reasoning, scene analysis, and answering complex queries. This iterative, agentic loop allows the system to perform tasks like accurately counting fruits, identifying dog breeds, and comparing the number of cars and people in a busy street scene, all while providing visual proof of its detections.
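The plan-router loop described above can be sketched in a few lines. This is a hypothetical illustration, not the video's actual code: `call_gemma` and `call_falcon_segmenter` are stand-in stubs for the real Gemma 4 and Falcon Perception calls, and the keyword-based routing and hard-coded detections are placeholders for the models' real behavior.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One segmented object instance (the real segmenter also returns a mask)."""
    label: str
    box: tuple  # (x1, y1, x2, y2) bounding box

def call_gemma(prompt: str, image=None, detections=None) -> dict:
    """Stand-in for the VLM. Without detections it acts as the 'plan router',
    deciding whether the query needs segmentation; with detections supplied,
    it reasons over the grounded results instead."""
    if detections is None:
        needs_seg = any(w in prompt.lower() for w in ("count", "how many", "more"))
        return {"needs_segmentation": needs_seg, "targets": ["apple", "orange"]}
    counts = {}
    for d in detections:
        counts[d.label] = counts.get(d.label, 0) + 1
    return {"answer": counts}

def call_falcon_segmenter(image, targets) -> list:
    """Stand-in for Falcon Perception: one Detection per object instance."""
    fake = {"apple": 3, "orange": 5}  # placeholder detections for this sketch
    return [Detection(t, (0, 0, 10, 10)) for t in targets for _ in range(fake.get(t, 0))]

def answer_query(image, query: str) -> dict:
    """One pass of the agentic loop: route, optionally segment, then reason."""
    plan = call_gemma(query, image)
    if plan["needs_segmentation"]:
        detections = call_falcon_segmenter(image, plan["targets"])
        return call_gemma(query, image, detections)
    return call_gemma(query, image, detections=[])
```

The key design point mirrored here is that the VLM is called twice: first as a router that decides whether grounding is needed, then again with the segmenter's per-instance output, so counting and comparison queries are answered from explicit detections rather than a single holistic glance.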

The key takeaway is that by intelligently combining a VLM with a dedicated, efficient image segmentation model in an agentic pipeline, AI systems can achieve a superior level of visual understanding and reasoning. This approach overcomes critical limitations of VLMs working in isolation, delivering improved accuracy in counting, precise spatial output, and better instance separation. Importantly, the entire “Gemma Vision Agent” pipeline can run locally on edge devices like Apple Silicon or NVIDIA GPUs, offering a powerful, accessible, and grounded AI solution for complex visual tasks, including potential future applications in real-time object tracking.