Core Insights

- Google DeepMind has introduced a significant new capability called Agentic Vision for Gemini 3 Flash, shifting how large language models handle visual input from passive guessing to active investigation [1][3][5].

Technology Overview

- Agentic Vision lets the model actively manipulate images in response to user requests through a "Think-Act-Observe" loop, improving its ability to analyze and interact with visual data [3][11].
- This capability yields a 5% to 10% improvement for Gemini 3 Flash across a range of visual benchmarks [6].

Practical Applications

- Developers can unlock new behaviors through code execution in the API; for example, PlanCheckSolver.com improved accuracy by 5% through iterative checks of high-resolution inputs [10].
- Agentic Vision supports image annotation: the model draws and labels directly on images, grounding its responses at pixel-level accuracy [13].
- The model can also perform visual mathematics and plotting, generating visual representations of data while avoiding common failure modes of standard large language models [15][16].

Future Prospects

- Google indicates that Agentic Vision is just the beginning: future updates will enhance implicit actions such as image rotation and visual mathematics, and explore additional tools for the Gemini model [20].

Competitive Landscape

- The release of Agentic Vision coincides with DeepSeek's launch of DeepSeek-OCR2, suggesting a competitive response in visual AI as both companies work to redefine machine vision [21][22].
- The competition centers on who can better define machine vision: DeepSeek emphasizes perception, while Google emphasizes interactive capabilities through code execution [23].
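The "Think-Act-Observe" loop described above can be illustrated with a minimal, stdlib-only sketch. This is not Google's implementation: the 2D-grid "image", the `think_act_observe` function, the area-based "resolution limit", and the sum-based quadrant heuristic are all hypothetical stand-ins for the real model's reasoning, cropping tool, and observation step.

```python
# Hypothetical sketch of a Think-Act-Observe loop over a toy "image"
# (a 2D grid of ints). Real Agentic Vision operates on actual pixels
# via model-driven code execution; this only mirrors the control flow.

def crop(image, top, left, height, width):
    """Act: zoom into a sub-region of the image."""
    return [row[left:left + width] for row in image[top:top + height]]

def best_quadrant(view):
    """Toy saliency: zoom into the quadrant with the largest pixel sum."""
    h, w = len(view) // 2, len(view[0]) // 2

    def qsum(r, c):
        return sum(sum(row[c:c + w]) for row in view[r:r + h])

    r, c = max([(r, c) for r in (0, h) for c in (0, w)],
               key=lambda rc: qsum(*rc))
    return crop(view, r, c, h, w)

def think_act_observe(image, target, readable_area=16, max_steps=5):
    """Loop: think (is the view readable?), act (crop), observe (re-check)."""
    view = image
    for step in range(1, max_steps + 1):
        if len(view) * len(view[0]) <= readable_area:
            # Observe: the view is small enough to "read" at pixel level.
            found = any(target in row for row in view)
            return {"found": found, "steps": step}
        # Act: the view is too large to read reliably, so zoom in.
        view = best_quadrant(view)
    return {"found": False, "steps": max_steps}
```

The point of the loop is that the model does not answer from a single low-resolution glance; it keeps acting on the image until the observation is good enough to ground an answer.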
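The image-annotation behavior (drawing and labeling directly on images) can likewise be sketched in miniature. Again this is an illustrative stand-in, not the actual tool: `draw_box` draws a rectangle outline onto a copy of a toy 2D grid, standing in for the model marking a region at pixel coordinates.

```python
# Hypothetical annotation primitive: outline a rectangle on a 2D int grid.
# The real capability draws/labels on actual images via code execution.

def draw_box(image, top, left, height, width, value=1):
    """Return a copy of the image with a rectangle outline drawn on it."""
    out = [row[:] for row in image]  # do not mutate the original image
    for c in range(left, left + width):
        out[top][c] = value                    # top edge
        out[top + height - 1][c] = value       # bottom edge
    for r in range(top, top + height):
        out[r][left] = value                   # left edge
        out[r][left + width - 1] = value       # right edge
    return out
```

Because the annotation is computed at explicit pixel coordinates rather than described in free text, the model's claim ("the defect is here") and the evidence (the drawn box) stay aligned.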
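The visual-mathematics point is that arithmetic and chart geometry are computed by executed code rather than predicted token by token, which is where standard LLMs often slip. A minimal, assumed-for-illustration example (the `ascii_bar_chart` helper is invented here; the real feature renders actual plots):

```python
# Hypothetical example: derive a chart from data with exact arithmetic,
# instead of having the model "guess" bar lengths in prose.

def ascii_bar_chart(data, width=20):
    """Render a dict of label -> value as horizontal ASCII bars."""
    peak = max(data.values())
    lines = []
    for label, v in data.items():
        bar = "#" * round(v / peak * width)   # exact, computed scaling
        lines.append(f"{label:>8} | {bar} {v}")
    return "\n".join(lines)
```
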
Source article: "Gemini 3 'opens its eyes' to pixel-level manipulation: Google's answer to DeepSeek-OCR2"