Core Insights

- Visual reasoning in current AI models, even Gemini 3 Pro Preview, remains far below human capability: roughly the level of a three-year-old child, with a 20% gap behind six-year-olds [1][7][4]
- Gemini 3 Pro Preview is the leading model among existing AI systems, outperforming GPT-5.2 and Claude 4.5 Opus, which perform below even a three-year-old [5][10]
- The research highlights the limits of current visual reasoning models, arguing for a fundamental reconstruction of visual capabilities rather than reliance on language-based translation of visual input [7][19]

Performance Comparison

- Among closed-source models, Gemini 3 Pro Preview leads with a score of 49.7%, followed by GPT-5.2 at 34.4% and Doubao-Seed-1.8 at 30.2% [10]
- Qwen3-VL-Plus, Grok-4, and Claude-4.5 Opus scored far lower: 19.2%, 16.2%, and 14.2% respectively [11]
- The best-performing open-source model, Qwen3VL-235B-Thinking, reached 22.2%, showing that even the largest open-source models cannot yet compete with the top closed-source systems [12][13]

Challenges in Visual Reasoning

- The research identifies four core challenges that multimodal large language models (MLLMs) face in visual reasoning:
1. Fine-grained Discrimination: difficulty detecting subtle visual differences [19]
2. Visual Tracking: inability to maintain perceptual consistency over long distances [22]
3. Spatial Perception: difficulty constructing stable three-dimensional representations from two-dimensional images [28]
4. Visual Pattern Recognition: difficulty generalizing rules from limited visual examples [34]

Proposed Solutions

- The study suggests two directions for improving visual reasoning capabilities:
1. Reinforcement Learning with Verifiable Rewards (RLVR): fine-tuning with this approach improved overall accuracy by roughly 4.8 percentage points, chiefly on fine-grained discrimination and spatial perception tasks [36]
2. Generative Modeling: the BabyVision-Gen benchmark evaluated three advanced visual generative models, with NanoBanana-Pro achieving the highest accuracy at 18.3% [38][39]

Future Trends

- The research points toward unified architectures that bypass the "language bottleneck," preserving high-fidelity visual representations during the reasoning process [44]
- Models like Bagel, Sora 2, and Veo 3 show that generative methods can serve as an advanced form of reasoning, underscoring the importance of maintaining visual integrity in AI systems [44]
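The core idea behind the "verifiable rewards" in RLVR mentioned above is that the reward signal comes from a programmatic check against known ground truth rather than a learned reward model. A minimal sketch of such a reward function (the helper names below are illustrative, not from the paper):

```python
# Minimal sketch of a verifiable reward, as used in RLVR-style fine-tuning.
# The function names (normalize, verifiable_reward) are hypothetical
# illustrations, not the benchmark's actual API.

def normalize(answer: str) -> str:
    """Canonicalize an answer string for exact-match comparison."""
    return answer.strip().lower()

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the model's answer matches the verified
    ground truth, else 0.0. No learned reward model is involved,
    which is what makes the reward 'verifiable'."""
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0

# Example: scoring a batch of multiple-choice visual-reasoning answers.
answers = ["B", " b ", "C"]
truths = ["B", "B", "A"]
rewards = [verifiable_reward(a, t) for a, t in zip(answers, truths)]
print(rewards)  # [1.0, 1.0, 0.0]
```

Because the reward is deterministic and cheap to compute, it scales to large fine-tuning runs without the reward-hacking risks of a learned reward model.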
Even the strongest large models' visual abilities fall short of a six-year-old child's
36Kr·2026-01-22 13:10