Core Insights

- Visual reasoning in current AI models still lags far behind human ability: the best model, Gemini 3 Pro Preview, only slightly outperforms a three-year-old child and trails a six-year-old by about 20 percentage points [2][10]
- Gemini 3 Pro Preview posts the highest score among existing models, 49.7%, while other leading models such as GPT-5.2 and Claude 4.5 Opus fare even worse [6][14]
- The article argues that future models must rebuild visual capabilities from the ground up rather than rely on language-based translations of visual problems [11]

Performance Comparison

- Among closed-source models, Gemini 3 Pro Preview leads with 49.7%, followed by GPT-5.2 at 34.4% and Doubao-Seed-1.8 at 30.2% [14]
- Other models such as Qwen3-VL-Plus, Grok-4, and Claude-4.5-Opus scored significantly lower, indicating broad underperformance on visual reasoning tasks [15]
- The best open-source model, Qwen3VL-235B-Thinking, reached only 22.2%, still far behind the top closed-source systems [16]

Challenges in Visual Reasoning

- The article identifies four core challenges that multi-modal large language models (MLLMs) face in visual reasoning:
  1. Lack of Non-verbal Fine Details: MLLMs struggle to accurately perceive fine visual details that cannot be readily expressed in language [25]
  2. Loss of Manifold Consistency: MLLMs often fail to maintain perceptual consistency across long spatial extents, leading to errors in tasks involving spatial relationships [31]
  3. Spatial Imagination: MLLMs have difficulty constructing stable three-dimensional representations from two-dimensional images, which limits their ability to perform mental transformations [39]
  4. Visual Pattern Induction: MLLMs tend to count surface attributes rather than grasp the underlying change across visual examples, limiting their ability to generalize from a few examples [47]

Proposed Solutions

- The research suggests two directions for improving visual reasoning:
  1. Reinforcement Learning with Verifiable Rewards (RLVR): fine-tuning with this approach improved overall accuracy by 4.8 percentage points, particularly on fine-grained discrimination and spatial perception tasks (a minimal sketch of the verifiable-reward idea follows this summary) [56][58]
  2. Generative Model Approaches: the study introduces BabyVision-Gen, which evaluates generative models such as NanoBanana-Pro, GPT-Image-1.5, and Qwen-Image-Edit; success rates remain low, but some models exhibit explicit visual-thinking capabilities [60][62]

Future Directions

- The article concludes that overcoming the "language bottleneck" in visual reasoning is crucial, advocating unified architectures that retain high-fidelity visual representations throughout the reasoning process [68][70]
- Models like Bagel and Sora 2 show that generative methods can serve as an advanced form of reasoning, underscoring the importance of robust visual semantic understanding [71]
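The article reports the 4.8-point RLVR gain but not the training setup behind it. Below is a minimal, hypothetical Python sketch of the verifiable-reward idea for a multiple-choice visual benchmark; all names (`VisualSample`, `extract_choice`, `verifiable_reward`, the sample data) are illustrative assumptions, not details from the article.

```python
# Hypothetical sketch: a rule-checkable reward for RLVR fine-tuning on a
# multiple-choice visual-reasoning benchmark. Not the article's actual code.
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class VisualSample:
    question: str    # text of the visual-reasoning question
    image_path: str  # path to the associated image
    answer: str      # gold choice label, e.g. "B"


def extract_choice(completion: str) -> Optional[str]:
    """Pull the last standalone choice letter (A-D) out of a completion."""
    matches = re.findall(r"\b([A-D])\b", completion)
    return matches[-1] if matches else None


def verifiable_reward(completion: str, sample: VisualSample) -> float:
    """Binary reward: 1.0 iff the parsed answer matches the gold label.

    Each benchmark item has a single mechanically checkable answer, so no
    learned reward model is needed -- that is what makes the reward
    'verifiable'.
    """
    return 1.0 if extract_choice(completion) == sample.answer else 0.0


if __name__ == "__main__":
    sample = VisualSample(
        question="Which panel completes the visual pattern?",
        image_path="puzzle_001.png",  # illustrative path
        answer="B",
    )
    print(verifiable_reward("The missing panel is B", sample))   # 1.0
    print(verifiable_reward("I think the answer is C", sample))  # 0.0
```

In an actual RLVR loop (for example, a PPO- or GRPO-style policy update), each sampled completion for a prompt would be scored this way and the policy pushed toward high-reward completions; the rule-checkable binary reward, rather than a learned reward model, is what distinguishes RLVR from conventional RLHF.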
The Strongest Large Model's Visual Ability Falls Short of a Six-Year-Old
量子位 (QbitAI) · 2026-01-22 11:13