Visual Capability
Latest evaluation set: the visual capabilities of almost all large models fall short of a 3-year-old child's
Guan Cha Zhe Wang· 2026-01-12 12:30
Core Insights
- The latest evaluation of multimodal models reveals that most top models perform significantly below the level of a 3-year-old child in visual tasks, with only one model barely exceeding this baseline [1][4][10]

Group 1: Evaluation Results
- The BabyVision evaluation set was designed to assess the core visual capabilities of large models, with results indicating that the majority of models scored well below the average level of 3-year-old children [1][4]
- The best-performing model, Gemini3-Pro-Preview, exceeded the 3-year-old baseline only by a small margin and still lagged approximately 20 percentage points behind 6-year-old children [4][8]

Group 2: Model Limitations
- The significant performance gap is attributed to the models' reliance on language reasoning, which masks their deficiencies in processing visual information [3][10]
- The evaluation identified four categories of visual capability where models showed systemic deficiencies: fine discrimination, visual tracking, spatial perception, and visual pattern recognition [10][12]

Group 3: Specific Challenges
- Models struggle with non-verbal details, losing critical visual information when tasks are translated into language descriptions [12][19]
- In trajectory-tracking tasks, models fail to maintain continuity and often reach incorrect conclusions when paths intersect [14][19]
- Spatial imagination is another weakness: models rely on language rather than maintaining a mental representation of three-dimensional structures [14][19]

Group 4: Future Directions
- The research team suggests that to advance multimodal intelligence, future models must fundamentally rebuild their visual capabilities rather than relying on language reasoning [21]
"The visual capabilities of almost all large models fall short of a 3-year-old child's"
Guan Cha Zhe Wang· 2026-01-12 12:21
Core Insights
- The latest results from the BabyVision multimodal understanding assessment indicate that most leading multimodal models perform significantly below the level of a 3-year-old child in visual tasks, with only one model barely exceeding the 3-year-old baseline [1][4].

Group 1: Evaluation Results
- The BabyVision-Mini test included 20 visual-centric tasks designed to minimize language dependency, with answers requiring solely visual information [4].
- Most top models scored well below the average level of 3-year-old children; the best-performing model, Gemini3-Pro-Preview, only slightly surpassed the 3-year-old baseline while still lagging approximately 20 percentage points behind 6-year-olds [4][9].

Group 2: Model Performance
- In the BabyVision-Full evaluation, human participants with undergraduate backgrounds achieved 94.1% accuracy, while the best-performing model, Gemini3-Pro-Preview, reached only 49.7% [8][9] (a schematic of this kind of accuracy scoring is sketched after this summary).
- Open-source models performed even worse: the strongest scored below 22.2%, and the others scored between 12% and 19% [9].

Group 3: Systemic Visual Capability Deficiencies
- The evaluation highlighted four major categories of visual capability deficiencies in large models: fine discrimination, visual tracking, spatial perception, and visual pattern recognition, indicating a systemic lack of foundational visual abilities [10].
- Specific challenges include the inability to process non-verbal details, difficulty with trajectory tracking, a lack of spatial imagination, and weak inductive reasoning over visual patterns [12][14][16].

Group 4: Implications for Future Development
- The research team noted that many test questions have "unspeakable" characteristics: they cannot be fully expressed in language without losing critical information, which leads to reasoning errors in models [18].
- The team argues that future models must fundamentally rebuild visual capabilities rather than relying on language reasoning, since a robot with visual abilities below those of a 3-year-old could not reliably assist humans in the physical world [20].
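For context on what these accuracy figures measure, here is a minimal sketch of how a visual-centric benchmark of this kind can be scored. The `VisualTask` fields and the `query_model` function are hypothetical illustrations; the article does not describe BabyVision's actual harness.

```python
# Minimal sketch of a visual-centric benchmark scoring loop, in the spirit
# of BabyVision as described above. All field names and query_model() are
# hypothetical illustrations, not the benchmark's actual interface.
from dataclasses import dataclass

@dataclass
class VisualTask:
    image_path: str     # the stimulus; the answer depends only on visual content
    question: str       # kept deliberately simple to minimize language dependency
    choices: list[str]  # multiple-choice options
    answer: str         # ground-truth label

def query_model(task: VisualTask) -> str:
    """Placeholder for a call to a multimodal model; returns one of task.choices."""
    raise NotImplementedError  # hypothetical: wire up a real model client here

def accuracy(tasks: list[VisualTask]) -> float:
    """Fraction of tasks answered correctly, the metric reported in the article."""
    correct = sum(query_model(t) == t.answer for t in tasks)
    return correct / len(tasks)

# Reference points reported above for BabyVision-Full:
#   human participants (undergraduate): 94.1% accuracy
#   best model (Gemini3-Pro-Preview):   49.7% accuracy
```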
GPT-5.2 arrives: the first "expert-level" AI gets its revenge, and overworked office workers are finally saved
36Ke· 2025-12-11 23:58
Core Insights
- OpenAI has launched GPT-5.2, positioned as its most powerful general-purpose AI model, designed to tackle complex knowledge-based tasks effectively [1][4].

Model Overview
- Three versions of GPT-5.2 have been released: GPT-5.2 Instant, GPT-5.2 Thinking, and GPT-5.2 Pro [2].
- GPT-5.2 shows significant improvements over its predecessor, GPT-5.1, in general intelligence, long-text comprehension, tool use, and visual capabilities [6].

Performance Metrics
- GPT-5.2 posted strong results across benchmarks:
  - SWE-Bench Pro: 55.6% accuracy, a 4.8% increase over GPT-5.1 [7].
  - ARC-AGI-2: 52.9% accuracy, outperforming all competitors [7].
  - GDPval: 70.9% of tasks completed successfully, surpassing human industry experts [11][27].
- On investment banking tasks, scores rose 9.3 percentage points, from 59.1% to 68.4% [33].

Context and Knowledge Updates
- GPT-5.2 features a context window of 400,000 tokens and a maximum output length of 128,000 tokens, allowing extensive text processing [19].
- The knowledge cutoff has been updated to August 31, 2025 [19].

Cost Implications
- Pricing for GPT-5.2 has increased by 40% over GPT-5.1, reflecting the enhanced capabilities and higher computational costs of the new model (see the arithmetic sketch after this summary) [19][20].

Competitive Landscape
- The release of GPT-5.2 comes amid competition with Google's Gemini 3, although OpenAI executives have stated that the launch was not a direct response to that model [21].
- GPT-5.2 is marketed as the best model for professional knowledge work, capable of outperforming human experts on a range of tasks [25][29].
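To make the pricing note concrete, here is a small sketch of the arithmetic implied by a 40% price increase applied across the reported 400,000-token context window. The base rate is a placeholder assumption, not OpenAI's published pricing.

```python
# Illustrative cost arithmetic for the reported 40% price increase.
# BASE_RATE is a placeholder assumption, not OpenAI's published GPT-5.1 pricing.
BASE_RATE = 1.25               # hypothetical $ per 1M input tokens for GPT-5.1
GPT52_RATE = BASE_RATE * 1.40  # 40% higher, per the article

def request_cost(tokens: int, rate_per_million: float) -> float:
    """Dollar cost of a request at a given per-million-token rate."""
    return tokens / 1_000_000 * rate_per_million

# A single request filling GPT-5.2's reported 400,000-token context window:
print(request_cost(400_000, GPT52_RATE))  # prints ~0.70 (dollars, under these placeholder rates)
```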