Core Insights
- The latest results from the BabyVision multimodal understanding assessment indicate that most leading multimodal models perform significantly below the level of a 3-year-old child on visual tasks, with only one model barely exceeding the 3-year-old baseline [1][4].

Group 1: Evaluation Results
- The BabyVision-Mini test comprised 20 vision-centric tasks designed to minimize language dependency, with answers derivable from visual information alone [4].
- Most top models scored well below the average level of 3-year-old children; the best performer, Gemini3-Pro-Preview, only slightly surpassed the 3-year-old baseline and still lagged roughly 20 percentage points behind 6-year-olds [4][9].

Group 2: Model Performance
- In the BabyVision-Full evaluation, human participants with undergraduate backgrounds achieved 94.1% accuracy, while the best-performing model, Gemini3-Pro-Preview, reached only 49.7% [8][9].
- Open-source models fared even worse: the strongest scored below 22.2%, and the others scored between 12% and 19% [9].

Group 3: Systemic Visual Capability Deficiencies
- The evaluation highlighted four major categories of visual capability deficiencies in large models: fine discrimination, visual tracking, spatial perception, and visual pattern recognition, indicating a systemic lack of foundational visual abilities [10].
- Model challenges include an inability to process non-verbal details, difficulty tracking trajectories, a lack of spatial imagination, and weak inductive reasoning over visual patterns [12][14][16].

Group 4: Implications for Future Development
- The research team noted that many test questions have an "unspeakable" quality: they cannot be fully expressed in language without losing critical information, which leads models into reasoning errors [18].
- The team suggests that future models must fundamentally rebuild visual capabilities rather than relying on language reasoning, as a robot with visual abilities below that of a 3-year-old would struggle to assist humans reliably in the physical world [20].
"Almost all large models have visual abilities inferior to a 3-year-old child"
Guan Cha Zhe Wang·2026-01-12 12:21