Core Insights
- The latest evaluation of multimodal models reveals that most top models perform significantly below the level of a 3-year-old child on visual tasks, with only one model barely exceeding this baseline [1][4][10]

Group 1: Evaluation Results
- The BabyVision evaluation set was designed to assess the core visual capabilities of large models; results indicate that the majority of models scored well below the average level of 3-year-old children [1][4]
- The best-performing model, Gemini3-Pro-Preview, exceeded the 3-year-old baseline only by a small margin and still lagged roughly 20 percentage points behind 6-year-old children [4][8]

Group 2: Model Limitations
- The performance gap is attributed to the models' reliance on language reasoning, which masks their deficiencies in processing visual information [3][10]
- The evaluation identified systemic deficiencies in four categories of visual capability: fine discrimination, visual tracking, spatial perception, and visual pattern recognition [10][12]

Group 3: Specific Challenges
- Models struggle with non-verbal detail: critical visual information is lost when tasks are translated into language descriptions [12][19]
- In trajectory-tracking tasks, models fail to maintain continuity and often reach incorrect conclusions when trajectories intersect [14][19]
- Spatial imagination is another weakness: models rely on language rather than maintaining a mental representation of three-dimensional structure [14][19]

Group 4: Future Directions
- The research team argues that advancing multimodal intelligence will require fundamentally rebuilding models' visual capabilities rather than relying on language reasoning [21]
New evaluation set: nearly all large models have visual abilities below those of a 3-year-old child
Guan Cha Zhe Wang · 2026-01-12 12:30