BabyVision Evaluation Set

Guan Cha Zhe Wang · 2026-01-12 12:30

Core Insights
- The latest evaluation of multimodal models reveals that most top models perform significantly below the level of a 3-year-old child in visual tasks, with only one model barely exceeding this baseline [1][4][10]

Group 1: Evaluation Results
- The BabyVision evaluation set was designed to assess the core visual capabilities of large models, with results indicating that the majority of models scored well below the average level of 3-year-old children [1][4]
- The best-performing model, Gemini3-Pro-Preview, exceeded the 3-year-old baseline only by a small margin and still lagged approximately 20 percentage points behind 6-year-old children [4][8]

Group 2: Model Limitations
- The significant disparity in performance is attributed to the models' reliance on language reasoning, which masks their deficiencies in processing visual information [3][10]
- The evaluation identified four categories of visual capability where models showed systemic deficiencies: fine discrimination, visual tracking, spatial perception, and visual pattern recognition [10][12]

Group 3: Specific Challenges
- Models struggle with non-verbal details, leading to a loss of critical visual information when tasks are translated into language descriptions [12][19]
- In trajectory tracking tasks, models fail to maintain continuity, often resulting in incorrect conclusions when faced with intersections [14][19]
- Spatial imagination is another area of weakness, as models rely on language rather than maintaining a mental representation of three-dimensional structures [14][19]

Group 4: Future Directions
- The research team suggests that to advance multimodal intelligence, future models must fundamentally rebuild their visual capabilities rather than relying on language reasoning [21]

Gemini3-Pro-Preview model

Qwen3VL-235B-Thinking model
