Core Insights
- The article discusses advances in large models' language and text reasoning, and highlights the need for models to understand visual information without relying on language. The BabyVision evaluation set was introduced to assess this capability [1][2].

Group 1: Evaluation of Visual Understanding
- BabyVision directly compared children of various ages (3, 6, 10, and 12 years) against top multimodal models on 20 vision-centric tasks, revealing that most models scored below the average of 3-year-old children [2][4].
- The only model that consistently exceeded the 3-year-old baseline was Gemini3-Pro-Preview, which still lagged roughly 20 percentage points behind 6-year-old children [4].

Group 2: Breakdown of Visual Abilities
- The research team split visual ability into four core categories: Visual Pattern Recognition, Fine-grained Discrimination, Visual Tracking, and Spatial Perception, with a total of 22 sub-tasks designed to quantify foundational visual skills [9][11].
- BabyVision was built through a rigorous data-collection process that drew on children's cognitive materials and visual development tests, yielding 388 high-quality visual questions [10][11].

Group 3: Performance Results
- In the BabyVision-Full evaluation, human participants achieved 94.1% accuracy, while the best-performing model, Gemini3-Pro-Preview, scored only 49.7%; most models fell in the 12-19% range [13].
- The performance gap was consistent across all four categories, indicating a systemic lack of foundational visual capability in the models [13].

Group 4: Challenges Identified
- The article identifies several challenges, including models' inability to process visual information without losing detail, which leads to errors on tasks requiring spatial imagination and visual pattern induction [15][23][26].
- Many BabyVision tasks are described as "unspeakable": they cannot be fully captured in language without losing critical visual information [15].

Group 5: Future Directions
- BabyVision-Gen was introduced to explore whether models can solve visual tasks the way children do, by generating images or videos as answers; it shows some movement toward human-like behavior but still lacks consistent accuracy [27][28].
- The value of BabyVision lies in breaking visual understanding down into measurable components, guiding the development of multimodal models toward true general intelligence and embodied intelligence [31].
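The summary describes BabyVision as 388 questions grouped into four core categories and reports both per-category and overall accuracy (e.g., 94.1% for humans, 49.7% for the best model). A benchmark of this shape is typically scored by aggregating correctness per category. A minimal sketch, assuming a hypothetical record format (the article does not specify BabyVision's actual data schema, and the example records below are illustrative, not real benchmark items):

```python
from collections import defaultdict

# Hypothetical record format: each item carries its core category,
# the model's predicted answer, and the gold answer.
results = [
    {"category": "Visual Pattern Recognition", "pred": "B", "gold": "B"},
    {"category": "Fine-grained Discrimination", "pred": "A", "gold": "C"},
    {"category": "Visual Tracking", "pred": "D", "gold": "D"},
    {"category": "Spatial Perception", "pred": "A", "gold": "A"},
]

def accuracy_by_category(records):
    """Return (per-category accuracy dict, overall accuracy) as fractions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        correct[r["category"]] += int(r["pred"] == r["gold"])
    per_cat = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_cat, overall

per_cat, overall = accuracy_by_category(results)
# With the toy records above: overall = 0.75, and the gap is visible
# per category (e.g., Fine-grained Discrimination = 0.0).
```

Reporting per-category scores alongside the overall number is what lets the article claim the gap is "consistent across all four categories" rather than driven by a single weak skill.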
Multimodal large models lose to a three-year-old? xbench x UniPat jointly release the new evaluation set BabyVision
红杉汇·2026-01-12 01:04