BabyVision
Top AI Loses to a Three-Year-Old: BabyVision Test Exposes Fundamental Flaws in Multimodal Models
机器之心· 2026-01-12 05:01
Core Viewpoint
- The article discusses the limitations of current large models in visual understanding, emphasizing that while they excel in language and text reasoning, their visual capabilities remain underdeveloped, akin to those of a three-year-old child [3][4][49].

Group 1: BabyVision Overview
- UniPat AI, in collaboration with Sequoia China and various research teams, has launched a new multimodal understanding evaluation set called BabyVision to assess the visual capabilities of AI models [3][4].
- BabyVision aims to create a new paradigm for AI training, evaluation, and application in real-world scenarios, focusing on making visual capabilities measurable and iteratively improvable [4][49].

Group 2: Evaluation Methodology
- BabyVision includes a direct comparison experiment in which 20 vision-centric tasks were given to children of different ages (3, 6, 10, 12 years) and to top multimodal models (a schematic scoring sketch follows this summary) [7].
- The evaluation strictly controls language dependency, requiring answers to be derived solely from visual information [8].

Group 3: Results and Findings
- The results reveal that most models score significantly below the average performance of three-year-old children; the best model, Gemini3-Pro-Preview, achieves only 49.7%, still about 20 percentage points below the performance of six-year-olds [15][21].
- Human participants scored an impressive 94.1% accuracy on the BabyVision-Full test, highlighting the substantial gap between human and model performance [20][21].

Group 4: Challenges Identified
- The study identifies four core challenges in visual reasoning for AI models: observing non-verbal details, maintaining visual tracking, lacking spatial imagination, and difficulty with visual pattern induction [27][30][36][39].
- These challenges indicate a systemic lack of foundational visual capabilities in current models, rather than isolated deficiencies [23].

Group 5: Future Directions
- The article suggests that transitioning visual reasoning tasks to visual operations, as demonstrated in BabyVision-Gen, may help bridge the gap in visual understanding [42].
- The ongoing development of BabyVision aims to guide the evolution of multimodal large models by breaking down visual understanding into 22 measurable atomic capabilities [49].
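The age-baseline comparison described above is, at its core, an accuracy-gap calculation: score the model on the same task set given to the children, then measure the distance to each age group's average. The sketch below is a minimal illustration of that arithmetic; the per-task results and baseline numbers are invented placeholders, not the actual BabyVision data.

```python
# Hypothetical sketch: compare a model's accuracy on vision-centric tasks
# against child age-group baselines, as described in the summary above.
# All numbers and records here are illustrative placeholders.

from typing import Dict, List

# Placeholder per-task results: True = model answered correctly (20 tasks).
model_results: List[bool] = [True, False, False, True, False] * 4

# Placeholder average accuracy of each child age group on the same tasks.
age_baselines: Dict[str, float] = {
    "age_3": 0.55,
    "age_6": 0.70,
    "age_10": 0.85,
    "age_12": 0.90,
}

def accuracy(results: List[bool]) -> float:
    """Fraction of tasks answered correctly."""
    return sum(results) / len(results)

model_acc = accuracy(model_results)
print(f"model accuracy: {model_acc:.1%}")

# Report the gap (in percentage points) to each age baseline.
for group, baseline in age_baselines.items():
    gap = (baseline - model_acc) * 100
    status = "below" if gap > 0 else "above"
    print(f"{group}: {abs(gap):.1f} pp {status} the child baseline")
```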
Multimodal Large Models Lose to a Three-Year-Old? xbench x UniPat Jointly Release New Evaluation Set BabyVision
Xin Lang Cai Jing· 2026-01-12 01:57
Core Insights
- The core issue is the significant gap in the visual understanding capabilities of multimodal large models when not relying on language prompts, with performance levels comparable to those of a three-year-old child [2][34].
- The BabyVision assessment framework dissects visual capability into four main categories (fine-grained discrimination, visual tracking, spatial perception, visual pattern recognition) comprising 22 sub-tasks, to identify specific weaknesses in model performance [2][34].
- Evaluation results reveal a stark contrast between human and model performance: human baseline accuracy is 94.1%, while the best closed-source model, Gemini3-Pro-Preview, achieved only 49.7%, followed by GPT-5.2 at 34.8%, Doubao-1.8 at 30.2%, and the best open-source model, Qwen3VL-235B-Thinking, at 22.2% [2][34].
- A key reason for this disparity is that many tasks cannot be fully expressed in language, leading to the concept of "unspeakable" tasks in which critical visual details are lost when compressed into tokens [2][34].
- BabyVision introduces a new direction by allowing models to generate visual outputs; BabyVision-Gen re-labels 280 tasks suitable for generative responses and achieves a 96% consistency rate with human evaluations [2][34].

Assessment Framework
- The BabyVision framework aims to break down understanding of the world into measurable, diagnosable, and iterable atomic capabilities, providing a roadmap for addressing visual shortcomings in multimodal and embodied intelligence [3][35].
- A direct comparison experiment was conducted in which 20 vision-centric tasks were given to children of various ages and to top multimodal models, revealing that most models scored significantly below the average performance of three-year-old children [4][36].
- The only model to consistently exceed the three-year-old baseline was Gemini3-Pro-Preview, which still lagged approximately 20 percentage points behind six-year-old children [4][36].

Visual Capability Breakdown
- The visual capabilities were categorized into four core areas, each with several sub-tasks (see the data-structure sketch after this summary):
  - Fine-grained Discrimination: 8 sub-tasks focused on distinguishing subtle visual differences
  - Visual Tracking: 5 sub-tasks aimed at following paths, lines, and motion trajectories
  - Spatial Perception: 5 sub-tasks related to understanding three-dimensional structures and their relationships
  - Visual Pattern Recognition: 4 sub-tasks for identifying logical and geometric patterns [10][42]
- The data collection process involved strict adherence to copyright regulations, ensuring that only suitable images were used, and each question underwent a rigorous double-blind quality check [11][43].

Challenges Identified
- The research identified four typical challenges faced by models in visual tasks:
  1. Non-verbal details: Models struggle with tasks requiring subtle visual distinctions that are easily recognized by humans [14][48]
  2. Tracking errors: Models often misinterpret paths and connections, leading to incorrect answers [16][51]
  3. Lack of spatial imagination: Models fail to accurately visualize and manipulate three-dimensional structures [19][53]
  4. Difficulty in pattern induction: Models tend to focus on superficial attributes rather than underlying structural rules [23][55]

Future Directions
- BabyVision-Gen represents a promising new approach, allowing models to perform visual reasoning through drawing and tracing, which may help address existing shortcomings [24][60].
- The importance of BabyVision lies in its potential to guide the development of multimodal models by identifying gaps in visual understanding and suggesting areas for improvement [29][61].
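Because BabyVision is organized as four categories containing 22 sub-tasks in total, per-category scoring reduces to grouping per-question results by category and averaging. The sketch below encodes the category and sub-task counts reported above as a plain data structure and aggregates hypothetical per-question results; the sample records are invented for illustration and are not taken from the benchmark.

```python
# Hypothetical sketch of the BabyVision taxonomy (4 categories, 22 sub-tasks)
# and per-category accuracy aggregation. Sub-task counts follow the summary
# above; the graded records below are invented placeholders.

from collections import defaultdict
from typing import Dict, List, Tuple

# Category -> number of sub-tasks, as reported above (8 + 5 + 5 + 4 = 22).
taxonomy: Dict[str, int] = {
    "fine_grained_discrimination": 8,
    "visual_tracking": 5,
    "spatial_perception": 5,
    "visual_pattern_recognition": 4,
}

# Placeholder graded questions: (category, answered correctly?).
graded: List[Tuple[str, bool]] = [
    ("fine_grained_discrimination", True),
    ("fine_grained_discrimination", False),
    ("visual_tracking", False),
    ("spatial_perception", True),
    ("visual_pattern_recognition", False),
]

def per_category_accuracy(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Average correctness per category."""
    buckets: Dict[str, List[bool]] = defaultdict(list)
    for category, correct in results:
        buckets[category].append(correct)
    return {c: sum(v) / len(v) for c, v in buckets.items()}

assert sum(taxonomy.values()) == 22  # sanity check on the reported counts
for category, acc in per_category_accuracy(graded).items():
    print(f"{category}: {acc:.0%}")
```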
Multimodal Large Models Lose to a Three-Year-Old? xbench x UniPat Jointly Release New Evaluation Set BabyVision
红杉汇· 2026-01-12 01:04
Core Insights
- The article discusses the advancements of large models in language and text reasoning, highlighting the need for models to understand visual information without relying on language; the BabyVision evaluation set is introduced to assess this capability [1][2].

Group 1: Evaluation of Visual Understanding
- BabyVision conducted a direct comparison between children of various ages (3, 6, 10, 12 years) and top multimodal models on 20 vision-centric tasks, revealing that most models scored below the average of 3-year-old children [2][4].
- The only model that consistently exceeded the 3-year-old baseline was Gemini3-Pro-Preview, which still lagged approximately 20 percentage points behind 6-year-old children [4].

Group 2: Breakdown of Visual Abilities
- The research team categorized visual abilities into four core categories: Visual Pattern Recognition, Fine-grained Discrimination, Visual Tracking, and Spatial Perception, with a total of 22 sub-tasks designed to quantify foundational visual skills [9][11].
- BabyVision was developed using a rigorous data collection process, referencing children's cognitive materials and visual development tests, resulting in 388 high-quality visual questions [10][11].

Group 3: Performance Results
- In the BabyVision-Full evaluation, human participants achieved an accuracy rate of 94.1%, while the best-performing model, Gemini3-Pro-Preview, scored only 49.7%, with most models falling in the 12-19% range [13].
- The performance gap was consistent across all four categories, indicating a systemic lack of foundational visual capabilities in the models [13].

Group 4: Challenges Identified
- The article identifies several challenges faced by models, including the inability to process visual information without losing details, leading to errors in tasks that require spatial imagination and visual pattern induction [15][23][26].
- Many tasks in BabyVision are described as "unspeakable," meaning they cannot be fully captured in language without losing critical visual information [15].

Group 5: Future Directions
- BabyVision-Gen was introduced to explore whether models can perform visual tasks the way children do, by generating images or videos as answers; it shows some improvement in human-like behavior but still lacks consistent accuracy (see the agreement-rate sketch after this summary) [27][28].
- The importance of BabyVision lies in its ability to break down visual understanding into measurable components, guiding the development of multimodal models toward true general intelligence and embodied intelligence [31].
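BabyVision-Gen grades image or video answers rather than text, and the earlier summary reports a 96% consistency rate between that grading and human judgments. The consistency rate itself is simple agreement arithmetic; the sketch below shows the computation on invented verdicts. The judge and human labels are placeholders for illustration, not BabyVision data, and the grading pipeline itself is not reproduced here.

```python
# Hypothetical sketch: consistency (agreement) rate between an automated
# judge's verdicts on generated visual answers and human graders' verdicts.
# Verdict lists are invented placeholders for illustration only.

from typing import List

def agreement_rate(judge: List[bool], human: List[bool]) -> float:
    """Fraction of items where the automated judge and human graders agree."""
    if len(judge) != len(human):
        raise ValueError("verdict lists must be the same length")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)

# Placeholder verdicts over a handful of generated answers.
judge_verdicts = [True, True, False, True, False, True]
human_verdicts = [True, True, False, False, False, True]

print(f"consistency rate: {agreement_rate(judge_verdicts, human_verdicts):.0%}")
```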