Core Insights
- The core issue is a significant gap in the visual understanding of multimodal large models once they cannot lean on language prompts, with performance falling roughly to the level of a three-year-old child [2][34]
- The BabyVision assessment framework breaks visual capability into four main categories (fine-grained discrimination, visual tracking, spatial perception, visual pattern recognition) comprising 22 sub-tasks, so that specific weaknesses in model performance can be pinpointed [2][34]
- Evaluation results reveal a stark contrast between human and model performance: the human baseline accuracy is 94.1%, while the best closed-source model, Gemini3-Pro-Preview, achieves only 49.7%, followed by GPT-5.2 at 34.8%, Doubao-1.8 at 30.2%, and the best open-source model, Qwen3VL-235B-Thinking, at 22.2% [2][34]
- A key reason for this disparity is that many tasks cannot be fully expressed in language; in these "unspeakable" tasks, critical visual details are lost when the image is compressed into tokens [2][34]
- BabyVision also opens a new direction by allowing models to answer with visual outputs: BabyVision-Gen re-labels 280 tasks as suitable for generative responses, achieving a 96% consistency rate with human evaluations [2][34]

Assessment Framework
- The BabyVision framework aims to break understanding of the visual world into measurable, diagnosable, and iteratively improvable atomic capabilities, providing a roadmap for addressing the visual shortcomings of multimodal and embodied intelligence [3][35]
- In a direct comparison experiment, 20 vision-centric tasks were given to children of various ages and to top multimodal models; most models scored significantly below the average performance of three-year-old children [4][36]
- The only model to consistently exceed the three-year-old baseline was Gemini3-Pro-Preview, which still trailed six-year-old children by roughly 20 percentage points [4][36]

Visual Capability Breakdown
- The visual capabilities are organized into four core areas, each with several sub-tasks (see the sketch after this section):
  - Fine-grained Discrimination: 8 sub-tasks focused on distinguishing subtle visual differences
  - Visual Tracking: 5 sub-tasks aimed at following paths, lines, and motion trajectories
  - Spatial Perception: 5 sub-tasks on understanding three-dimensional structures and their relationships
  - Visual Pattern Recognition: 4 sub-tasks for identifying logical and geometric patterns [10][42]
- Data collection followed strict copyright requirements so that only suitable images were used, and each question underwent a rigorous double-blind quality check [11][43]
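The article only gives the four category names and their sub-task counts (8 + 5 + 5 + 4 = 22). As a minimal sketch, the taxonomy and a simple macro-average scorer could look like the Python below; the aggregation scheme and identifier names are illustrative assumptions, not the benchmark's published scoring code.

```python
# Sketch of the BabyVision capability taxonomy described in the article.
# Only the four categories and their sub-task counts come from the source;
# the uniform macro-average below is an assumption, not the official metric.

CATEGORIES = {
    "fine_grained_discrimination": 8,   # distinguishing subtle visual differences
    "visual_tracking": 5,               # following paths, lines, motion trajectories
    "spatial_perception": 5,            # 3D structures and their relationships
    "visual_pattern_recognition": 4,    # logical and geometric patterns
}

assert sum(CATEGORIES.values()) == 22  # 22 sub-tasks in total, per the article


def macro_accuracy(per_subtask_acc: dict[str, list[float]]) -> dict[str, float]:
    """Average sub-task accuracies within each category, plus an overall mean.

    `per_subtask_acc` maps a category name to a list of per-sub-task accuracies
    (one float in [0, 1] per sub-task).
    """
    per_category = {
        cat: sum(scores) / len(scores) for cat, scores in per_subtask_acc.items()
    }
    all_scores = [s for scores in per_subtask_acc.values() for s in scores]
    per_category["overall"] = sum(all_scores) / len(all_scores)
    return per_category
```

Under this reading, the reported headline numbers (94.1% human vs. 49.7% for the best model) would correspond to the "overall" entry; how BabyVision actually weights categories is not stated in the article.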
Challenges Identified
- The research identified four typical challenges models face on visual tasks:
  1. Non-verbal details: models struggle with subtle visual distinctions that humans recognize easily [14][48]
  2. Tracking errors: models often misread paths and connections, leading to incorrect answers [16][51]
  3. Lack of spatial imagination: models fail to accurately visualize and manipulate three-dimensional structures [19][53]
  4. Difficulty with pattern induction: models latch onto superficial attributes rather than the underlying structural rules [23][55]

Future Directions
- BabyVision-Gen represents a promising new approach that lets models perform visual reasoning by drawing and tracing, which may help address these shortcomings; its automated scoring is checked against human judgments (a sketch of such a consistency check follows this section) [24][60]
- The broader importance of BabyVision lies in its potential to guide the development of multimodal models by exposing gaps in visual understanding and pointing to areas for improvement [29][61]
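The article reports a 96% consistency rate between BabyVision-Gen's evaluation and human judgments over the 280 re-labeled generative tasks, without describing how that figure is computed. A minimal, hypothetical way to compute such an agreement rate is sketched below; the data layout and function are assumptions for illustration only.

```python
# Hypothetical consistency check between an automatic grader and human raters
# for BabyVision-Gen-style generative tasks. The article only reports the 96%
# figure; this layout and logic are illustrative, not the benchmark's code.

from typing import Sequence


def consistency_rate(auto_verdicts: Sequence[bool],
                     human_verdicts: Sequence[bool]) -> float:
    """Fraction of tasks where the automatic verdict matches the human verdict."""
    if len(auto_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must align one-to-one with the tasks")
    agreements = sum(a == h for a, h in zip(auto_verdicts, human_verdicts))
    return agreements / len(auto_verdicts)


# Example: 269 matching verdicts out of 280 tasks gives roughly 0.96.
print(consistency_rate([True] * 269 + [False] * 11, [True] * 280))
```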
Multimodal large models lose to a three-year-old? xbench x UniPat jointly release the new benchmark BabyVision
Xin Lang Cai Jing·2026-01-12 01:57