Core Viewpoint
- The article examines the limitations of current large models in visual understanding: while they excel at language and text-based reasoning, their visual capabilities remain underdeveloped, lagging behind even a three-year-old child [3][4][49].

Group 1: BabyVision Overview
- UniPat AI, in collaboration with Sequoia China and several research teams, has released BabyVision, a new multimodal understanding evaluation set for assessing the visual capabilities of AI models [3][4].
- BabyVision aims to establish a new paradigm for AI training, evaluation, and real-world application, focused on producing measurable, iterable visual capabilities [4][49].

Group 2: Evaluation Methodology
- BabyVision includes a direct comparison experiment in which 20 vision-centric tasks were given both to children of different ages (3, 6, 10, and 12 years) and to top multimodal models [7].
- The evaluation strictly controls for language dependency, requiring that answers be derived solely from visual information [8]. (A minimal scoring sketch follows at the end of this digest.)

Group 3: Results and Findings
- The results show that most models score significantly below the average of three-year-old children; the best model, Gemini3-Pro-Preview, reaches only 49.7%, still roughly 20 percentage points below the performance of six-year-olds [15][21].
- Human participants achieved 94.1% accuracy on the BabyVision-Full test, underscoring the substantial gap between human and model performance [20][21].

Group 4: Challenges Identified
- The study identifies four core challenges for AI models in visual reasoning: failure to observe non-verbal details, inability to maintain visual tracking, lack of spatial imagination, and difficulty with visual pattern induction [27][30][36][39].
- These challenges point to a systemic shortfall in foundational visual capabilities across current models, rather than isolated deficiencies [23].

Group 5: Future Directions
- The article suggests that reframing visual reasoning tasks as visual operations, as demonstrated in BabyVision-Gen, may help close the gap in visual understanding [42].
- The ongoing development of BabyVision aims to guide the evolution of multimodal large models by breaking visual understanding down into 22 measurable atomic capabilities [49].
Top-tier AI loses to a three-year-old: the BabyVision test exposes a fundamental weakness of multimodal models
机器之心·2026-01-12 05:01
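
To make the comparison protocol in Group 2 concrete, below is a minimal, hypothetical scoring sketch in Python. It assumes simple exact-match scoring over per-task items and compares a model's overall accuracy against age-group baselines. The `Item` schema, task names, and baseline numbers are illustrative placeholders, not BabyVision's actual harness, data format, or reported figures.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical record for one benchmark item; BabyVision's real data schema is not shown in the article.
@dataclass
class Item:
    task: str        # e.g. "visual_tracking", "spatial_imagination" (placeholder task names)
    answer: str      # ground-truth label
    prediction: str  # response from a model (or a child participant)

def _match(item: Item) -> bool:
    """Exact-match scoring after trivial normalization."""
    return item.prediction.strip().lower() == item.answer.strip().lower()

def accuracy_by_task(items: list[Item]) -> dict[str, float]:
    """Per-task accuracy: fraction of items whose prediction matches the answer."""
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:
        total[it.task] += 1
        correct[it.task] += int(_match(it))
    return {task: correct[task] / total[task] for task in total}

def overall_accuracy(items: list[Item]) -> float:
    return sum(_match(it) for it in items) / len(items)

if __name__ == "__main__":
    # Toy model responses; purely illustrative.
    model_items = [
        Item("visual_tracking", "B", "A"),
        Item("visual_tracking", "C", "C"),
        Item("spatial_imagination", "left", "left"),
        Item("pattern_induction", "triangle", "square"),
    ]
    # Placeholder age-group baselines, NOT the study's measured values.
    human_baselines = {"age_3": 0.55, "age_6": 0.70, "adult": 0.94}

    model_acc = overall_accuracy(model_items)
    print(f"model overall accuracy: {model_acc:.1%}")
    for group, acc in human_baselines.items():
        print(f"  gap vs {group}: {model_acc - acc:+.1%}")
    for task, acc in accuracy_by_task(model_items).items():
        print(f"  {task}: {acc:.1%}")
```

A real comparison against children would of course require task-matched administration and human grading; the sketch only shows how per-task accuracies and age-group gaps could be tabulated once responses are collected.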