Large Models All Fail Elementary School Math Problems! DAMO Academy Launches New Benchmark VCBench

Core Insights
- Large models solve many math problems well, yet the article argues that they still fall short on basic mathematical principles. On VCBench, humans averaged 93.30%, while the best-performing models (Gemini2.0-Flash, Qwen-VL-Max, and Claude-3.7-Sonnet) all scored below 50% accuracy [1][17].

Group 1: Model Performance
- Performance on elementary math problems is surprisingly low: the best closed-source models, Gemini2.0-Flash, Qwen-VL-Max, and Claude-3.7-Sonnet, scored 49.77%, 47.03%, and 46.63% respectively, all below the 50% mark [1][17].
- Open-source models generally scored lower than closed-source ones, and results varied across models, likely because of differences in architecture, multimodal integration, or training data quality [17].
- Models did comparatively well on reasoning and pattern recognition tasks but struggled significantly with spatial and geometric reasoning, which points to a gap in visual and geometric perception [17].

Group 2: VCBench Overview
- VCBench is a new benchmark for multimodal mathematical reasoning tasks with explicit visual dependencies, built specifically around elementary school math problems [4][5].
- The benchmark is vision-centric rather than knowledge-centric, mirroring children's learning paths, in which visual reasoning precedes the acquisition of domain-specific knowledge [8][10].
- Each question includes 3.9 images on average, so models must integrate visual cues from multiple images, as in real-world scenarios where information is often spread across several visual inputs [12].

Group 3: Cognitive Skills Assessment
- VCBench evaluates cognitive skills across six core domains: time and calendar, spatial awareness, geometry, objects and motion, reasoning and observation, and organization and pattern [14].
- It also assesses five reasoning abilities: temporal reasoning, geometric reasoning, logical reasoning, spatial reasoning, and pattern recognition [15].

Group 4: Error Analysis
- Visual perception errors are the most significant weakness across all models, accounting for over 50% of errors, which indicates a fundamental limitation of current multimodal models [27].
- Calculation errors account for 4-7% of errors, and context misunderstanding errors are generally low, suggesting that models perform better in direct visual tasks than in complex reasoning scenarios [27].
- Logical reasoning varies significantly among models; Claude has the highest rate of logical errors at 33%, indicating unstable reasoning performance [29].
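
As a rough illustration of how the error shares in Group 4 could be tallied, the sketch below counts error labels and converts them to percentages. The function name `error_shares`, the category labels, and the sample list are all hypothetical and are not taken from VCBench or its results.

```python
from collections import Counter


def error_shares(error_labels):
    """Return each error category's share of all errors, in percent."""
    counts = Counter(error_labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # no errors recorded, so there are no shares to report
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical labels for ten wrong answers from one model (not VCBench data).
sample = ["visual_perception"] * 6 + ["logical"] * 3 + ["calculation"]
shares = error_shares(sample)

for category, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{category}: {share:.1f}%")
```

With labels like these, a category would count as the dominant weakness when its share exceeds 50%, which is the sense in which the article describes visual perception errors.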