Large models fall short across the board: Institute of Automation, Chinese Academy of Sciences releases a new multi-image mathematical reasoning benchmark | CVPR 2025
量子位 (QbitAI) · 2025-03-11 05:24

Core Viewpoint
- The MV-MATH dataset, developed by the Institute of Automation, Chinese Academy of Sciences, is designed to evaluate the mathematical reasoning capabilities of multimodal large language models (MLLMs) in complex visual scenarios, and it reveals the significant challenges current models face in this domain [1][4][33].

Summary by Sections

MV-MATH Introduction
- MV-MATH consists of 2,009 high-quality mathematical problems drawn from real K-12 educational scenarios, interleaving multiple images with text to create complex visual contexts [7][8].
- The dataset includes multiple-choice, fill-in-the-blank, and multi-step questions across 11 mathematical domains, categorized into three difficulty levels [8][9].

Dataset Characteristics
- Each problem features multiple images (2-8), increasing the complexity of the reasoning tasks; a schematic sketch of such a record appears after the Conclusion below [12].
- Quality is ensured through cross-validation by at least two annotators, with detailed annotations for questions, answers, and image relevance [13].
- Coverage spans a wide range of mathematical fields, from basic arithmetic to advanced geometry, allowing comprehensive evaluation of MLLM reasoning capabilities [15].
- MV-MATH introduces image relevance as a labeled feature, splitting questions into mutually dependent and independent subsets and highlighting the need for cross-image understanding [16][17].

Performance Evaluation
- Extensive experiments on 24 mainstream multimodal models show that even the most advanced models struggle with multi-image mathematical tasks, performing far below human level [20][21].
- The best-performing model, Claude-3.5, achieved an overall accuracy of 33.9%, followed by GPT-4o at 32.1% [21][22].
- Performance varied across mathematical domains: Claude-3.5 reached its highest accuracy, 54.2%, in arithmetic, but only 27.0% in combinatorial geometry, indicating difficulty with complex image understanding [24][25].

Detailed Analysis
- Broken down by difficulty, GPT-4o performed best on easy questions (40.3%), while all models dropped significantly on hard questions [27].
- For closed-source models, Chain of Thought (CoT) and few-shot prompting did not consistently improve performance; for open-source models, these methods often reduced accuracy [28].
- Models performed worse on the mutually dependent image subset than on the independent subset, with the largest gap observed in Gemini-1.5-Pro [30].
- Sequential image input outperformed merged input across all tested models, underscoring the importance of preserving spatial and sequential information in multi-image reasoning; the difference between the two input formats is illustrated in the second sketch below [31].

Conclusion
- The research confirms that MLLMs struggle with complex multi-visual perception and cross-image understanding, leaving substantial room for improvement in multi-image mathematical reasoning [33].
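To make the dataset structure concrete, here is a minimal sketch of what a single MV-MATH record and a grouped-accuracy tally could look like. The field names (`question`, `images`, `answer`, `domain`, `difficulty`, `image_relevance`) are illustrative assumptions based on the characteristics described above, not the dataset's published schema.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class MVMathSample:
    """Hypothetical layout of one MV-MATH problem; field names are assumptions."""
    question: str         # problem text with inline image placeholders
    images: list[str]     # paths to the 2-8 images attached to the problem
    answer: str           # annotated gold answer
    domain: str           # one of the 11 mathematical domains, e.g. "arithmetic"
    difficulty: str       # "easy" | "medium" | "hard"
    image_relevance: str  # "mutually_dependent" | "independent"

def accuracy_by(samples, predictions, key):
    """Compute accuracy grouped by an attribute such as domain or difficulty."""
    correct, total = defaultdict(int), defaultdict(int)
    for sample, pred in zip(samples, predictions):
        group = getattr(sample, key)
        total[group] += 1
        correct[group] += int(pred.strip() == sample.answer.strip())
    return {group: correct[group] / total[group] for group in total}
```

Grouping by `domain`, `difficulty`, or `image_relevance` in this way is how per-category figures such as Claude-3.5's 54.2% on arithmetic versus 27.0% on combinatorial geometry would be tabulated.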

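The sequential-versus-merged comparison from the Detailed Analysis can also be sketched. "Merged" input stitches all of a problem's images into a single canvas before feeding the model, discarding per-image boundaries and ordering cues, while "sequential" input keeps each image as a separate, ordered item. The sketch below uses Pillow and is a generic illustration; the exact input format of each evaluated model differs.

```python
from PIL import Image

def merged_input(image_paths):
    """Stitch all problem images into one horizontal canvas (the weaker strategy)."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    canvas = Image.new(
        "RGB",
        (sum(im.width for im in images), max(im.height for im in images)),
        "white",
    )
    x = 0
    for im in images:
        canvas.paste(im, (x, 0))
        x += im.width
    return [canvas]  # a single image; per-image boundaries and order are lost

def sequential_input(image_paths):
    """Pass images as separate, ordered items (the stronger strategy in the study)."""
    return [Image.open(p).convert("RGB") for p in image_paths]
```

Keeping the images separate preserves the spatial and sequential signals that the study found essential for multi-image reasoning.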