Core Viewpoint
- The article evaluates how well several AI models solve gaokao (college entrance exam) mathematics questions. It finds clear progress in mathematical reasoning, along with remaining weaknesses in rigorous reasoning and image recognition [2][26].

Group 1: Objective Questions Performance
- The models were tested on 14 objective questions and 5 subjective questions from the 2025 new-curriculum gaokao mathematics paper, worth 150 points in total [3][9].
- The models performed similarly on the objective questions, with the largest score gap being only 3 points. The image-based question (Question 6) posed significant challenges for most models [7][20].
- Objective scores were generally high: Doubao, Qwen3, Gemini 2.5 Pro, and DeepSeek R1 each scored around 68 points, while o3 performed the worst [20][21].

Group 2: Subjective Questions Performance
- The subjective questions were the main area of weakness; only Gemini 2.5 Pro earned the full 77 points [8][11].
- Doubao and DeepSeek R1 each lost only one point, and o3 lost two [8][9].
- hunyuan-t1-latest and 文心 (ERNIE) X1 Turbo performed poorly, scoring 68 and 66 points respectively [9][11].

Group 3: Image Recognition Challenges
- All participating models struggled with the image recognition question (Question 6), indicating a significant shortcoming in their ability to integrate visual and textual information [27].
- Their failure to interpret the image accurately highlights the need for further development of multi-modal understanding [26][27].

Group 4: Overall Assessment
- The evaluation concludes that the models' mathematical reasoning has improved notably, but that substantial improvement is still required in complex reasoning, rigorous proof, and multi-step calculation [26][28].
- The results suggest that current models show potential but must address their limitations in both mathematical problem-solving and image recognition to become more effective overall [26][27].
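
The point totals quoted in this summary can be cross-checked with a little arithmetic. The Python sketch below is illustrative only and is not from the article: it derives the implied maximum for the objective section (150 minus 77) and converts the reported "points lost" into subjective scores. It assumes that the points lost by Doubao, DeepSeek R1, and o3 refer to the subjective section, as their placement in Group 2 suggests. The variable and function names are hypothetical.

```python
# Point totals stated in the summary.
TOTAL_POINTS = 150     # full paper
SUBJECTIVE_MAX = 77    # perfect subjective score (reached by Gemini 2.5 Pro)

# Implied by the two figures above; the summary does not state it directly.
OBJECTIVE_MAX = TOTAL_POINTS - SUBJECTIVE_MAX

# Points lost, as reported in the summary.
# Assumption: these losses are on the subjective section.
points_lost = {
    "Gemini 2.5 Pro": 0,
    "Doubao": 1,
    "DeepSeek R1": 1,
    "o3": 2,
}

# Subjective scores the summary reports directly rather than as points lost.
reported_scores = {
    "hunyuan-t1-latest": 68,
    "文心 X1 Turbo": 66,
}


def subjective_scores():
    """Return each model's subjective score out of SUBJECTIVE_MAX."""
    scores = {model: SUBJECTIVE_MAX - lost for model, lost in points_lost.items()}
    scores.update(reported_scores)
    return scores


if __name__ == "__main__":
    print(f"objective maximum (implied): {OBJECTIVE_MAX}")
    ranked = sorted(subjective_scores().items(), key=lambda kv: -kv[1])
    for model, score in ranked:
        print(f"{model}: {score}/{SUBJECTIVE_MAX}")
```

Under these assumptions, a score of around 68 on the objective section corresponds to losing about 5 of the implied 73 objective points.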
Full Gaokao Math Paper Rematch! One Question Stumps Every Large Model; Newcomer Gemini Takes First Place, Doubao and DeepSeek Tie for Second
机器之心·2025-06-10 17:56