Six Large Models Battle the Gaokao Math New Paper I: Doubao and Yuanbao Tie for First, OpenAI o3 Finishes Dead Last

Core Viewpoint
- The article examines how several AI models performed on gaokao (China's national college entrance exam) mathematics questions, highlighting both the remaining challenges and the advances in AI reasoning compared with previous years [3][40].

Group 1: AI Model Performance
- The models tested were ByteDance's Doubao, DeepSeek, Alibaba's Tongyi, Tencent's Yuanbao (T1), Baidu's Wenxin X1 Turbo, and OpenAI's o3. Doubao and Yuanbao achieved the highest scores [8][10].
- Doubao and Yuanbao both scored 68 points, DeepSeek scored 63 points, and Wenxin X1 Turbo scored 51 points, so results varied widely across the models [10][40].
- OpenAI's o3 performed poorly, scoring only 34 points, which raised concerns about its adaptability to the Chinese exam format [11][40].

Group 2: Question Types and Scoring
- The mathematics exam consisted of multiple-choice, multiple-answer, and fill-in-the-blank questions, each with its own scoring rules [9][28].
- In the multiple-choice section, Doubao, Tongyi, and Yuanbao scored 35 points each and DeepSeek scored 30 points, while o3 struggled significantly [16][31].
- On the multiple-answer questions, Doubao, DeepSeek, and Yuanbao achieved full marks, while Wenxin X1 Turbo and o3 faced challenges [28][33].
- In the fill-in-the-blank section, four models scored full marks, an improvement over previous assessments [34][36].

Group 3: Improvements and Challenges
- The models' mathematical reasoning improved significantly over the previous year, with most models surpassing the passing score of 43.8 points [40].
- The models also reflected more on their own work: they began to re-evaluate their answers when they ran into inconsistencies, a notable advance over last year's performance [40][41].
- Despite these improvements, the models still shared common weaknesses: calculation errors, weak geometric intuition, and sensitivity to problem conditions [43][44].
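
As a quick check on the totals reported above, the following Python sketch ranks the five models whose overall scores are stated and compares each against the 43.8-point passing score. The dictionary and variable names are illustrative assumptions, and Tongyi is left out because the summary does not give its total.

```python
# Total scores as reported in the summary above.
# Tongyi's overall total is not stated there, so it is deliberately omitted.
reported_totals = {
    "Doubao": 68,
    "Yuanbao": 68,
    "DeepSeek": 63,
    "Wenxin X1 Turbo": 51,
    "OpenAI o3": 34,
}

PASSING_SCORE = 43.8  # passing score quoted in the article

# Sort by score, highest first. sorted() is stable, so tied models
# keep the order in which they were listed.
ranking = sorted(reported_totals.items(), key=lambda item: item[1], reverse=True)

passed = [name for name, score in ranking if score >= PASSING_SCORE]
failed = [name for name, score in ranking if score < PASSING_SCORE]

for name, score in ranking:
    status = "pass" if score >= PASSING_SCORE else "below passing"
    print(f"{name}: {score} ({status})")
```

With these figures, four of the five listed models clear the passing score and only o3 falls below it, which matches the article's statement that most models passed.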