An AI "Stress Interview": DeepSeek's Performance Plunges Nearly 30% | Tsinghua & Shanghai AI Lab
量子位· 2025-07-19 05:15
Core Viewpoint
- The article introduces REST (Reasoning Evaluation through Simultaneous Testing), a new "stress test" framework for evaluating the reasoning capabilities of large language models (LLMs) under pressure; it reveals significant performance drops, particularly in multi-question scenarios [1][3][20].

Group 1: Stress Test Framework
- The REST framework presents multiple questions to a model simultaneously, simulating real-world complex reasoning scenarios [2][6] (a minimal prompt-construction sketch follows this summary).
- The framework was developed by research teams from Shanghai AI Lab, Tsinghua University, and Renmin University of China to address limitations in current evaluation methods [1][6].

Group 2: Performance Findings
- Top models such as DeepSeek-R1 showed a drastic accuracy drop of 29.1% on the AIME24 test set under stress conditions [3][11] (see the toy degradation calculation below).
- Performance across models was significantly affected, with smaller models (7B parameters) deteriorating faster under pressure than larger models (32B parameters) [13][19].

Group 3: Evaluation Limitations
- Current evaluation methods have three main issues: low differentiation among top models, the high cost of developing new test questions, and the lack of realism in single-question testing [5][6].
- REST addresses these issues by combining multiple questions into a single prompt, allowing a more comprehensive assessment of reasoning abilities [6][20].

Group 4: Key Reasoning Abilities
- The stress test evaluates several critical reasoning abilities, including context budget allocation, cross-question interference resistance, and dynamic cognitive load management [7][8][9].
- Models that manage token allocation effectively under pressure tend to perform better, demonstrating adaptive distribution of reasoning effort [17][19] (see the effort-inspection sketch at the end).

Group 5: Implications for Future Development
- The findings suggest that traditional single-question evaluations may overlook significant reasoning flaws, such as question omission and incorrect reasoning summaries [20].
- REST offers a new paradigm for constructing evaluation datasets that are more cost-effective and closer to real-world applications, providing insights for developing more robust LLMs [20].
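
To make Group 1 concrete, here is a minimal Python sketch of the core REST idea: packing several benchmark questions into one prompt so the model must reason about them simultaneously. The prompt wording, answer-labelling convention, and sample questions are illustrative assumptions, not the paper's exact protocol.

```python
# A minimal sketch of the REST idea: bundle several benchmark questions
# into one prompt so the model must handle them all at once.
# The header wording and "Problem N:" labels are assumptions, not the
# framework's published format.

def build_rest_prompt(questions: list[str]) -> str:
    """Combine multiple questions into a single stress-test prompt."""
    header = (
        "Solve all of the following problems. "
        "Label each answer clearly, e.g. 'Answer 1: ...'.\n\n"
    )
    body = "\n\n".join(
        f"Problem {i + 1}: {q}" for i, q in enumerate(questions)
    )
    return header + body

# Example: a stress level of 3 packs three AIME-style questions together.
prompt = build_rest_prompt([
    "Find the number of positive integers n < 1000 divisible by 7 or 11.",
    "Compute the remainder when 2^100 is divided by 125.",
    "How many diagonals does a regular 12-gon have?",
])
print(prompt)
```

Because existing single-question benchmarks supply the material, this construction reuses test items rather than requiring new ones, which is the cost advantage cited in Group 3.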
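
The 29.1% AIME24 drop reported in Group 2 can be read as either absolute percentage points or a relative decline; this summary does not say which. The toy calculation below shows both readings, using a placeholder baseline accuracy rather than a published score.

```python
# Toy reading of the reported degradation; the baseline is a placeholder,
# not DeepSeek-R1's published AIME24 score.
single_acc = 0.80                        # assumed single-question accuracy
drop_points = 0.291                      # reported drop, read as absolute points
stressed_acc = single_acc - drop_points  # accuracy under REST stress

relative_drop = drop_points / single_acc # the same drop, read relatively
print(f"stressed accuracy: {stressed_acc:.1%}, "
      f"relative decline: {relative_drop:.1%}")
```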
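
Group 4's "adaptive distribution of reasoning effort" can be probed by measuring how much of its output a model spends on each bundled question. The sketch below splits a combined response on the "Problem N:" markers assumed in the first sketch and counts whitespace-separated tokens per question; both the marker format and the crude token count are simplifying assumptions, not the paper's measurement.

```python
# A rough sketch of inspecting reasoning-effort allocation: split a
# model's combined response by problem markers and count the tokens
# spent on each sub-question.

import re

def effort_per_question(response: str) -> dict[int, int]:
    """Approximate tokens the model devotes to each labelled problem."""
    # re.split with a capture group yields
    # [preamble, "1", text1, "2", text2, ...]
    parts = re.split(r"Problem\s+(\d+):", response)
    efforts = {}
    for idx, text in zip(parts[1::2], parts[2::2]):
        efforts[int(idx)] = len(text.split())
    return efforts

sample = (
    "Problem 1: Count multiples of 7 and 11, subtract the overlap: 220. "
    "Problem 2: By Euler's theorem, 2^100 mod 125 is 1. "
    "Problem 3: n(n-3)/2 with n = 12 gives 54."
)
print(effort_per_question(sample))  # e.g. {1: 10, 2: 9, 3: 8}
```

A model that burns most of its token budget on the first problem and omits the rest would show a heavily skewed distribution here, which matches the question-omission failure mode noted in Group 5.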