AI Observation | Facing "score gaming," large model test sets have reached the point where they must change
Huan Qiu Wang · 2025-05-12 09:00
Core Viewpoint
- The AI industry is debating whether existing large model testing sets are still adequate, with a consensus emerging that a new, universally accepted testing framework is needed to accurately assess the capabilities of advanced AI models [1][6].

Group 1: Current State of AI Testing
- The article notes that mainstream AI models have reportedly passed the Turing test, which some take as evidence that they meet the standards for Artificial General Intelligence (AGI) [1].
- Existing testing sets, such as MMLU, have been criticized as unable to effectively evaluate the rapidly evolving capabilities of large models, raising concerns about their reliability [3][4].
- The emergence of "score gaming," where developers optimize models against the test questions to inflate scores, has further undermined the credibility of current evaluation methods [3][4].

Group 2: New Testing Initiatives
- The new FrontierMath testing set shows significant performance differentiation among models, with OpenAI's latest o3 model achieving a 25% correct rate, far surpassing other models [5].
- However, OpenAI's access to the FrontierMath question database has raised questions about the integrity of this testing set [5].
- Industry stakeholders, including Scale AI and CAIS, are collaborating to design a new model testing set intended to be more reliable and broadly accepted [6].
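
For context on how figures like the 25% correct rate above are produced, the following is a minimal sketch of how a multiple-choice benchmark score is computed. The `Question` fields and the `model_answer` stub are hypothetical placeholders, not part of MMLU, FrontierMath, or any vendor's actual evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    choices: list[str]   # e.g. ["A ...", "B ...", "C ...", "D ..."]
    answer_index: int    # index of the correct choice

def model_answer(q: Question) -> int:
    """Placeholder: return the index of the choice the model picks.
    A real harness would query the model under test here."""
    return 0

def correct_rate(questions: list[Question]) -> float:
    """Fraction of questions answered correctly (the reported 'correct rate')."""
    correct = sum(1 for q in questions if model_answer(q) == q.answer_index)
    return correct / len(questions) if questions else 0.0

if __name__ == "__main__":
    demo = [
        Question("2 + 2 = ?", ["3", "4", "5", "6"], answer_index=1),
        Question("Capital of France?", ["Paris", "Rome", "Berlin", "Madrid"], answer_index=0),
    ]
    print(f"correct rate: {correct_rate(demo):.0%}")
```

Because scoring is just a fraction over a fixed, often public question set, a developer who tunes a model on those same questions can raise its score without improving its general ability, which is the "score gaming" problem the article describes.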