SJTU / Shanghai AI Lab's Zhai Guangtao: When evaluation no longer matters, AGI will have arrived
Jiqizhixin (机器之心) · 2025-07-15 03:20

Core Viewpoint
- The article discusses the challenges and limitations of current AI evaluation systems, emphasizing that a perfect evaluation system would be equivalent to achieving Artificial General Intelligence (AGI) [3][20].

Evaluation System Challenges
- The primary issue with evaluation systems is "data contamination": publicly available benchmark tests are often included in the training data of subsequent models, undermining the diagnostic value of evaluations [5][6].
- The "atomization of capabilities" in evaluations leads to a fragmented understanding of intelligence, as complex skills are broken down into isolated tasks that may not reflect a model's true capabilities in real-world applications [7][8].
- There is a significant disconnect in embodied intelligence, where models perform well in simulated environments but poorly in real-world scenarios, highlighting the need for more realistic evaluation frameworks [9].

Evaluation Framework and Methodology
- The article proposes a "Human-Centered Evaluation" approach, focusing on how models enhance human task efficiency and experience rather than merely comparing model performance against benchmarks [12][13].
- The "EDGE" framework (Evolving, Dynamic, Granular, Ecosystem) is introduced, aiming to create a responsive evaluation system that adapts to AI advancements [13].
- The team is developing a high-quality internal question bank to mitigate data contamination, and plans to gradually open-source questions to ensure reproducibility [15].

Future Directions and Goals
- "Training-evaluation integration" is emphasized: evaluation results inform the training process, creating a feedback loop that aligns model development with human preferences [16][17].
- The ultimate goal is to establish a comprehensive evaluation framework that encompasses the major aspects of AI, guiding the industry toward a more value-driven and human-centered development path [22][23].
- The article concludes that the success of AI evaluation systems lies in their eventual obsolescence, as achieving AGI would mean that self-evaluation capabilities become inherent to the AI itself [24].
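The "data contamination" problem described above is commonly checked for by measuring textual overlap between a benchmark and a training corpus. As a minimal illustrative sketch (the function names and the 8-word n-gram window are assumptions for illustration, not the article's or any specific benchmark's methodology):

```python
def ngrams(text, n=8):
    """Split text into overlapping word n-grams (an 8-word window is a common choice)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, training_corpus, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus.

    A high rate suggests the benchmark has leaked into the training data,
    which is exactly the diagnostic-value problem the article describes.
    """
    corpus_grams = ngrams(training_corpus, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return hits / len(benchmark_items) if benchmark_items else 0.0
```

This is why the team's internal, unreleased question bank helps: questions that never appear publicly cannot register as n-gram overlaps in any training corpus, so results retain their diagnostic value until the questions are eventually open-sourced.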