AI Evaluation
Zhai Guangtao of Shanghai Jiao Tong University / Shanghai AI Lab: When Evaluation No Longer Matters, AGI Will Have Arrived
机器之心· 2025-07-15 03:20
Core Viewpoint
- The article discusses the challenges and limitations of current AI evaluation systems, emphasizing that a perfect evaluation system would equate to achieving Artificial General Intelligence (AGI) [3][20].

Evaluation System Challenges
- The primary issue with evaluation systems is "data contamination": publicly available benchmark tests are often included in the training data of subsequent models, undermining the diagnostic value of evaluations [5][6] (a generic overlap check is sketched after this summary).
- The "atomization of capabilities" in evaluations leads to a fragmented understanding of intelligence, as complex skills are broken down into isolated tasks that may not reflect a model's true capabilities in real-world applications [7][8].
- There is a significant disconnect in embodied intelligence, where models perform well in simulated environments but poorly in real-world scenarios, highlighting the need for more realistic evaluation frameworks [9].

Evaluation Framework and Methodology
- The article proposes a "Human-Centered Evaluation" approach, focusing on how models enhance human task efficiency and experience rather than merely comparing model performance against benchmarks [12][13].
- The "EDGE" framework is introduced, standing for Evolving, Dynamic, Granular, and Ecosystem, which aims to create a responsive evaluation system that adapts to AI advancements [13].
- The team is developing a high-quality internal question bank to mitigate data contamination and plans to gradually open-source questions to ensure reproducibility [15].

Future Directions and Goals
- The concept of "training-evaluation integration" is emphasized, where evaluations inform training processes, creating a feedback loop that aligns model development with human preferences [16][17].
- The ultimate goal is to establish a comprehensive evaluation framework that encompasses various aspects of AI, guiding the industry toward a more value-driven and human-centered development path [22][23].
- The article concludes that the success of AI evaluation systems lies in their eventual obsolescence, since achieving AGI would mean that self-evaluation becomes an inherent capability of the AI itself [24].
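The "data contamination" problem above is commonly quantified by measuring how much of a benchmark question already appears verbatim in a candidate training corpus. The snippet below is a minimal, generic n-gram overlap check in Python; it is not the team's internal question-bank tooling, and the function names, n-gram size, and example texts are illustrative assumptions.

```python
# Minimal sketch of a benchmark-contamination check: flag evaluation questions
# whose word-level n-grams also appear in a model's training corpus.
# Generic illustration only, not the question-bank tooling described in the article.
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(question: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in any corpus document."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(q_grams & corpus_grams) / len(q_grams)


if __name__ == "__main__":
    benchmark_question = "Explain why the sky appears blue during the day and red at sunset."
    training_snippets = [
        "A physics FAQ: explain why the sky appears blue during the day and red at sunset.",
        "Unrelated text about cooking pasta.",
    ]
    rate = contamination_rate(benchmark_question, training_snippets, n=6)
    print(f"contaminated n-gram fraction: {rate:.2f}")  # a high value suggests leakage
```

A high overlap fraction across many questions indicates that the benchmark has leaked into the training data and its scores have lost their diagnostic value, which is what motivates keeping the question bank internal until release.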
Minecraft as AI's New "Exam Hall"? A High School Senior Evaluates AI with the Game: DeepSeek-R1 Ranks Third
36Kr · 2025-03-25 12:45
Core Insights
- A high school student, Adi Singh, has developed a new AI evaluation benchmark called MC-Bench, which uses the game Minecraft to assess AI models' capabilities in a more intuitive manner [1][2][10].
- Traditional standardized tests often give AI models an unfair advantage, as they are optimized for specific tasks, leading to discrepancies in real-world performance [2][8].
- MC-Bench allows users to vote on AI-generated architectural designs in Minecraft, providing a crowdsourced method for evaluating AI performance [5][9] (see the ranking sketch after this summary).

Group 1: MC-Bench Overview
- MC-Bench evaluates AI models by having them create structures in Minecraft based on user prompts, such as "a crystal-clear wine glass filled with deep red wine" [2][5].
- The evaluation process involves user voting to select the best creations, with results revealed only after voting concludes [5][10].
- The project has garnered attention from major AI companies such as OpenAI, Google, and Anthropic, which provide computational resources but are not official collaborators [10][13].

Group 2: Advantages of Game-Based Evaluation
- Minecraft serves as a familiar and visually engaging platform, making it easier for the general public to understand and participate in AI assessments [7][8].
- The game environment offers a controlled testing space, enabling the evaluation of AI's reasoning and planning abilities in a safe manner [7][8].
- Game-based assessments can simulate real-world complexities, test AI's decision-making skills, and provide a repeatable environment for comparison [7][8].

Group 3: Current Status and Future Plans
- As of now, MC-Bench primarily tests the basic construction abilities of AI models, tracking their progress since the GPT-3 era [10][16].
- Future plans include expanding the benchmark to more complex tasks that require long-term planning and goal-oriented actions [10][16].
- The MC-Bench leaderboard shows Claude 3.7 Sonnet in first place and DeepSeek-R1 currently in third, indicating that the platform reflects users' real-world experiences with these models [14][16].
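The article does not describe how MC-Bench aggregates individual votes into its leaderboard, so the following is a hypothetical sketch of one common approach for crowdsourced head-to-head comparisons: Elo-style ratings updated from pairwise votes. The model names and the vote log below are placeholders, not real MC-Bench data.

```python
# Hypothetical sketch of turning crowdsourced pairwise votes into a leaderboard
# with Elo-style updates. MC-Bench's actual scoring method is not detailed in
# the article; the rating scheme and model names here are illustrative only.
from collections import defaultdict
from typing import Dict, List, Tuple


def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> Tuple[float, float]:
    """Standard Elo update for a single decided matchup."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta


def rank_models(votes: List[Tuple[str, str]], base: float = 1000.0) -> Dict[str, float]:
    """votes: list of (winner, loser) pairs from head-to-head build comparisons."""
    ratings: Dict[str, float] = defaultdict(lambda: base)
    for winner, loser in votes:
        ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
    return dict(ratings)


if __name__ == "__main__":
    # Illustrative vote log; these are placeholder model names, not MC-Bench results.
    vote_log = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
    for name, rating in sorted(rank_models(vote_log).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Under this kind of scheme, each vote nudges the winner's rating up and the loser's down in proportion to how unexpected the outcome was, so a stable ordering emerges once enough matchups have been judged.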