AI Evaluation

Zhai Guangtao (Shanghai Jiao Tong University / Shanghai AI Lab): When Evaluation No Longer Matters, AGI Has Arrived
机器之心· 2025-07-15 03:20
Core Viewpoint
- The article discusses the challenges and limitations of current AI evaluation systems, arguing that a perfect evaluation system would be equivalent to achieving Artificial General Intelligence (AGI) [3][20].

Evaluation System Challenges
- The primary issue with evaluation systems is "data contamination": publicly available benchmark tests are often included in the training data of subsequent models, undermining the diagnostic value of evaluations [5][6].
- The "atomization of capabilities" in evaluations leads to a fragmented understanding of intelligence, as complex skills are broken down into isolated tasks that may not reflect a model's true capabilities in real-world applications [7][8].
- There is a significant disconnect in embodied intelligence, where models perform well in simulated environments but poorly in real-world scenarios, highlighting the need for more realistic evaluation frameworks [9].

Evaluation Framework and Methodology
- The article proposes a "Human-Centered Evaluation" approach, focusing on how models enhance human task efficiency and experience rather than merely comparing model performance against benchmarks [12][13].
- The "EDGE" framework is introduced, which stands for Evolving, Dynamic, Granular, and Ecosystem, aiming to create a responsive evaluation system that adapts to AI advancements [13].
- The team is developing a high-quality internal question bank to mitigate data contamination, planning to gradually open-source questions to ensure reproducibility [15].

Future Directions and Goals
- The concept of "training-evaluation integration" is emphasized, where evaluations inform training processes, creating a feedback loop that aligns model development with human preferences [16][17].
- The ultimate goal is to establish a comprehensive evaluation framework that encompasses various aspects of AI, guiding the industry toward a more value-driven and human-centered development path [22][23].
- The article concludes that the success of AI evaluation systems lies in their eventual obsolescence: achieving AGI would mean that self-evaluation capabilities become inherent to the AI itself [24].
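The summary mentions a private question bank as a guard against data contamination but does not say how contamination would be detected. A common heuristic in the field is word-level n-gram overlap between benchmark items and the training corpus; the sketch below is a minimal illustration of that general idea, not the team's actual pipeline. The function names (`ngrams`, `contamination_rate`) and the sample data are hypothetical.

```python
from typing import Iterable, Set

def ngrams(text: str, n: int = 13) -> Set[str]:
    """Return the set of word-level n-grams in `text` (13 is a common choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items: Iterable[str],
                       training_corpus: Iterable[str],
                       n: int = 13) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams: Set[str] = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    items = list(benchmark_items)
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_grams)
    return flagged / len(items) if items else 0.0

# Hypothetical usage: flag questions that likely leaked into training data.
bank = ["What is the capital of France and why did it become the capital?"]
corpus = ["... what is the capital of france and why did it become the capital ..."]
print(contamination_rate(bank, corpus, n=8))  # 1.0 -> fully contaminated
```

Production-scale contamination audits typically hash the n-grams and stream over the corpus rather than materializing the full set in memory, but the scoring logic is the same.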
Minecraft Becomes AI's New "Exam Hall"? A High-School Senior Uses the Game to Evaluate AI: DeepSeek-R1 Ranks Third
36Kr· 2025-03-25 12:45
How would you measure how intelligent an AI is? Have it solve math problems, write code, or pass standardized exams? These approaches are rigorous, but ordinary users often struggle to get an intuitive sense of how models actually differ in capability.

Recently a high-school senior, Adi Singh, found a more entertaining way: evaluating AI with Minecraft. He created a website called MC-Bench that has different large AI models construct buildings inside Minecraft and then lets users vote on which model performed best.

The project reportedly drew rapid attention from AI researchers and developers. While large companies such as OpenAI, Google, Anthropic, and Alibaba were not directly involved in its development, they did provide AI compute resources to support it.

A high-school student creates a new AI evaluation benchmark

Today researchers typically evaluate AI models with standardized tests, but many of these tests give AI a "home-field advantage." Because of how AI models are trained, they tend to excel at specific, narrow problems, especially tasks that reward rote memorization or simple reasoning. AI models score highly on standardized exams such as the LSAT and math-reasoning tests, for example, yet in real-world use they still tend to make ...
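The excerpt says MC-Bench ranks models by user votes on their builds but cuts off before describing any scoring. Leaderboards driven by pairwise human preferences commonly use an Elo-style update; the sketch below shows that standard rule under the assumption that each vote arrives as a (winner, loser) pair. This is an illustration of the general technique, not MC-Bench's actual implementation, and the model names are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def elo_ratings(votes: List[Tuple[str, str]],
                k: float = 32.0,
                base: float = 1000.0) -> Dict[str, float]:
    """Update Elo scores from (winner, loser) vote pairs, in vote order."""
    ratings: Dict[str, float] = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected win probability for the winner given the current gap.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return dict(ratings)

# Hypothetical votes; these are placeholder names, not real MC-Bench results.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
for name, score in sorted(elo_ratings(votes).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Vote-order sensitivity is a known drawback of online Elo; some preference leaderboards instead fit a Bradley-Terry model over all votes at once to get order-independent scores.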