Scientific Reasoning
First break above 60 on HLE, "Humanity's Last Exam": Eigen-1, built on DeepSeek V3.1, clearly leads Grok4 and GPT-5
36Kr · 2025-09-28 12:05
Core Insights
- The Eigen-1 multi-agent system has achieved a historic breakthrough with Pass@1 accuracy of 48.3% and Pass@5 accuracy of 61.74% on the HLE Bio/Chem Gold test set, surpassing competitors like Google Gemini 2.5 Pro and OpenAI GPT-5 [1][6][27]
- The success is attributed to three innovative mechanisms: Monitor-based RAG, Hierarchical Solution Refinement (HSR), and Quality-Aware Iterative Reasoning (QAIR) [2][5][12]

Technical Innovations
- **Monitor-based RAG**: This mechanism eliminates the "tool tax" associated with traditional retrieval-augmented generation systems by continuously monitoring the reasoning flow and seamlessly integrating retrieved knowledge, resulting in a 53.5% reduction in token consumption and a 43.7% decrease in workflow iterations [8][10]
- **Hierarchical Solution Refinement (HSR)**: HSR introduces a hierarchical collaboration model that allows stronger solutions to absorb valuable insights from weaker ones, enhancing the overall quality of the output [12][15]
- **Quality-Aware Iterative Reasoning (QAIR)**: This mechanism adapts iteration depth to answer quality, ensuring efficient resource utilization by focusing further exploration on low-quality candidates [15][18]

Performance Metrics
- Eigen-1's performance demonstrates its superiority across benchmarks, achieving Pass@1 of 48.3% and Pass@5 of 61.74% on HLE Bio/Chem Gold, plus significantly higher scores on SuperGPQA Hard and TRQA; a sketch of the pass@k metric follows this summary [17]
- The model's accuracy improved from 25.3% to 48.3% as its components were integrated, showcasing the effectiveness of the innovative mechanisms [20][21]

Insights on Error Patterns
- Analysis reveals that 92.78% of errors stem from reasoning-process issues, indicating that the core challenge lies in integrating knowledge with reasoning rather than mere knowledge retrieval [18]

Implications for AI in Science
- The breakthrough signals a new paradigm for AI-assisted scientific research, suggesting that AI can effectively understand and reason through complex human knowledge, thus accelerating the research process [27]
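For readers parsing the Pass@1/Pass@5 numbers above, the sketch below shows the standard unbiased pass@k estimator popularized by OpenAI's HumanEval benchmark. Whether Eigen-1 aggregates its scores exactly this way is an assumption; treat this as an illustration of the metric, not the paper's code.

```python
# Unbiased pass@k estimator (HumanEval convention):
#   pass@k = 1 - C(n - c, k) / C(n, k)
# where n samples are drawn per problem and c of them are correct.
# Assumption: Eigen-1's Pass@1/Pass@5 follow this convention.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples is correct."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill k slots: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 attempts per question, 2 of them correct
print(pass_at_k(n=5, c=2, k=1))  # 0.4
print(pass_at_k(n=5, c=2, k=5))  # 1.0
```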
HLE "Humanity's Last Exam" breaks 60 for the first time! Eigen-1, built on DeepSeek V3.1, clearly leads Grok4 and GPT-5
量子位 (QbitAI) · 2025-09-28 11:54
Core Insights
- The article highlights a significant breakthrough in AI capabilities, with the Eigen-1 multi-agent system achieving a Pass@1 accuracy of 48.3% and Pass@5 accuracy of 61.74% on the HLE Bio/Chem Gold test set, surpassing major competitors like Google Gemini 2.5 Pro and OpenAI GPT-5 [1][5][39].

Technical Innovations
- The success of Eigen-1 is attributed to three innovative mechanisms: Monitor-based RAG, Hierarchical Solution Refinement (HSR), and Quality-Aware Iterative Reasoning (QAIR) [3][15][20].
- Monitor-based RAG reduces the "tool tax" associated with traditional retrieval-augmented generation systems, leading to a 53.5% reduction in token consumption and a 43.7% decrease in workflow iterations while maintaining higher accuracy; hypothetical sketches of this and of QAIR follow this summary [11][12][37].
- HSR introduces a hierarchical collaboration model that allows stronger solutions to absorb valuable insights from weaker ones, enhancing the overall problem-solving process [15][18].
- QAIR optimizes the iterative reasoning process by adjusting the depth of exploration based on the quality of answers, ensuring efficient resource utilization [20][21].

Performance Metrics
- Eigen-1's performance metrics indicate a significant lead over competitors, with Pass@1 and Pass@5 scores of 48.3% and 61.74% respectively on HLE Bio/Chem Gold, along with strong performance on SuperGPQA Hard and TRQA [27][22].
- The article provides a comparative table showcasing the performance of various models, highlighting Eigen-1's superior results [22].

Insights on Error Patterns
- Analysis reveals that 92.78% of errors stem from reasoning-process issues, indicating that the core challenge lies in seamlessly integrating knowledge with reasoning rather than mere knowledge retrieval [24][25].
- The article notes that execution and understanding errors are relatively rare, suggesting that models have matured in instruction comprehension [26].

Component Contribution Analysis
- The team conducted ablation studies to quantify the contribution of each component, showing that the baseline system achieved only 25.3% accuracy without external knowledge, while the full system reached 48.3% accuracy with efficient token usage [29][31].

Implications for AI in Science
- The breakthrough signifies a new paradigm for AI-assisted scientific research, suggesting that AI can become a powerful ally for scientists in tackling complex problems [39][40].
- The research team plans to continue optimizing the architecture and exploring applications in other scientific fields, indicating a commitment to advancing AI capabilities in research workflows [42].
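Neither summary specifies how Monitor-based RAG is implemented, so the following is a hypothetical sketch of the idea as described: a monitor watches the reasoning stream and splices retrieved evidence in directly, instead of the model pausing for explicit tool-call turns (the "tool tax"). All names here (generate_step, detect_knowledge_gap, retrieve) are stand-ins, not the Eigen-1 API.

```python
# Hypothetical reading of Monitor-based RAG: the monitor inspects each
# reasoning chunk and injects retrieved evidence inline, avoiding a
# separate tool-call round trip. Not the actual Eigen-1 implementation.
from typing import Callable, List, Optional

def monitor_based_rag(
    prompt: str,
    generate_step: Callable[[str], str],                   # LLM emits next reasoning chunk
    detect_knowledge_gap: Callable[[str], Optional[str]],  # monitor: query or None
    retrieve: Callable[[str], List[str]],                  # search over the corpus
    max_steps: int = 20,
) -> str:
    context = prompt
    for _ in range(max_steps):
        chunk = generate_step(context)
        context += chunk
        query = detect_knowledge_gap(chunk)  # monitor inspects every chunk
        if query is not None:
            evidence = retrieve(query)
            # Evidence is spliced directly into the stream; no explicit
            # tool-call turn, which is where token savings would come from.
            context += "\n[retrieved] " + " ".join(evidence) + "\n"
        if "FINAL ANSWER" in chunk:          # assumed stop marker
            break
    return context
```

QAIR can be read the same way: score candidate answers, stop early once quality clears a threshold, and spend further iterations only on weak candidates. Again a hedged sketch under stated assumptions; score_answer and refine are hypothetical.

```python
# Hypothetical QAIR loop: iteration depth adapts to answer quality, and
# only low-quality candidates are re-explored. score_answer and refine
# are stand-ins for whatever Eigen-1 actually uses.
def qair(candidates, score_answer, refine, threshold=0.8, max_rounds=3):
    best = None
    for _ in range(max_rounds):
        scored = [(score_answer(c), c) for c in candidates]
        best_score, best = max(scored, key=lambda sc: sc[0])
        if best_score >= threshold:
            return best  # quality is sufficient: stop iterating early
        # Re-explore only weak candidates; strong ones are carried forward
        candidates = [refine(c) if s < threshold else c for s, c in scored]
    return best
```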
AI cracks the Physics Olympiad: Mengdi Wang's team builds the Physics Supernova agent, beating the average score of human gold medalists
36Kr · 2025-09-16 08:20
In academic competitions, physics has long been regarded as one of the hardest challenges for artificial intelligence (AI), owing to its complex problems and heavy reasoning demands. Compared with language tasks, physics problems typically involve multiple stages such as image recognition, unit conversion, formula derivation, and approximate calculation, and thus test whether a system can genuinely understand and model the real world.

As AI pushes deeper into the real world and advances toward artificial general intelligence (AGI) and even artificial superintelligence (ASI), the ability to understand the world and solve problems through physical abstraction is becoming key to building high-level intelligent systems.

At the 2025 International Physics Olympiad, an AI system named Physics Supernova posted a remarkable result: 23.5 points (out of 30) on the three theory problems, ranking 14th among all 406 contestants, placing in the top 10% of humans on each of the three problems and exceeding the average score of the human gold medalists.

The system was built by Professor Mengdi Wang's team at Princeton University together with collaborators. Its two co-first authors are Jiahao Qiu, a Princeton PhD student, and Jingzhe Shi, a senior undergraduate in Tsinghua's Yao Class (gold medalist at the 2021 International Physics Olympiad, ranked 10th worldwide).

Paper link: https://arxiv.org/abs/2509.01659

With tools, AI can solve problems like a physicist
Ph ...
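The article's discussion of tool use is truncated above, but the stages it lists (unit conversion, formula derivation) map naturally onto symbolic tool calls. Below is an illustrative example of that kind of call using SymPy; it is not Physics Supernova's actual tool interface, which the excerpt does not describe.

```python
# Illustrative only: the kind of symbolic "tool call" a physics-solving
# agent might issue for unit conversion and formula derivation. The
# agent's real tool interface is not public.
import sympy as sp
from sympy.physics.units import convert_to, kilometer, hour, meter, second

# Unit conversion: 72 km/h expressed in m/s
v = convert_to(72 * kilometer / hour, meter / second)
print(v)  # 20*meter/second

# Formula derivation: projectile range R(theta) = v0**2*sin(2*theta)/g;
# find the launch angle that maximizes it.
v0, g, theta = sp.symbols("v0 g theta", positive=True)
R = v0**2 * sp.sin(2 * theta) / g
critical = sp.solve(sp.diff(R, theta), theta)
print(critical)  # solutions include pi/4, the classic optimum
```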
Largest-ever high-quality scientific-reasoning post-training dataset goes open source, quickly turning Qwen3 and other models into "scientists"
量子位 (QbitAI) · 2025-08-09 07:01
Core Viewpoint
- The release of MegaScience, a large-scale open-source dataset for scientific reasoning, aims to enhance the training and evaluation of general artificial intelligence systems in scientific domains, addressing the lack of high-quality training data for scientific reasoning tasks [1][9][15].

Group 1: Dataset Overview
- MegaScience consists of approximately 1.25 million question-answer pairs across disciplines including biology, chemistry, computer science, economics, mathematics, medicine, and physics [1][15].
- The dataset was downloaded over 4,600 times within a week of its release and ranks fourth on the HuggingFace Datasets Trending list, indicating significant interest from the academic and industrial research communities [7].

Group 2: Performance and Evaluation
- Models trained on MegaScience significantly outperform the corresponding official Instruct models on scientific reasoning tasks, demonstrating the dataset's effectiveness [3][16].
- The dataset exhibits good scalability, with performance gains becoming more pronounced as the size of the base models increases [3][16].

Group 3: Challenges Addressed
- Existing scientific reasoning datasets face challenges such as unreliable benchmark evaluations, inadequate decontamination, low-quality reference answers, and superficial knowledge distillation [10][11][13].
- MegaScience addresses these challenges through a systematic approach, including a comprehensive scientific reasoning evaluation framework and rigorous data decontamination; a minimal decontamination sketch follows this summary [13][15].

Group 4: Data Construction Process
- Construction involved collecting data from multiple public datasets, applying deduplication and decontamination strategies, and using various data selection techniques to ensure high-quality outputs [27][28][30].
- The TextbookReasoning dataset, a component of MegaScience, was created through a fully automated pipeline that extracted and refined question-answer pairs from approximately 120,000 university-level textbooks [14][19][20].

Group 5: Evaluation Framework
- The evaluation framework includes 15 representative benchmark tasks designed to comprehensively assess the scientific reasoning capabilities of language models [37][39].
- The framework optimizes answer extraction to improve the accuracy of evaluation results, ensuring fair comparison between models [39][41].

Group 6: Future Prospects
- Future research may explore integrating reinforcement learning with MegaScience to further enhance scientific reasoning capabilities, leveraging the dataset's high-quality reference answers [47][48].
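The decontamination mentioned in Groups 3 and 4 is described only at a high level; a common concrete realization is an n-gram overlap filter against benchmark questions, sketched below. The 13-gram threshold and tokenization here are illustrative assumptions, not MegaScience's published recipe.

```python
# Minimal n-gram decontamination sketch: drop a training question if it
# shares any 13-gram with a benchmark question. The n=13 threshold and
# the whitespace tokenization are assumptions, not the exact
# MegaScience pipeline.
import re

def ngrams(text: str, n: int = 13) -> set:
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_questions, benchmark_questions, n: int = 13):
    bench_grams = set()
    for q in benchmark_questions:
        bench_grams |= ngrams(q, n)
    # Keep only training items with zero n-gram overlap with the benchmarks
    return [q for q in train_questions if not (ngrams(q, n) & bench_grams)]
```

Precomputing the benchmark n-grams as one set keeps the per-item check a fast set intersection, which matters at MegaScience's 1.25-million-pair scale.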