HLE "Humanity's Last Exam" breaks 60 points for the first time! Eigen-1, built on DeepSeek V3.1, significantly leads Grok4 and GPT-5
量子位 (QbitAI) · 2025-09-28 11:54
Core Insights
- The article highlights a significant breakthrough in AI capabilities with the Eigen-1 multi-agent system achieving a Pass@1 accuracy of 48.3% and Pass@5 accuracy of 61.74% on the HLE Bio/Chem Gold test set, surpassing major competitors like Google Gemini 2.5 Pro and OpenAI GPT-5 [1][5][39].

Technical Innovations
- The success of Eigen-1 is attributed to three innovative mechanisms: Monitor-based RAG, Hierarchical Solution Refinement (HSR), and Quality-Aware Iterative Reasoning (QAIR) [3][15][20].
- Monitor-based RAG reduces the "tool tax" associated with traditional retrieval-augmented generation systems, leading to a 53.5% reduction in token consumption and a 43.7% decrease in workflow iterations while maintaining higher accuracy [11][12][37].
- HSR introduces a hierarchical collaboration model that allows stronger solutions to absorb valuable insights from weaker ones, enhancing the overall problem-solving process [15][18].
- QAIR optimizes the iterative reasoning process by adjusting the depth of exploration based on the quality of answers, ensuring efficient resource utilization [20][21].

Performance Metrics
- Eigen-1's performance metrics indicate a significant lead over competitors, with Pass@1 and Pass@5 scores of 48.3% and 61.74% respectively in HLE Bio/Chem Gold, and also strong performances in SuperGPQA Hard and TRQA tasks [27][22].
- The article provides a comparative table showcasing the performance of various models, highlighting Eigen-1's superior results [22].

Insights on Error Patterns
- Analysis reveals that 92.78% of errors stem from reasoning process issues, indicating that the core challenge lies in seamlessly integrating knowledge with reasoning rather than mere knowledge retrieval [24][25].
- The article notes that execution and understanding errors are relatively low, suggesting that models have matured in instruction comprehension [26].
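The quality-aware stopping rule that the summary attributes to QAIR can be illustrated with a short sketch. This is not Eigen-1's implementation: the source gives no algorithmic detail, so the function name `quality_aware_refine`, the `score_fn` and `refine_fn` callbacks, the 0.8 threshold, and the three-round budget are all assumptions chosen for illustration.

```python
from typing import Callable, Tuple


def quality_aware_refine(
    answer: str,
    score_fn: Callable[[str], float],  # hypothetical evaluator, score in [0, 1]
    refine_fn: Callable[[str], str],   # hypothetical reviser, returns a new answer
    threshold: float = 0.8,            # assumed quality bar, not from the source
    max_rounds: int = 3,               # assumed iteration budget, not from the source
) -> Tuple[str, int]:
    """Refine `answer` only while its score stays below `threshold`.

    Returns the final answer and the number of refinement rounds spent.
    """
    rounds = 0
    while rounds < max_rounds and score_fn(answer) < threshold:
        answer = refine_fn(answer)
        rounds += 1
    return answer, rounds


# Toy stand-ins: quality grows with length, each refinement appends one character.
def toy_score(a: str) -> float:
    return min(len(a) / 5, 1.0)


def toy_refine(a: str) -> str:
    return a + "+"


good, good_rounds = quality_aware_refine("abcde", toy_score, toy_refine)
weak, weak_rounds = quality_aware_refine("ab", toy_score, toy_refine)
print(good, good_rounds)  # an answer already above the bar costs no extra rounds
print(weak, weak_rounds)  # a weaker answer is refined until it clears the bar
```

The point of the sketch is the resource argument in the summary: answers that already score well exit immediately, so the iteration budget is spent only on weak candidates.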
Component Contribution Analysis
- The team conducted ablation studies to quantify the contributions of each component, demonstrating that the baseline system achieved only 25.3% accuracy without external knowledge, while the full system reached 48.3% accuracy with efficient token usage [29][31].

Implications for AI in Science
- The breakthrough signifies a new paradigm for AI-assisted scientific research, suggesting that AI can become a powerful ally for scientists in tackling complex problems [39][40].
- The research team plans to continue optimizing the architecture and exploring applications in other scientific fields, indicating a commitment to advancing AI capabilities in research workflows [42].
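The accuracy figures quoted in this summary are Pass@1 and Pass@5 scores. The source does not say how they were estimated. One widely used approach, shown here as an assumption rather than as the team's method, is the unbiased pass@k estimator: draw n samples per question, count the c correct ones, and compute 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` and the sample counts below are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question.

    n: samples drawn, c: samples that were correct, k: attempts allowed.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples exist, so any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per question."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


# Made-up counts for three questions, five samples each (not data from the article).
toy_results = [(5, 0), (5, 1), (5, 5)]
print(mean_pass_at_k(toy_results, k=1))
print(mean_pass_at_k(toy_results, k=5))
```

With k = 1 the estimator reduces to the fraction of correct samples per question, which is why Pass@5 is always at least as high as Pass@1 on the same samples.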