Multi-Agent Workflow

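DeepSeek-R1 Gets a Super Plug-in! "Humanity's Last Exam" Score Tops 30 for the First Time as Open-Source Solution from Shanghai Jiao Tong University and Others Crushes OpenAI and Google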
量子位 (QbitAI) · 2025-07-09 04:57
Core Insights
- A Chinese team from Shanghai Jiao Tong University and DP Technology scored 32.1 points on "Humanity's Last Exam" (HLE), setting a new record on a notoriously difficult AI benchmark [1][2][26].

Group 1: Achievement and Context
- The previous highest HLE score was 26.9, achieved by Kimi-Researcher and Gemini Deep Research [2].
- HLE was launched earlier this year and is known for its extreme difficulty: initially, no model scored above 10 points [34][39].
- The test includes over 3,000 questions across many disciplines, with a heavy emphasis on mathematics [39].

Group 2: Methodology and Tools
- The team developed two systems: the tool-enhanced reasoning agent X-Master and the multi-agent workflow system X-Masters [3][20].
- X-Master simulates the dynamic problem-solving process of human researchers, switching seamlessly between internal reasoning and external tool use [9][10].
- The core mechanism treats code as an interaction language: when the agent meets a problem it cannot solve through internal reasoning alone, it generates and executes code [11][14].

Group 3: Performance Metrics
- X-Masters achieved a record score of 32.1%, surpassing all existing agents and models [26].
- Each workflow component contributed to the gain: tool-enhanced reasoning improved baseline accuracy by 3.4 percentage points, iterative optimization added a further 9.5, and final selection led to the record score [29][30].
- In the biology/medicine category, X-Masters reached 27.6% accuracy, compared with 17.3% for Biomni and 26% for STELLA [31].

Group 4: Future Implications
- X-Masters aims to enhance the breadth and depth of reasoning through a decentralized-stacked approach, in which multiple agents collaborate to generate and refine solutions [20][22].
- This structured exploration-and-exploitation strategy is likened to concepts in reinforcement learning, suggesting room for further advances in AI reasoning capabilities [23].
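
The "code as an interaction language" mechanism described under Group 2 can be illustrated with a minimal sketch. This is not the team's implementation: the `<code>` and `<output>` delimiters, the `solve` and `run_code` helpers, and the scripted stand-in model are all assumptions made for illustration, since the article does not specify how X-Master marks or executes code. The sketch only shows the general pattern of alternating between model reasoning and code execution, with each execution result fed back into the context.

```python
import contextlib
import io
import re
from typing import Callable

# Hypothetical delimiters: the article does not say how code is marked.
CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_code(source: str) -> str:
    """Execute a snippet and return what it printed, or the error it raised."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # illustration only: a real agent needs a sandbox
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def solve(question: str, model: Callable[[str], str], max_turns: int = 5) -> str:
    """Alternate between model reasoning and code execution."""
    context = question
    reply = ""
    for _ in range(max_turns):
        reply = model(context)
        context += "\n" + reply
        match = CODE_PATTERN.search(reply)
        if match is None:
            return reply  # no code requested: treat the reply as the answer
        # Run the requested code and append its output to the context,
        # so the next model call can reason over the result.
        result = run_code(match.group(1))
        context += f"\n<output>{result}</output>"
    return reply  # turn budget exhausted: return the last reply as-is


def scripted_model(context: str) -> str:
    """Stand-in for an LLM: asks for a computation, then reads the result."""
    if "<output>" not in context:
        return "I need exact arithmetic. <code>print(2**10 + 7)</code>"
    output = re.findall(r"<output>(.*?)</output>", context, re.DOTALL)[-1]
    return f"The answer is {output}."


print(solve("What is 2**10 + 7?", scripted_model))  # prints "The answer is 1031."
```

In a real system the `model` callable would wrap an LLM such as DeepSeek-R1, and `run_code` would run in an isolated sandbox with access to external tools. Both are replaced by stubs here to keep the example self-contained.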