Agent微调复活?英伟达开源8B新模型带飞GPT-5:在HLE狂卷37分,还把成本打下来
NvidiaNvidia(US:NVDA) 量子位·2025-12-07 04:35

Core Insights - The article introduces a new paradigm in AI model orchestration, utilizing a smaller 8B model as a conductor to coordinate various tools and larger models, achieving better performance at lower costs [1][13]. Group 1: Model Performance - The Orchestrator-8B model achieved a score of 37.1% in the Humanity's Last Exam, outperforming GPT-5, which scored 35.1%, while also reducing computational costs by 2.5 times [1][9]. - In the FRAMES benchmark, Orchestrator-8B scored 76.3, compared to GPT-5's 74.0, and in the τ²-Bench, it scored 80.2 against GPT-5's 77.7 [9][10]. - The average cost for Orchestrator-8B was only 9.2 cents, with a latency of 8.2 minutes, significantly lower than GPT-5 [9][10]. Group 2: ToolOrchestra Framework - ToolOrchestra integrates various tools into a unified JSON interface, allowing the 8B conductor to think, call, and read feedback in multiple rounds until convergence [4]. - The framework employs GRPO reinforcement learning to maximize three rewards: correctness, efficiency, and user preference [4][5]. Group 3: User Preferences and Biases - The article highlights two biases in large models: self-enhancing bias, where models prefer to call upon similar models, and blind reliance on the strongest models, leading to increased costs [4][5]. - User preferences are taken into account, allowing the conductor to balance between local and cloud searches, speed, and cost [5][15]. Group 4: Application Scenarios - The Orchestrator-8B can be applied in various scenarios, such as internal Q&A and report analysis, where it defaults to local indexing and code execution for 80% of tasks [16]. - In research and development, it can set time and cost limits while considering source preferences [16]. - The framework allows for an end-to-end orchestration of functions and tools, moving away from rigid programming structures [16]. Group 5: Future Directions - The paper has made all code, models, and datasets publicly available for academic and industrial follow-up [14]. - The approach emphasizes a shift from relying solely on the strongest models to a more efficient use of diverse tools and models, enhancing cost-effectiveness and performance [15].