Agent Training
Training rewards too sparse? CUHK and Meituan give agents a "process score"
机器之心· 2026-02-19 23:43
Core Insights
- The article discusses the limitations of traditional reward systems in training agents, which often consider only the final outcome and neglect the complexity of the reasoning process involved in multi-step tasks [2][3][6].
- The Reagent framework addresses this issue by providing detailed feedback on the entire reasoning process rather than just the final answer, thus enhancing agent training [5][10][12].

Group 1: Problem Identification
- Agents require long-horizon, fine-grained feedback, but most existing systems provide only coarse-grained rewards based on final outcomes [3].
- The traditional approach fails to differentiate between partially successful attempts and completely misguided efforts, discarding valuable learning signal [2][6].

Group 2: Solution Development
- The authors developed a reasoning reward model (Agent-RRM) that evaluates the entire trajectory of an agent's reasoning process, producing both scores and critiques [10][11].
- The model outputs an internal analysis, a critique for the agent, and an overall score, allowing a more nuanced assessment of the agent's performance [10][11].

Group 3: Implementation of the Reagent Framework
- The Reagent framework integrates textual critiques and scoring into the training process, allowing agents to learn from their reasoning [13][15].
- Three levels of integration are proposed:
  1. Adding critiques without modifying the model (Reagent-C) [15].
  2. Incorporating process scores as additional rewards (Reagent-R) [16].
  3. Training with both initial and revised responses (Reagent-U), reported as the most effective method [17][18].

Group 4: Experimental Results
- The Reagent-U method showed significant performance improvements across various tasks, with average scores reaching 43.7% on the GAIA benchmark, comparable to larger models [28][30].
- The integration of process scores made agents more willing to pursue correct reasoning paths even when the final answer was incorrect [27][28].

Group 5: Conclusion
- The Reagent framework successfully incorporates detailed feedback into agent training, demonstrating that even smaller models can achieve competitive results on complex tasks when given comprehensive reasoning evaluations [30][31].
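The Reagent-R idea above, folding a process score into the sparse outcome reward, can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: the function name `shaped_reward` and the weight values are hypothetical.

```python
# Hypothetical sketch of Reagent-R-style reward shaping: a dense process
# score from a reasoning reward model is blended with the sparse
# outcome reward. Names and weights are illustrative, not from the paper.

def shaped_reward(outcome_correct: bool, process_score: float,
                  outcome_weight: float = 1.0,
                  process_weight: float = 0.5) -> float:
    """Blend a binary outcome reward with a process score in [0, 1]."""
    outcome_reward = 1.0 if outcome_correct else 0.0
    return outcome_weight * outcome_reward + process_weight * process_score

# A wrong final answer with sound intermediate reasoning still earns
# partial credit, unlike a purely outcome-based reward.
partial = shaped_reward(outcome_correct=False, process_score=0.8)
dead_end = shaped_reward(outcome_correct=False, process_score=0.0)
assert partial > dead_end
```

The key property is that two failed trajectories are no longer indistinguishable: the one with better intermediate reasoning receives a strictly higher reward, which is exactly the learning signal the article says outcome-only rewards throw away.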
The era of interaction scaling arrives: Chuangzhi, Fudan, and ByteDance release AgentGym-RL, powered by Ascend, pioneering a new paradigm for agent training
机器之心· 2025-09-11 04:53
Core Insights
- The article emphasizes the transition of artificial intelligence from a "data-intensive" to an "experience-intensive" era, in which true intelligence arises from active exploration and experience accumulation in real environments [10][11][50].
- The AgentGym-RL framework represents a significant advance in training autonomous LLM agents for multi-turn decision-making, addressing the limitations of existing models that rely on single-turn tasks and lack diverse interaction mechanisms [12][50].

Group 1: Framework and Methodology
- AgentGym-RL is the first end-to-end framework for LLM agents that requires no supervised fine-tuning, supports interactive multi-turn training, and has been validated across various real-world scenarios [3][15].
- The framework integrates multiple environments and rich trajectory data, reducing complex environment configuration to modular operations and thereby facilitating effective experience-driven learning [13][19].
- The ScalingInter-RL method introduces a progressive interaction-round expansion strategy, allowing agents to gradually adapt to environments and optimize their interaction patterns while balancing exploration and exploitation [4][23][25].

Group 2: Performance and Results
- The research team achieved strong results with a 7B-parameter model, which demonstrated complex task-handling skills such as understanding task objectives and planning multi-step operations after extensive interaction training [5][29].
- Across testing environments, the model not only surpassed open-source models of over 100B parameters but also matched top commercial models such as OpenAI o3 and Google Gemini 2.5 Pro [5][29].
- The ScalingInter-RL model achieved an overall accuracy of 26.00% on web navigation tasks, significantly outperforming GPT-4o's 16.00% and matching DeepSeek-R1-0528 and Gemini-2.5-Pro [29][30].
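The progressive interaction-round expansion described above can be sketched as a simple staged schedule: the cap on interaction turns grows as training progresses, so the agent first masters short-horizon behavior before exploring longer rollouts. The function name, stage boundaries, and turn caps below are illustrative assumptions, not values from the AgentGym-RL paper.

```python
# Hypothetical sketch of a ScalingInter-RL-style schedule for the
# maximum number of interaction rounds per rollout. Stage boundaries
# and caps are made up for illustration.

def max_turns(step: int,
              stages=((0, 5), (1000, 10), (3000, 20))) -> int:
    """Return the interaction-turn cap in effect at a given training step.

    `stages` is a sequence of (start_step, turn_cap) pairs sorted by
    ascending start_step; the cap of the last stage whose start_step
    has been reached applies.
    """
    cap = stages[0][1]
    for start, turns in stages:
        if step >= start:
            cap = turns
    return cap

assert max_turns(0) == 5       # early training: short rollouts
assert max_turns(1500) == 10   # mid training: longer horizons
assert max_turns(5000) == 20   # late training: full interaction budget
```

The design intuition matches the article's exploration-exploitation framing: short early rollouts keep the credit-assignment problem tractable, and the budget is widened only once shorter interaction patterns have been optimized.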
Group 3: Future Directions
- Future research will focus on upgrading general capabilities so that agents can make efficient decisions in new environments and with unknown tools [51].
- The team aims to expand into more complex scenarios that closely resemble the physical world, such as robotic operations and real-world planning [52].
- They also intend to explore multi-agent collaborative training to unlock more complex group decision-making capabilities [52].