Training rewards too sparse? CUHK and Meituan add a "process score" for agents
机器之心·2026-02-19 23:43

Core Insights
- Traditional reward systems for training agents typically score only the final outcome, overlooking the complexity of the reasoning process in multi-step tasks [2][3][6].
- The Reagent framework addresses this by providing detailed feedback on the entire reasoning process rather than just the final answer, improving agent training [5][10][12].

Group 1: Problem Identification
- Agents need long-horizon, fine-grained feedback, yet most existing systems provide only coarse rewards based on final outcomes [3].
- Outcome-only rewards cannot distinguish a partially successful attempt from a completely misguided one, discarding valuable learning signal [2][6].

Group 2: Solution Development
- The authors developed a reasoning reward model (Agent-RRM) that evaluates an agent's entire reasoning trajectory, producing both scores and critiques [10][11].
- The model outputs an internal analysis, a critique addressed to the agent, and an overall score, enabling a more nuanced assessment of the agent's performance [10][11].

Group 3: Implementation of the Reagent Framework
- The Reagent framework integrates textual critiques and scoring into the training loop, allowing agents to learn from their own reasoning [13][15].
- Three levels of integration are proposed:
  1. Appending critiques without modifying the model (Reagent-C) [15].
  2. Incorporating process scores as additional rewards (Reagent-R) [16].
  3. Training on both initial and revised responses (Reagent-U), highlighted as the most effective method [17][18].

Group 4: Experimental Results
- Reagent-U delivered significant gains across a range of tasks, reaching an average score of 43.7% on the GAIA benchmark, comparable to much larger models [28][30].
- With process scores integrated, agents became more willing to pursue correct reasoning paths even when the final answer turned out wrong [27][28].

Group 5: Conclusion
- The Reagent framework successfully incorporates detailed feedback into agent training, showing that even smaller models can achieve competitive results on complex tasks when given comprehensive reasoning evaluations [30][31].
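The Reagent-R idea above — blending a sparse outcome reward with a dense process score — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `agent_rrm_score` is a hypothetical stand-in for the actual Agent-RRM model, and the weight `ALPHA` is an assumed hyperparameter.

```python
# Hypothetical sketch of Reagent-R-style reward shaping: combine a sparse
# 0/1 outcome reward with a dense process score over the trajectory.
# agent_rrm_score and ALPHA are illustrative, not from the paper.

ALPHA = 0.3  # assumed weight of the process score relative to the outcome


def agent_rrm_score(trajectory: list[str]) -> float:
    """Stand-in for Agent-RRM: score the whole reasoning trajectory in [0, 1].

    As a placeholder, reward the fraction of non-empty reasoning steps;
    the real model would emit an analysis, a critique, and an overall score.
    """
    if not trajectory:
        return 0.0
    return sum(1.0 for step in trajectory if step.strip()) / len(trajectory)


def shaped_reward(outcome_correct: bool, trajectory: list[str]) -> float:
    """Outcome-only training would return 1.0 or 0.0; the process term means
    sound reasoning with a wrong final answer still earns partial credit."""
    outcome = 1.0 if outcome_correct else 0.0
    process = agent_rrm_score(trajectory)
    return (1 - ALPHA) * outcome + ALPHA * process


# A wrong final answer with mostly sound steps is no longer scored zero:
r = shaped_reward(False, ["search the web", "read the page", ""])
```

The blend preserves the ordering incentive (a correct outcome still dominates) while giving the policy a gradient toward better intermediate reasoning, which is the behavior change reported in the experiments.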