With Only 1/4 the Budget, Performance Surpasses the Baseline: Alibaba Gaode Proposes Tree-GRPO to Efficiently Tackle Agentic RL Challenges
机器之心·2025-10-13 23:56

Core Insights
- The article discusses Tree-GRPO, a method proposed by Alibaba Gaode that improves reinforcement learning (RL) for agents by replacing independent chain sampling with tree search at the agent-step level, addressing high rollout costs and sparse reward signals [2][4][23].

Group 1: Agentic RL Challenges
- Agentic RL faces two main challenges: high rollout costs (each trajectory spans thousands of tokens and multiple tool calls) and sparse supervision signals that score only the final outcome, making it hard to attribute success or failure to individual actions [12][19].
- Existing tree-search RL methods typically operate at the token or sentence level, which does not suit agents whose trajectories have a clear step-level semantic structure [8][19].

Group 2: Tree-GRPO Methodology
- Tree-GRPO uses "agent steps" as tree nodes, where each node corresponds to a complete think-action-observe step, enabling more effective trajectory sampling within a given budget [6][8].
- The method first initializes several independent trajectories, then samples existing nodes to expand into new branches, generating diverse agent trajectories under the same rollout budget [8][19].

Group 3: Performance and Results
- Across 11 knowledge-intensive question-answering tasks, Tree-GRPO consistently outperformed chain-based RL methods, including a 69% relative improvement in multi-hop QA on the smaller Qwen2.5-1.5B model [15][19].
- Under extremely limited budget conditions, the method achieved a 112% improvement over chain-based methods, demonstrating its efficiency [19][20].

Group 4: Future Directions
- The Tree-GRPO algorithm offers a new approach to agentic RL, mitigating high rollout budgets and sparse supervision signals and enabling more efficient, stable RL training in multi-turn agent tasks [23][24].
- The team emphasizes the importance of dynamically adjusting the balance between exploration and exploitation in RL training to optimize learning outcomes [24].
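The sampling scheme described above (initialize a few independent chains, then expand the tree from sampled agent-step nodes, and turn the sparse outcome reward into step-level signals at branch points) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the LLM rollout is replaced by placeholder steps with random leaf rewards, and the names `Node`, `tree_rollout`, and `assign_advantages` are hypothetical.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    """One agent step: a complete think-action-observe unit (here just a depth label)."""
    depth: int
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    reward: float = 0.0      # outcome reward, set only at leaves
    advantage: float = 0.0   # group-relative advantage vs. siblings

def expand_chain(node, max_depth, nodes):
    """Roll one chain of agent steps from `node` down to max_depth; return the leaf."""
    while node.depth < max_depth:
        child = Node(node.depth + 1, parent=node)
        node.children.append(child)
        nodes.append(child)
        node = child
    node.reward = random.random()  # stand-in for the task's final outcome reward
    return node

def tree_rollout(n_init=2, n_expand=4, max_depth=4):
    """Build a trajectory tree: a few independent chains first, then extra
    chains branched from sampled intermediate nodes, so the same rollout
    budget yields more (prefix-sharing) trajectories than chain sampling."""
    root = Node(0)  # virtual root = the shared prompt
    nodes = [root]
    leaves = [expand_chain(root, max_depth, nodes) for _ in range(n_init)]
    for _ in range(n_expand):
        anchor = random.choice([n for n in nodes if 0 < n.depth < max_depth])
        leaves.append(expand_chain(anchor, max_depth, nodes))
    return root, leaves

def subtree_return(node):
    """Mean leaf reward below `node` (its value estimate)."""
    if not node.children:
        return node.reward
    return sum(subtree_return(c) for c in node.children) / len(node.children)

def assign_advantages(node):
    """At each branch point, score every child against the mean of its
    siblings, converting the sparse outcome reward into a step-level
    preference signal without a separate process reward model."""
    if len(node.children) > 1:
        vals = [subtree_return(c) for c in node.children]
        mean = sum(vals) / len(vals)
        for c, v in zip(node.children, vals):
            c.advantage = v - mean
    for c in node.children:
        assign_advantages(c)

random.seed(0)
root, leaves = tree_rollout()
assign_advantages(root)
print(len(leaves))  # 6 trajectories from 2 full chains + 4 branched expansions
```

Each branched expansion reuses the anchor's shared prefix, which is the source of the budget savings, and sibling comparison at branch points is what supplies the step-level credit assignment the article highlights.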
