AEPO: Agentic Entropy-Balanced Policy Optimization, for steadier exploration and deeper reasoning
机器之心· 2025-11-01 04:22
Core Insights
- The article introduces Agentic Entropy-Balanced Policy Optimization (AEPO), a new algorithm designed to balance exploration and stability in multi-round reinforcement learning for intelligent agents [2][10][11].

Group 1: Algorithm Overview
- AEPO tackles the "high-entropy rollout collapse" and "high-entropy gradient clipping" problems of existing Agentic RL methods with two core mechanisms: dynamic entropy-balanced rollout sampling and entropy-balanced policy optimization (minimal sketches of both appear after this summary) [2][11].
- The algorithm shows significant improvements over seven mainstream reinforcement learning algorithms across 14 cross-domain benchmarks, most notably on deep-search tasks [4][12].

Group 2: Performance Metrics
- AEPO reached a Pass@5 score of 61.5% on deep-search tasks, outperforming methods such as ARPO and GRPO by an average of 5.8% (the Pass@k estimator behind such scores is sketched below) [36][37].
- The algorithm maintains training stability while improving sampling diversity and reasoning efficiency, offering a new optimization paradigm for scalable reinforcement training of general-purpose intelligent agents [4][12].

Group 3: Research Motivation
- AEPO is motivated by the need to strike a balance in high-entropy settings, where excessive exploration destabilizes training and can trap the policy in local optima [8][10].
- The research highlights the dual nature of high-entropy signals: they are necessary for exploration, yet they can disrupt rollout resource allocation and hinder learning [14][20].

Group 4: Future Directions
- Future work may extend AEPO to multi-modal inputs, complex tool ecosystems, and multi-agent reinforcement learning scenarios to strengthen collaborative strategies and performance [41].
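The first mechanism, dynamic entropy-balanced rollout sampling, can be pictured as budget-aware branching: monitor per-step entropy during a rollout, spawn extra branches at high-uncertainty steps, and cap consecutive branching so a run of high-entropy steps cannot exhaust the sampling budget (the "rollout collapse" failure mode above). The sketch below is a minimal illustration under these assumptions; `sample_step`, the thresholds, and the budget accounting are hypothetical stand-ins, not AEPO's actual implementation.

```python
import random

# Hypothetical stand-in: in the agentic setting a "step" is an LLM agent
# emitting a reasoning or tool-call action; here it returns a toy state
# plus a fake per-step entropy value.
def sample_step(state):
    return state + [random.random()], random.uniform(0.0, 2.0)

def entropy_balanced_rollout(budget=16, max_len=8,
                             entropy_threshold=1.2, max_consecutive=1):
    """Minimal sketch of dynamic entropy-balanced rollout sampling.

    A fixed sampling budget is split between fresh rollouts and branches
    spawned at high-entropy steps; the consecutive-branching cap keeps
    runs of high-entropy steps from consuming the whole budget. All
    thresholds are illustrative assumptions.
    """
    trajectories = []
    frontier = [[]]  # states still to be rolled forward (root = empty)
    while budget > 0 and frontier:
        state = frontier.pop()
        consecutive = 0
        for _ in range(max_len):
            state, step_entropy = sample_step(state)
            branch_ok = (step_entropy > entropy_threshold
                         and consecutive < max_consecutive
                         and budget > len(frontier) + 1)  # leave budget to finish pending branches
            if branch_ok:
                frontier.append(list(state))  # explore again from the uncertain step
                consecutive += 1
            else:
                consecutive = 0
        trajectories.append(state)
        budget -= 1  # one unit of budget per completed trajectory
    return trajectories
```

In the real algorithm the entropy comes from the policy's token distribution and the branch points are tool-call steps; the skeleton only shows how an entropy monitor and a consecutive-branching cap interact with a shared budget.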
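The second mechanism, entropy-balanced policy optimization, targets the "high-entropy gradient clipping" problem: PPO-style clipping zeroes the gradient exactly on the uncertain tokens the agent most needs to learn from. One common way to express such a fix, shown below, is a stop-gradient rewrite of the clipped term for high-entropy tokens plus an entropy-weighted advantage; the functional form, the thresholds (`h_thresh`, `alpha`), and the `ratio.detach()` trick are assumptions for illustration, not the paper's verbatim objective.

```python
import torch

def entropy_balanced_loss(logp_new, logp_old, advantage, entropy,
                          eps=0.2, h_thresh=1.2, alpha=0.1):
    """Sketch of an entropy-balanced clipped surrogate (assumed form).

    `entropy` is a detached per-token entropy; eps, h_thresh, and alpha
    are illustrative hyperparameters.
    """
    ratio = (logp_new - logp_old).exp()
    # Entropy-aware advantage: upweight learning on uncertain tokens.
    adv = advantage * (1.0 + alpha * entropy)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    loss = -torch.min(unclipped, clipped)
    # Standard clipping kills the gradient once the ratio leaves
    # [1 - eps, 1 + eps]. For high-entropy tokens, keep the clipped
    # *value* but route a rescaled gradient through the ratio:
    # ratio / ratio.detach() equals 1 in value yet still backpropagates.
    out_of_range = (ratio < 1 - eps) | (ratio > 1 + eps)
    high_entropy = entropy > h_thresh
    rescued = -(ratio / ratio.detach()) * clipped.detach()
    loss = torch.where(out_of_range & high_entropy, rescued, loss)
    return loss.mean()
```

A call like `entropy_balanced_loss(lp_new, lp_old, adv, H).backward()` on per-token tensors then delivers a (rescaled) learning signal on clipped high-entropy tokens instead of silently dropping them.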
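For the Pass@5 figures above: Pass@k measures the chance that at least one of k sampled rollouts solves a task. The snippet below is the standard unbiased estimator widely used for this metric (computed from n samples of which c succeed); whether the article's evaluation uses exactly this estimator is an assumption, and the numbers in the example are illustrative, not taken from the article.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k draws
    from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: always a hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 16 rollouts per task, 6 correct, evaluated at k=5.
print(round(pass_at_k(n=16, c=6, k=5), 3))  # 0.942
```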