Entropy-Controlled Reinforcement Learning

Multi-turn agent training hitting cascade failures? Entropy-controlled reinforcement learning breaks the impasse
机器之心· 2025-10-17 08:12
Core Insights
- The article identifies a significant training-instability problem that arises when training multi-turn LLM agents in sparse-reward environments, which it terms the "exploration-exploitation cascade failure" [2][5][7]
- The proposed solution is the Entropy-regularized Policy Optimization (EPO) framework, whose three core mechanisms are designed to stabilize training and improve performance [3][11][12]

Problem Identification
- The training dynamics of standard algorithms such as PPO and GRPO are highly unstable, with erratic entropy fluctuations and reward curves that stagnate despite extensive training [5][6][7]
- The failure mode specific to multi-turn sparse-reward environments unfolds in two stages: excessive exploration early in training destabilizes behavior, and the resulting uncertainty then propagates into later decisions [7][9][40]

Proposed Solution: EPO Framework
- EPO combines three synergistic mechanisms: multi-turn entropy regularization, an entropy smoothing regularizer, and adaptive weights [3][11][12]
- Multi-turn entropy regularization captures the temporal structure of agent interactions by averaging entropy across all turns within a trajectory (an illustrative objective is sketched below) [12]
- The entropy smoothing regularizer maintains a historical entropy reference to suppress the dangerous oscillations observed in sparse-reward settings (see the code sketch below) [15][17]
- The adaptive weight scheme dynamically balances exploration and exploitation over the course of training, directly countering the cascade failure (see the schedule sketch below) [19][21]

Experimental Results
- EPO delivers substantial performance gains: a 152.1% increase in success rate over baseline PPO in the ScienceWorld environment and a 19.8% increase in ALFWorld [24][42]
- Training curves show that PPO+EPO maintains a smooth upward reward trajectory, in contrast to the instability of the baseline methods [26][42]

Key Contributions
- The work formalizes the cascade-failure phenomenon unique to multi-turn sparse-reward environments and proposes the EPO framework to address it [41][42]
- EPO comes with theoretical guarantees of reduced entropy variance and outperforms standard maximum-entropy reinforcement learning [41][42]
- The findings establish that training multi-turn LLM agents requires fundamentally different entropy-control strategies than traditional reinforcement learning [42]
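
As a concrete reading of the turn-averaged entropy term, here is a minimal sketch of what such an objective could look like; the notation (trajectory τ of T turns, per-turn states s_k, entropy coefficient α) is illustrative and not taken from the paper:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[\sum_{t} r_t\Big]
          + \alpha\,\mathbb{E}_{\tau \sim \pi_\theta}\!\Big[\tfrac{1}{T}\sum_{k=1}^{T} \mathcal{H}\big(\pi_\theta(\cdot \mid s_k)\big)\Big],
\qquad
\mathcal{H}\big(\pi_\theta(\cdot \mid s)\big) = -\sum_{a}\pi_\theta(a \mid s)\log \pi_\theta(a \mid s)
```

The point of the sketch is that the entropy bonus is averaged over the turns of a whole trajectory rather than applied per token, matching the multi-turn interaction structure the article emphasizes.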
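
The entropy smoothing regularizer can likewise be illustrated with a small sketch: a running reference of past policy entropy, with a penalty when the current batch entropy deviates from it. The exponential moving average, the squared-deviation penalty, and all names and coefficients here are assumptions for illustration, not the paper's exact formulation.

```python
import torch


class EntropySmoother:
    """Keeps a historical entropy reference and penalizes large deviations.

    Illustrative sketch only: the EMA update and squared-deviation penalty
    are assumptions, not the paper's exact recipe.
    """

    def __init__(self, momentum: float = 0.9, beta: float = 0.1):
        self.momentum = momentum   # how strongly past entropy is remembered
        self.beta = beta           # weight of the smoothing penalty
        self.ref_entropy = None    # historical entropy reference

    def penalty(self, turn_entropies: torch.Tensor) -> torch.Tensor:
        """turn_entropies: per-turn policy entropies for the current batch."""
        current = turn_entropies.mean()
        if self.ref_entropy is None:
            self.ref_entropy = current.detach()
        # Penalize the current entropy drifting away from its running history,
        # damping the sharp oscillations seen in sparse-reward training.
        loss = self.beta * (current - self.ref_entropy) ** 2
        # Update the reference with the new observation (no gradient through it).
        self.ref_entropy = (self.momentum * self.ref_entropy
                            + (1.0 - self.momentum) * current.detach())
        return loss
```

In use, such a penalty would simply be added to the policy loss alongside the turn-averaged entropy bonus on each update.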
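
Finally, the adaptive-weight idea admits an equally simple sketch. The article only says the entropy weight is adapted to balance exploration and exploitation during training; the linear decay below, from a larger early value (more exploration) to a smaller late one (more exploitation), is one plausible instantiation, not the paper's actual scheme, and the function name and defaults are hypothetical.

```python
def adaptive_entropy_weight(step: int,
                            total_steps: int,
                            alpha_start: float = 1e-2,
                            alpha_end: float = 1e-3) -> float:
    """Illustrative schedule for the entropy-bonus weight (assumption, not EPO's scheme)."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    # Decay the weight over training: explore more early, exploit more late.
    return alpha_start + frac * (alpha_end - alpha_start)
```

Any such schedule plays the role the article describes for the adaptive weights: keeping early exploration from running away while letting the policy commit as training progresses.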