Policy Gradient
In RL training, why does decreasing entropy often signal convergence?
自动驾驶之心· 2025-10-29 00:04
Core Insights
- The article examines the relationship between entropy reduction and convergence in reinforcement learning (RL) training, with a particular focus on softmax policies and their implications for policy optimization [3][4].

Group 1: Entropy and Policy Convergence
- Entropy converging to zero indicates that the policy is collapsing toward a deterministic solution and will have difficulty escaping local optima, which is a key signature of convergence [3][4].
- The first theoretical result states that for softmax policies, the expected gradient norm of the policy at a state s is directly tied to its Rényi-2 entropy, so as the entropy approaches zero the expected gradient norm also approaches zero (see the numerical sketch after this summary) [6][7].
- The second theoretical result shows that as entropy decreases, the upper bound on how far the policy can move, measured in reverse KL divergence, also shrinks, meaning successive policies are forced to stay increasingly close together [8][16].

Group 2: Implications of Softmax Parameterization
- The distinctive curvature properties of the softmax parameterization cause learning efficiency to decline as entropy falls, which can trap the model in local optima [17].
- The article suggests that alternative approaches, such as Newton's method or the Hadamard parameterization, may help overcome the limitations imposed by softmax parameterization in RL training [17].
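As a concrete illustration of the first result, the following is a minimal numerical sketch, assuming a single-state (bandit-style) softmax policy; the reward vector and the logit-scaling schedule are illustrative choices, not taken from the paper. It computes the Rényi-2 (collision) entropy H2(pi) = -log sum_a pi(a)^2 together with the exact policy-gradient norm for the softmax parameterization, dJ/dtheta_i = pi_i * (r_i - E_pi[r]), and shows both quantities collapsing toward zero as the policy sharpens.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def renyi2_entropy(p):
    # Rényi-2 (collision) entropy: H2(pi) = -log sum_a pi(a)^2
    return -np.log(np.sum(p ** 2))

def softmax_pg_norm(logits, rewards):
    # Exact gradient of J(theta) = E_{a~pi}[r(a)] w.r.t. the softmax logits
    # at a single state: dJ/dtheta_i = pi_i * (r_i - E_pi[r]).
    p = softmax(logits)
    grad = p * (rewards - p @ rewards)
    return np.linalg.norm(grad)

rewards = np.array([1.0, 0.5, 0.2, 0.0])  # illustrative reward vector

# Sharpen the policy by scaling the logits: as it approaches a
# deterministic solution, both the entropy and the gradient norm vanish.
for scale in [1, 2, 5, 10, 50]:
    logits = scale * np.array([2.0, 1.0, 0.5, 0.0])
    p = softmax(logits)
    print(f"scale={scale:>3}  H2={renyi2_entropy(p):.4f}  "
          f"|grad|={softmax_pg_norm(logits, rewards):.4f}")
```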
Microsoft VP "holds class" on X, posting a continuously updated series on everything RL; a must-read for LLM practitioners
机器之心· 2025-05-26 01:28
Core Viewpoint
- The article covers an educational series on artificial intelligence launched by Nando de Freitas, focusing on reinforcement learning (RL) and its applications to large language models (LLMs) [1][2].

Summary by Sections

Introduction to AI Education
- Nando de Freitas aims to teach readers about AI through a series of posts on X, starting with reinforcement learning and gradually moving on to diffusion and flow-matching techniques [1][2].

Learning Types
- The article notes that there is still no settled, definitive account of unsupervised learning, supervised learning, and reinforcement learning [8][19].
- Supervised learning is characterized as basic imitation, requiring high-quality expert data to learn effectively [9].
- Reinforcement learning is framed as selective imitation, allowing agents to learn from suboptimal experience and improve on it [10][11].

Distributed Reinforcement Learning Systems
- Modern distributed RL systems consist of two main components, Actors and Learners: Actors interact with the environment and collect data, while Learners update the policy network from that data [23][24].
- Measuring the duration of each operation and the available communication bandwidth in such systems is emphasized as essential [24][27].

Offline Reinforcement Learning
- Offline RL is highlighted as uniquely valuable in scenarios such as post-training LLMs, where learning can be driven by historical data [28][29].

Single-step and Multi-step RL
- The article distinguishes single-step from multi-step RL problems: single-step settings concern a single immediate action, while multi-step settings require planning over a sequence of interactions [35][39].
- Multi-step RL is noted to be considerably harder, particularly because of credit assignment when multiple decisions jointly determine the outcome [40][41].

Policy Gradient and Techniques
- Policy-gradient methods are discussed, including baseline subtraction to reduce the variance of the reward signal (see the first sketch after this summary) [49][56].
- The article also covers the role of a KL-divergence penalty in keeping the post-trained policy close to the supervised fine-tuned policy [69].

Importance Sampling and PPO
- Importance sampling is introduced as the correction for off-policy sample bias, with Proximal Policy Optimization (PPO) as the key technique for keeping policy updates bounded (see the second sketch after this summary) [73][78].
- The combination of these techniques in training models such as DeepSeek-R1 illustrates the complexity of modern RL systems [81].

Future Directions
- Freitas plans to extend the discussion from single-step to multi-step RL, pointing to ongoing developments in the field [82].
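To make the baseline-subtraction point concrete, here is a minimal PyTorch sketch of a single-step, REINFORCE-style policy-gradient loss; the function name, tensor shapes, and the choice of the batch-mean return as the baseline are illustrative assumptions rather than Freitas's exact formulation.

```python
import torch

def reinforce_loss(logits, actions, returns, use_baseline=True):
    """Single-step REINFORCE-style loss with an optional baseline.

    logits:  [batch, num_actions] unnormalized policy outputs
    actions: [batch] indices of the sampled actions
    returns: [batch] reward (or return) observed for each sample

    Subtracting a baseline (here simply the batch-mean return) leaves the
    policy gradient unbiased while reducing its variance.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantages = returns - returns.mean() if use_baseline else returns
    # Negative sign: minimizing this loss performs gradient ascent
    # on E[advantage * log pi(action | state)].
    return -(advantages.detach() * chosen).mean()
```

Likewise, a sketch of the PPO clipped surrogate, in which the importance-sampling ratio between the current policy and the data-collecting (old) policy corrects for off-policy bias and is clipped so the update cannot stray far from the old policy; the signature and the clip_eps default are again assumptions for illustration.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current policy and the behavior (data-collecting) policy.
    The ratio exp(new_logp - old_logp) is the importance-sampling weight
    that corrects for off-policy samples; clipping it to
    [1 - clip_eps, 1 + clip_eps] keeps the update close to the old policy.
    """
    ratio = torch.exp(new_logp - old_logp.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective, then negate for minimization.
    return -torch.min(unclipped, clipped).mean()
```

The clipping serves a purpose similar to the KL penalty mentioned above: both keep the updated policy within a trust region around the policy that generated the data.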