10 Lines of Code, 15% Gains on AIME24/25! Unveiling the Entropy Mechanism in LLM Reinforcement Learning
机器之心· 2025-06-05 07:14
Core Insights
- The article discusses the entropy collapse problem in reinforcement learning for large language models (LLMs) and proposes solutions to enhance exploration capabilities during training [3][5][24].

Group 1: Entropy Collapse in Reinforcement Learning
- The core challenge in reinforcement learning is the trade-off between exploitation and exploration, where policy entropy is a key indicator of exploration potential [4].
- A significant finding is that policy entropy rapidly decreases to near zero within a few training steps, indicating a loss of exploration ability that leads to performance stagnation [4][5].
- The relationship between policy entropy and downstream performance is quantitatively analyzed, revealing that, absent entropy interventions, performance is entirely determined by policy entropy [4][5].

Group 2: Mechanisms Behind Entropy Changes
- The study identifies the driving factors behind changes in policy entropy during reinforcement learning, focusing on the covariance between action probabilities and their corresponding advantages [5][13].
- High-advantage, high-probability actions reduce policy entropy, while rare high-advantage actions increase it [13][17].

Group 3: Proposed Solutions for Enhancing Entropy
- The article introduces two simple yet effective entropy-enhancing reinforcement learning strategies, Clip-Cov and KL-Cov, which can be implemented with minimal code changes [5][22].
- Experimental results demonstrate that these methods significantly improve performance, achieving a 6.4% increase on Qwen2.5-32B and up to 15% on challenging datasets such as AIME24/25 [22][24].
- The research emphasizes that maintaining exploration capability is essential for scalable reinforcement learning, suggesting that merely increasing computational power yields limited benefits unless the entropy bottleneck is addressed [7][24].
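The covariance mechanism above can be illustrated with a minimal sketch in plain Python: each token's contribution to the covariance between log-probability and advantage is its centered product, and a Clip-Cov-style intervention drops the small fraction of highest-covariance tokens from the policy-gradient update. The function names, the toy numbers, and the `clip_fraction` value are hypothetical illustrations, not the paper's implementation:

```python
from statistics import mean

def token_covariance(logps, advantages):
    """Per-token contribution to Cov(log pi, A): the centered product
    (logp - mean(logp)) * (adv - mean(adv))."""
    lp_bar, adv_bar = mean(logps), mean(advantages)
    return [(lp - lp_bar) * (a - adv_bar) for lp, a in zip(logps, advantages)]

def clip_cov_keep_mask(logps, advantages, clip_fraction=0.2):
    """Clip-Cov-style mask: exclude the top `clip_fraction` of tokens by
    covariance from the gradient update, since high-probability,
    high-advantage tokens are the ones that push entropy toward zero."""
    cov = token_covariance(logps, advantages)
    k = max(1, int(clip_fraction * len(cov)))
    cutoff = sorted(cov, reverse=True)[k - 1]
    dropped = 0
    keep = []
    for c in cov:
        if c >= cutoff and dropped < k:
            keep.append(False)  # token excluded from the policy gradient
            dropped += 1
        else:
            keep.append(True)
    return keep

# Token 0 is both likely (logp near 0) and high-advantage, so it has the
# largest covariance and is the one Clip-Cov masks out.
keep = clip_cov_keep_mask([-0.1, -2.0, -0.5, -1.5], [1.0, 1.0, -1.0, -0.5])
print(keep)  # [False, True, True, True]
```

A KL-Cov variant would instead keep those tokens but attach a KL penalty toward the reference policy on them; both interventions target the same high-covariance minority of tokens.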
LLM RL Isn't Just Math and Code! A 7B Reward Model Handles Medicine, Law, Economics, and Every Discipline, Solving Problems Without Chain-of-Thought
量子位· 2025-04-02 07:40
Core Insights
- The article discusses a framework from Tencent and Soochow University that extends RLVR (reinforcement learning with verifiable rewards) training to disciplines beyond mathematics and coding, including medicine, chemistry, law, psychology, and economics [3][4].

Group 1: Framework and Methodology
- The framework uses a model-based soft reward system, which shows significant improvements in generalization, robustness, and scalability compared to traditional binary rule-based rewards [4].
- The research is based on the observation that when tasks have objective reference answers, different large language models exhibit high consistency in binary correct/incorrect judgments [7].
- The team distilled a 7B reward model from a 72B-parameter model (Qwen2.5-Instruct) without requiring domain-specific annotations, relying solely on data collected during the online exploration phase [9].

Group 2: Experimental Results
- The study sampled 6,000 questions from ExamQA, covering a wide range of subjects in science, engineering, and the humanities [12].
- The RM-7B model outperformed various baselines on free-form answer tasks, including base models, fine-tuned models, and rule-based reinforcement learning [14].
- RM-7B achieved an average score of 62.5 on multi-subject tasks, outperforming other methods in both binary and soft reward categories [15].

Group 3: Scalability and Future Research
- The research indicates that model-based rewards scale better as data volume increases, suggesting a more effective approach for handling unstructured reference answers [18].
- The authors note that while chain-of-thought (CoT) reasoning is beneficial in many scenarios, whether it is necessary for judging semantic equivalence between reference answers and model responses remains an open question [16].
- The study imposes no format constraints on reference answers or model responses, which reduces the labor involved in data standardization, but the role of format-related constraints and rewards needs further examination [17].
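The contrast between a binary rule-based reward and a model-based soft reward can be sketched as follows. The `score_fn` argument stands in for a learned reward model such as RM-7B; both function names and the toy scorer below are hypothetical illustrations, not the paper's code:

```python
def binary_rule_reward(response: str, reference: str) -> float:
    """Rule-based baseline: exact match after normalization gives 1.0,
    anything else 0.0. Brittle for free-form answers, where a correct
    paraphrase scores the same as a wrong answer."""
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0

def soft_model_reward(response: str, reference: str, score_fn) -> float:
    """Model-based soft reward: a reward model (injected as `score_fn`)
    estimates the probability that the response is semantically
    equivalent to the reference, and that probability is used directly
    as the RL reward instead of a hard 0/1."""
    p = score_fn(response, reference)
    return min(max(p, 0.0), 1.0)  # clamp to a valid reward in [0, 1]

# Toy stand-in scorer: full credit if the reference appears in the
# response, partial credit otherwise.
def toy_score(response, reference):
    return 1.0 if reference.lower() in response.lower() else 0.3

print(binary_rule_reward("The answer is water.", "water"))           # 0.0
print(soft_model_reward("The answer is water.", "water", toy_score)) # 1.0
```

The soft reward is what lets training extend to subjects like law or medicine, where reference answers are free-form text and exact-match rules would mark most correct responses wrong.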