High-Scoring NeurIPS 2025 Paper | Reinforcing Reasoning LLMs with Discriminative Supervised Learning to Solve the Difficulty-Bias and Entropy-Collapse Problems
机器之心· 2025-10-26 07:00
Core Insights
- The article introduces a novel framework called Discriminative Constrained Optimization (DisCO) for enhancing large reasoning models (LRMs), which addresses inherent limitations of the Group Relative Policy Optimization (GRPO) method, particularly in binary-reward settings [3][4][6][32].

Summary by Sections

Introduction to DisCO
- DisCO is proposed as a solution to the difficulty-bias and entropy-instability issues found in GRPO and its variants, and it allows advanced discriminative-learning techniques to be brought in to handle data-imbalance problems [4][6][32].

Advantages of DisCO
- DisCO significantly outperforms GRPO and its improved variants, achieving an average gain of 7% over GRPO and 6% over DAPO across six benchmark tasks with a 1.5B-parameter model [4][22].
- Notably, DisCO trained with a maximum response length of 8k outperforms GRPO trained with a maximum response length of 32k [4].

Methodology
- The framework eliminates difficulty bias by adopting a discriminative optimization objective that raises the scores of correct answers while lowering those of incorrect ones [6][11].
- It employs non-clipped scoring functions and a constrained-optimization formulation to stabilize training dynamics and avoid entropy instability [6][19][28]; see the sketch after this summary.

Experimental Results
- DisCO consistently demonstrates superior performance across model scales, including a 3.5% improvement over GRPO in the 7B-parameter experiments [22].
- DisCO's training dynamics show steadily increasing training rewards and stable generation entropy, in contrast to the instability observed with GRPO and its variants [27][28].

Ablation Studies
- Ablating individual components of DisCO shows that each contributes significantly to overall performance, with the non-clipped scoring functions being particularly critical [30].

Future Directions
- The current work focuses on binary rewards; the authors suggest that future research could extend DisCO to non-binary reward settings, potentially drawing novel scoring functions from supervised learning [32].
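To make the Methodology section concrete, below is a minimal sketch of what a DisCO-style discriminative objective with a non-clipped score and a KL trust-region penalty could look like. The function names, the length-normalized log-likelihood score, the logistic pairwise surrogate, and the `kl_budget` penalty are illustrative assumptions for this digest, not the paper's exact formulation.

```python
# Minimal sketch of a discriminative, constrained objective in the spirit of DisCO,
# assuming binary rewards and a non-clipped per-sequence score (length-normalized
# log-likelihood). This is not the authors' reference implementation.
import torch


def sequence_scores(logprobs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Length-normalized log-likelihood of each sampled response (non-clipped score).

    logprobs, mask: (batch, seq) per-token log-probs of the sampled tokens and a
    validity mask for non-padding positions.
    """
    return (logprobs * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)


def disco_style_loss(logprobs, old_logprobs, mask, rewards,
                     kl_coef: float = 1.0, kl_budget: float = 0.02):
    """Discriminative objective: push scores of correct answers above incorrect ones,
    with a squared-penalty surrogate for a KL trust-region constraint."""
    scores = sequence_scores(logprobs, mask)
    pos, neg = scores[rewards > 0.5], scores[rewards <= 0.5]

    # Pairwise discriminative term (logistic / AUC-style surrogate). No clipping is
    # applied, so every sampled response keeps a nonzero gradient regardless of how
    # easy or hard the question turned out to be for the current policy.
    if len(pos) and len(neg):
        margin = pos.unsqueeze(1) - neg.unsqueeze(0)        # (num_pos, num_neg)
        disc_loss = torch.nn.functional.softplus(-margin).mean()
    else:
        disc_loss = scores.new_zeros(())                    # all-correct or all-wrong group

    # Constrained-optimization part: a Monte Carlo estimate of KL(pi_old || pi_theta)
    # (responses were sampled from pi_old), penalized beyond a small budget to keep
    # generation entropy from collapsing or exploding during training.
    token_kl = ((old_logprobs - logprobs) * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    kl_penalty = kl_coef * torch.clamp(token_kl.mean() - kl_budget, min=0) ** 2

    return disc_loss + kl_penalty
```

In such a setup the loss would replace the GRPO surrogate inside an otherwise standard RLVR training loop; only the per-group advantage normalization (the source of the difficulty bias) and the clipping disappear.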
Rejecting "Entropy Collapse" and "Entropy Explosion": This Research Teaches Large Models "Precise Exploration", and Reasoning Scores Soar
量子位· 2025-10-13 08:47
Core Insights
- The article reviews advances in large language models (LLMs) trained with RLVR (Reinforcement Learning with Verifiable Rewards), which since 2024 has driven significant breakthroughs in mathematical, coding, and scientific reasoning tasks [1][2].

Group 1: Challenges in RLVR Training
- RLVR faces a critical bottleneck of "exploration imbalance": exploration that is too limited leads to entropy collapse, while exploration that is too uncontrolled leads to entropy explosion [2][9].
- Traditional entropy regularization encourages exploration, but it can either let the policy converge rapidly to a deterministic strategy or produce chaotic outputs from excessive uncertainty [6][10].

Group 2: Proposed Solution - SIREN
- The research team introduces SIREN, a selective entropy regularization method built on three mechanisms: defining the exploration range, focusing on key decision points, and stabilizing the training process (see the sketch after this summary) [14][18].
- SIREN restricts the entropy computation to a core set of high-probability tokens, so that exploration happens only among semantically reasonable candidates [14][15].
- It identifies key decision points in the generated sequence, where entropy is significantly higher than average, and concentrates the exploration incentive on those positions [16].
- It anchors the entropy to a target within a reasonable range, preventing training instability [17].

Group 3: Experimental Validation
- Experiments show that SIREN significantly improves performance across models and datasets, reaching an average majority-vote accuracy (maj@k) of 54.6% on Qwen2.5-Math-7B and surpassing the strongest baseline by 4.8% [22][24].
- The effective exploration enabled by SIREN yields a qualitative improvement over traditional entropy regularization methods [25][32].
- SIREN maintains diversity in answers while avoiding collapse, contributing to a smoother and more controllable training process [28][30].

Group 4: Future Implications
- The study emphasizes that stable, controllable, and efficient exploration is key to unlocking the potential of large models and overcoming performance bottlenecks [35].
- The proposed selective exploration-control mechanism offers a feasible path for refining exploration strategies in future reasoning-model training paradigms [35].
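The sketch below illustrates how the three mechanisms described in Group 2 could be combined into a selective entropy regularizer. The top-k candidate mask, the quantile-based rule for "key decision points", and the fixed entropy target are assumptions made for illustration; the paper's exact thresholds and anchoring scheme may differ.

```python
# Minimal sketch of a SIREN-style selective entropy bonus. Parameter names
# (top_k, pos_quantile, target_entropy) are illustrative; padding masks are
# omitted for brevity.
import torch


def selective_entropy_bonus(logits: torch.Tensor,
                            top_k: int = 20,
                            pos_quantile: float = 0.8,
                            target_entropy: float = 1.0) -> torch.Tensor:
    """Entropy regularizer that (1) restricts entropy to the top-k candidate tokens,
    (2) applies only at positions whose entropy is high relative to the sequence,
    and (3) pulls entropy toward a target instead of maximizing it without bound.

    logits: (batch, seq, vocab) pre-softmax scores of the policy.
    Returns a scalar bonus that is largest when the selected entropy hits the target.
    """
    probs = logits.softmax(dim=-1)

    # (1) Exploration range: keep only the high-probability candidates at each step,
    # renormalize, and compute entropy over that restricted set.
    topk = probs.topk(top_k, dim=-1)
    p = topk.values / topk.values.sum(dim=-1, keepdim=True)
    token_entropy = -(p * p.clamp_min(1e-9).log()).sum(dim=-1)          # (batch, seq)

    # (2) Key decision points: select positions whose entropy is well above the
    # rest of the sequence and average the entropy over only those positions.
    threshold = token_entropy.quantile(pos_quantile, dim=-1, keepdim=True)
    key_positions = (token_entropy >= threshold).float()
    selected_entropy = (token_entropy * key_positions).sum() / key_positions.sum().clamp(min=1)

    # (3) Anchored target: reward closeness to target_entropy rather than raw
    # entropy, so training avoids both entropy collapse and entropy explosion.
    return -(selected_entropy - target_entropy) ** 2
```

In use, this bonus would be added to the RLVR policy objective with a small coefficient; the only difference from vanilla entropy regularization is where the entropy is measured and what value it is steered toward.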