Discriminative Constrained Optimization
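NeurIPS 2025 high-scoring paper | Strengthening reasoning LLMs with discriminative supervised learning to solve difficulty bias and entropy collapse
机器之心 · 2025-10-26 07:00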

Core Insights
- The article discusses the introduction of a novel framework called Discriminative Constrained Optimization (DisCO) aimed at enhancing large reasoning models (LRMs) by addressing inherent limitations of the Group Relative Policy Optimization (GRPO) method, particularly in binary reward settings [3][4][6][32].

Summary by Sections

Introduction to DisCO
- DisCO is proposed as a solution to the difficulty bias and entropy instability issues found in GRPO and its variants, allowing for the integration of advanced discriminative learning techniques to tackle data imbalance problems [4][6][32].

Advantages of DisCO
- DisCO significantly outperforms GRPO and its improved versions, achieving an average gain of 7% over GRPO and 6% over DAPO across six benchmark tasks with a 1.5-billion-parameter model [4][22].
- Notably, DisCO with a maximum response length of 8k outperforms GRPO with a maximum response length of 32k [4].

Methodology
- The framework eliminates difficulty bias by adopting a discriminative optimization objective, which maximizes the score of correct answers while minimizing that of incorrect ones [6][11].
- It employs non-clipped scoring functions and a constrained optimization approach to stabilize training dynamics, addressing issues of entropy instability [6][19][28].

Experimental Results
- DisCO consistently demonstrates superior performance across various models, including a 3.5% improvement over GRPO in 7-billion-parameter experiments [22].
- The training dynamics of DisCO show a steady increase in training rewards and stable generation entropy, contrasting with the instability observed in GRPO and its variants [27][28].

Ablation Studies
- The analysis of individual components within DisCO reveals that each component contributes significantly to its overall performance, with the use of non-clipped scoring functions being particularly critical [30].
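
The difficulty bias attributed to GRPO above can be illustrated numerically. The sketch below assumes GRPO's usual group-standardized advantage, (reward minus group mean) divided by group standard deviation, applied to binary rewards, and uses the population standard deviation. The function names are hypothetical and are not taken from the article. Under these assumptions, the average magnitude of the advantage for a question with pass rate p equals 2·sqrt(p(1−p)), so very easy and very hard questions contribute much less than questions near p = 0.5.

```python
import math

def grpo_advantages(rewards):
    """Group-standardized advantages for one question's rollouts (assumed GRPO form)."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population standard deviation; some implementations use the sample version.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        # All rollouts are correct or all are wrong: no learning signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

def mean_abs_advantage(rewards):
    """Average advantage magnitude, a rough proxy for the question's weight."""
    adv = grpo_advantages(rewards)
    return sum(abs(a) for a in adv) / len(adv)

if __name__ == "__main__":
    group_size = 8
    for num_correct in (1, 4, 7):
        rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
        p = num_correct / group_size
        closed_form = 2 * math.sqrt(p * (1 - p))
        print(f"p={p:.3f} weight={mean_abs_advantage(rewards):.4f} closed_form={closed_form:.4f}")
```

Running the script prints the empirical weight next to the closed form for a hard (p = 0.125), medium (p = 0.5), and easy (p = 0.875) question, which makes the imbalance across difficulty levels visible.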

Future Directions
- While the current focus is on binary rewards, the authors suggest that future research could explore the application of DisCO to non-binary reward scenarios, potentially utilizing novel scoring functions from supervised learning [32].
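
The discriminative objective and constraint described under Methodology can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scores are taken as precomputed per-response values (for example, length-normalized log-likelihoods, which is an assumption here), the constraint is modeled as a KL divergence bounded by a threshold `delta`, and it is enforced with a squared-hinge penalty weighted by `beta`. The function names, `delta`, `beta`, and their default values are all hypothetical.

```python
def discriminative_objective(pos_scores, neg_scores):
    """Mean score of correct responses minus mean score of incorrect ones."""
    pos_mean = sum(pos_scores) / len(pos_scores)
    neg_mean = sum(neg_scores) / len(neg_scores)
    return pos_mean - neg_mean

def penalized_loss(pos_scores, neg_scores, kl, delta=0.01, beta=10.0):
    """Loss to minimize for one question.

    Negates the discriminative objective and adds a squared-hinge penalty
    that is active only when the (assumed) KL constraint kl <= delta is violated.
    delta and beta are illustrative values, not taken from the article.
    """
    violation = max(0.0, kl - delta)
    return -discriminative_objective(pos_scores, neg_scores) + beta * violation ** 2

if __name__ == "__main__":
    # Hypothetical scores for two correct and two incorrect responses.
    correct = [-0.2, -0.4]
    incorrect = [-1.0, -1.2]
    print(penalized_loss(correct, incorrect, kl=0.005))  # constraint satisfied
    print(penalized_loss(correct, incorrect, kl=0.110))  # constraint violated
```

Because the scores enter the objective directly rather than through a clipped ratio, and each question contributes its own positive-versus-negative gap without a pass-rate-dependent weight, the sketch reflects the two properties the summary highlights. Swapping in a different scoring function only changes how `pos_scores` and `neg_scores` are computed.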