Adaptive Prompt Optimization Strategy

The first approach to combine the respective strengths of RL and SFT, dynamically guiding the model toward efficient reasoning training
机器之心· 2025-07-27 15:54
Core Viewpoint
- The article discusses the development and advantages of the GHPO algorithm framework, which integrates online reinforcement learning with imitation learning to improve the performance and training stability of large language models on complex reasoning tasks [3][5][21].

Group 1: Background and Current Challenges
- A new generation of large reasoning models, such as OpenAI-o3, DeepSeek-R1, and Kimi-1.5, has made significant progress on complex reasoning, largely through a training recipe known as ZERO-RL, which applies Reinforcement Learning with Verifiable Rewards (RLVR) to strengthen reasoning capabilities [1].
- Current RLVR methods, such as Group Relative Policy Optimization (GRPO), are limited by the gap between training-data difficulty and model capability: when problems are too hard, rewards become sparse and learning becomes unstable [2].

Group 2: GHPO Algorithm Framework
- The GHPO framework was developed jointly by Huawei's Hong Kong Research Institute, Noah's Ark Lab, and City University of Hong Kong to address the limitations of existing RLVR methods [3].
- GHPO significantly improves sample efficiency for edge models and alleviates the sparse-reward phenomenon in RLVR methods, achieving performance gains of 9% and 10% on specific benchmarks [5][18].

Group 3: Methodology and Innovations
- GHPO integrates standard (reference) problem-solving processes into the reinforcement learning loop, which mitigates the sparse-reward issue and improves the model's generalization on reasoning tasks [9][10].
- GHPO performs dynamic sample-difficulty assessment and switches adaptively between reinforcement learning and imitation learning, so that guidance is injected only when necessary [11][14] (a hedged sketch of this switching logic follows this summary).

Group 4: Experimental Results and Performance
- In experiments, GHPO outperformed GRPO by an average of 4.5%, with more stable gradient updates during training [18][19].
- The algorithm has been validated on multiple models, including Qwen2.5-Math-7B, demonstrating its effectiveness across training datasets with different difficulty distributions [19].

Group 5: Future Implications
- GHPO represents a significant step toward integrating reinforcement learning and supervised fine-tuning (SFT), offering a new perspective on the relationship between the two and on their deeper fusion in future AI research [21].
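
To make the adaptive switching concrete, below is a minimal Python sketch of the idea as summarized above: run a GRPO-style group rollout, detect samples where every rollout fails verification (the sparse-reward case), and only then fall back to guidance derived from the reference solution. The function names (`generate_group`, `verify_answer`), the prefix-hint format, and the escalation schedule are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the adaptive guidance idea described for GHPO, assuming a
# GRPO-style group rollout loop. All names (generate_group, verify_answer)
# and the hint schedule are hypothetical stand-ins, not the authors' code.
from statistics import mean, pstdev
from typing import Callable, List, Tuple

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: normalize each reward against its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        # All rollouts scored the same (e.g. all wrong): the learning signal
        # vanishes -- the sparse-reward failure mode GHPO targets.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

def ghpo_step(prompt: str,
              solution: str,
              generate_group: Callable[[str, int], List[str]],
              verify_answer: Callable[[str, str], float],
              group_size: int = 8,
              hint_ratios: Tuple[float, ...] = (0.25, 0.5, 0.75)):
    """One training sample: try plain RL first, escalate guidance only if needed."""
    completions = generate_group(prompt, group_size)
    rewards = [verify_answer(c, solution) for c in completions]

    # On-the-fly difficulty assessment: if at least one rollout is verified
    # correct, the sample is within reach and standard RL proceeds.
    if max(rewards) > 0.0:
        return completions, group_relative_advantages(rewards), 0.0

    # Adaptive prompt refinement: prepend a growing prefix of the reference
    # solution as a hint, shifting the update toward imitation of the guided
    # trajectory instead of discarding the sample.
    for ratio in hint_ratios:
        hint = solution[: int(len(solution) * ratio)]
        guided_prompt = f"{prompt}\nPartial solution hint:\n{hint}"
        completions = generate_group(guided_prompt, group_size)
        rewards = [verify_answer(c, solution) for c in completions]
        if max(rewards) > 0.0:
            return completions, group_relative_advantages(rewards), ratio

    # Even the strongest hint failed; return zero advantages rather than
    # push an uninformative gradient.
    return completions, [0.0] * group_size, hint_ratios[-1]
```

Because the hint is introduced only after a full group of unguided rollouts has failed verification, easy and medium samples still receive a pure RL signal, which matches the article's point that guidance is provided only when necessary.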