Reinforcement Pre-Training (RPT)

The "next-token" paradigm is changing: reinforcement learning pre-training has arrived
机器之心· 2025-06-11 03:54
Core Viewpoint

The article discusses the emerging importance of Reinforcement Learning (RL) in enhancing AI model capabilities, particularly through a new paradigm called Reinforcement Pre-Training (RPT), which redefines next-token prediction as a reasoning task [3][10][24].

Summary by Sections

Introduction
- Yann LeCun previously viewed reinforcement learning as a minor component in AI, but its significance is growing in model enhancement [3].

RPT Overview
- RPT transforms the next-token prediction task into a reasoning process, allowing models to receive verifiable rewards for correct predictions [6][25].
- This method leverages vast amounts of unannotated text data for general reinforcement learning without requiring domain-specific labeled answers [9][26].

Advantages of RPT
- RPT offers inherent scalability and generality by utilizing large unannotated datasets for training [28].
- It minimizes the risk of reward hacking by using direct, rule-based reward signals [29].
- The internal reasoning process during pre-training allows for deeper understanding and generalization beyond mere token memorization [30].
- RPT enhances prediction accuracy by allocating more computational resources to each prediction step [31].

Experimental Results
- RPT outperforms baseline methods in next-token prediction accuracy across various difficulty levels [40][41].
- The performance of RPT-14B is comparable to that of larger models, indicating its effectiveness in capturing complex reasoning signals [43].
- RPT's accuracy improves reliably with increased training computation, demonstrating its scaling characteristics [45].
- Models pre-trained with RPT achieve higher performance ceilings when further trained with RLVR, showcasing their ability to transfer learned reasoning patterns to downstream tasks [47].

Zero-Shot Performance
- RPT-14B consistently surpasses R1-Distill-Qwen-14B across all benchmark tests, even outperforming larger models in next-token prediction [49].
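The rule-based reward signal described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and exact string matching is assumed here as a simplification of however the actual system compares a model's final prediction against the ground-truth continuation. The point it demonstrates is that the reward is a fixed rule over verifiable data, with no learned reward model to exploit.

```python
def rpt_reward(predicted: str, ground_truth: str) -> float:
    """Illustrative verifiable reward for Reinforcement Pre-Training.

    The model reasons internally, then emits a final next-token
    prediction; the reward is 1 if it matches the ground-truth
    continuation from the unannotated corpus, else 0. Because the
    rule is fixed and checked directly against the data, there is
    no learned reward model for the policy to game (reward hacking).
    """
    return 1.0 if predicted == ground_truth else 0.0


def mean_reward(predictions: list[str], references: list[str]) -> float:
    """Average verifiable reward over a batch of prediction/reference pairs."""
    rewards = [rpt_reward(p, r) for p, r in zip(predictions, references)]
    return sum(rewards) / len(rewards)
```

For example, `mean_reward(["the", "cat"], ["the", "dog"])` returns `0.5`: one exact match out of two predictions. Any text corpus can supply the references, which is what gives the method its scalability without domain-specific labels.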
Reasoning Mode Analysis
- The reasoning process of RPT-14B differs qualitatively from that of R1-Distill-Qwen-14B, indicating a more thoughtful approach rather than simple pattern matching [51].