Core Viewpoint
- The article discusses a new method called RLMT (Reinforcement Learning with Model-rewarded Thinking) that combines the advantages of RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning with Verifiable Rewards), enabling an 8-billion-parameter model to outperform GPT-4o and rival Claude-3.7-Sonnet [1][4][11].

Group 1: Methodology and Performance
- RLMT requires the model to generate a Chain of Thought (CoT) before producing an answer; the final answer is then scored by a reward model trained on human preferences [5][17].
- The method can be applied directly to base models without supervised fine-tuning (SFT), significantly reducing post-training costs [6][22].
- In benchmark tests, the L3.1-8B-RLMT model achieved an average score of 84.3, surpassing larger models such as GPT-4o and Claude-3.7-Sonnet [7].

Group 2: Training Process
- The training process generates a reasoning trajectory from the user prompt and then scores the final answer with a reward model (a minimal sketch of this loop follows the summary) [14].
- Two training approaches are highlighted: Warm-start (using SFT data) and Zero (direct training without SFT), both of which lead to improved performance [21][19].
- RLMT shapes the model's reasoning style to resemble human thought processes, resulting in higher-quality dialogue and writing [19].

Group 3: Implications and Future Directions
- The introduction of RLMT sets a new baseline for general-purpose reinforcement learning and underscores the importance of defining preferences in the post-training era [8].
- The results indicate that smaller models can outperform larger ones, suggesting a shift in focus toward efficiency in model training [22].
- The research team, led by Chen Danqi, aims to further explore natural language understanding and reasoning capabilities in future studies [24][25].
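To make the described pipeline concrete, below is a minimal sketch of one RLMT-style update step, written against a HuggingFace-style causal LM and a sequence-classification reward model. The checkpoint names, the `<think>...</think>` template, and the plain REINFORCE update (rather than the GRPO/PPO-style objectives such methods typically use) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# Placeholder checkpoints -- swap in the actual policy / reward models you use.
POLICY_NAME = "my-org/base-8b-model"            # hypothetical base LM
REWARD_NAME = "my-org/preference-reward-model"  # hypothetical preference RM

policy_tok = AutoTokenizer.from_pretrained(POLICY_NAME)
policy = AutoModelForCausalLM.from_pretrained(POLICY_NAME)
reward_tok = AutoTokenizer.from_pretrained(REWARD_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_NAME)

optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

# Illustrative "think first" template; the delimiter is an assumption,
# not the paper's exact format.
TEMPLATE = "{prompt}\n<think>\n"
ANSWER_DELIM = "</think>"


def rlmt_step(prompt: str) -> float:
    """One simplified RLMT update: sample a CoT plus answer, score only the
    final answer with the preference reward model, then take a REINFORCE step."""
    query = TEMPLATE.format(prompt=prompt)
    query_ids = policy_tok(query, return_tensors="pt").input_ids

    # 1. Sample a reasoning trajectory followed by an answer.
    with torch.no_grad():
        full_ids = policy.generate(
            query_ids, do_sample=True, temperature=1.0, max_new_tokens=512
        )
    completion_ids = full_ids[:, query_ids.shape[1]:]
    completion = policy_tok.decode(completion_ids[0], skip_special_tokens=True)

    # 2. The reward model scores the prompt and the final answer only;
    #    the chain of thought stays hidden from it.
    answer = completion.split(ANSWER_DELIM)[-1].strip()
    rm_inputs = reward_tok(prompt, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        reward = reward_model(**rm_inputs).logits.squeeze().item()

    # 3. REINFORCE on the sampled completion tokens (real setups use a group
    #    or value baseline; omitting it keeps the sketch short).
    logits = policy(full_ids).logits[:, :-1, :]
    targets = full_ids[:, 1:]
    logprobs = (
        F.log_softmax(logits, dim=-1)
        .gather(-1, targets.unsqueeze(-1))
        .squeeze(-1)
    )
    completion_logprobs = logprobs[:, query_ids.shape[1] - 1:]
    loss = -(reward * completion_logprobs.sum())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

Under the "Zero" setting described above, `POLICY_NAME` would point at an untuned base checkpoint; under "Warm-start" it would point at a model first fine-tuned on a small amount of SFT data with thinking-style traces, with the same loop applied afterwards.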
Chen Danqi's new work: a third path for LLM reinforcement learning, with an 8B small model surpassing GPT-4o
量子位·2025-09-28 04:56