Overturning LLM post-training: Danqi Chen's team proposes RLMT, "Reinforcement Learning with Model-rewarded Thinking"
36Kr·2025-09-29 10:54

Core Insights

- The article discusses a new approach to strengthening the reasoning capabilities of large language models (LLMs): a framework called Reinforcement Learning with Model-rewarded Thinking (RLMT), which has models generate a detailed reasoning chain before producing their response [2][6][25]
- RLMT combines the strengths of two existing paradigms, Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR), enabling better performance on open-ended tasks [6][8][25]
- The research indicates that models trained with RLMT outperform existing models such as GPT-4o and Llama-3.1-8B-Instruct while using significantly fewer training prompts [3][16][25]

Summary by Sections

RLMT Framework

- RLMT requires the LLM to produce a detailed reasoning trajectory before generating its final response, and optimizes the entire process with online reinforcement learning [7][8] (see the training-step sketch at the end of this summary)
- The framework keeps the RLVR practice of reasoning before answering while scoring outputs with a preference-based reward model from RLHF, allowing models to learn to "think" on open-ended tasks [6][8]

Model Performance

- An 8-billion-parameter model trained with RLMT surpassed GPT-4o on chat and creative-writing tasks, reaching performance comparable to Claude-3.7-Sonnet [3][16]
- The Llama-3.1-8B model trained with RLMT achieved an average score of 50.4 on WildBench, outperforming models with nearly ten times as many parameters [16][17]

Training Methodology

- RLMT delivered significant gains even in the "Zero" setting, where RL is applied directly to the base model without a supervised fine-tuning stage: the Llama-3.1-8B-RLMT-Zero model scored 15.6, surpassing the Llama-3.1-8B-Instruct model, which was trained with over 25 million samples [18][25]
- The research emphasizes that prompt quality, reward-model strength, and the thinking process itself are critical to RLMT's success [20][25]

Implications for Future Research

- The findings suggest a paradigm shift in language-model training: strengthening a model's reasoning ability may be more effective than relying solely on large datasets [25][26]
- Future research could explore optimizing reasoning formats and extending RLMT to other domains such as logical reasoning and multimodal models [25][26]
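To make the training loop described under "RLMT Framework" concrete, here is a minimal sketch of one RLMT-style online-RL step: the policy first emits a thinking trace and then a final answer, a preference reward model scores the answer, and the policy is updated on the full thinking-plus-answer text. The interfaces (policy.generate, reward_model.score, policy.update) and the GRPO-style group-relative advantage are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of an RLMT-style training step (not the authors' code).
# Assumes the policy generates "<think> ... </think>" followed by the answer,
# and a preference reward model that scores only the final answer.

from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class Completion:
    thinking: str   # reasoning trajectory, e.g. "<think> ... </think>"
    response: str   # final user-visible answer


def sample_completion(policy, prompt: str) -> Completion:
    """Sample a full trajectory: the model 'thinks' first, then answers."""
    text = policy.generate(prompt)                      # assumed generate() interface
    thinking, sep, response = text.partition("</think>")
    return Completion(thinking + sep, response.strip())


def rlmt_step(policy, reward_model, prompts: List[str], group_size: int = 4):
    """One online-RL step: sample a group per prompt, score, update the policy."""
    batch = []
    for prompt in prompts:
        group = [sample_completion(policy, prompt) for _ in range(group_size)]
        # The reward model judges only the final response, but the policy is
        # updated on the whole thinking + response text, so the model learns
        # which reasoning trajectories lead to preferred answers.
        rewards = [reward_model.score(prompt, c.response) for c in group]
        mean_r = statistics.mean(rewards)
        std_r = statistics.pstdev(rewards) or 1.0
        for c, r in zip(group, rewards):
            advantage = (r - mean_r) / std_r            # group-relative advantage
            batch.append((prompt, c.thinking + c.response, advantage))
    policy.update(batch)                                 # assumed policy-gradient update
```

The key contrast with plain RLHF is visible in sample_completion: the reasoning trace is part of the sampled trajectory being optimized, rather than the model answering directly.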