Core Insights
- The article introduces Reinforcement Learning with Model-rewarded Thinking (RLMT), a method that integrates explicit reasoning into general chat models and improves their performance on open-ended tasks [5][7][26].

Summary by Sections

Introduction
- The article highlights recent work from Danqi Chen's team at Princeton University, which developed RLMT to bridge the gap between specialized reasoning capabilities and general conversational abilities in AI [2][5].

Methodology
- RLMT combines aspects of Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) to optimize language models for open-ended tasks [6][11].
- Training follows two routes: a warm-start that uses supervised fine-tuning (SFT) to teach the desired thinking format, and a "zero" setting that applies RLMT directly to base models without any prior SFT [12][14]. (A minimal sketch of the training loop appears after this summary.)

Results
- Models trained with RLMT outperformed non-thinking baselines on open-ended reasoning tasks, particularly on chat and creative-writing benchmarks [18][26].
- The article presents comparative results showing that RLMT-trained models surpass other models, including GPT-4o and Claude-3.7-Sonnet, on several chat benchmarks [19][20].

Conclusion
- RLMT extends the advantages of explicit reasoning from specialized domains to general conversational AI, suggesting its potential to reshape language-model training methodologies [26][29].
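To make the described recipe concrete, below is a minimal sketch of one RLMT-style update step under stated assumptions: the policy samples an explicit thinking trace plus a final response for an open-ended prompt, a preference reward model scores only the final response, and a group-relative (GRPO-like) advantage is computed for the policy update. The functions `generate_with_thinking` and `reward_model_score`, and the example prompt, are hypothetical placeholders, not the paper's actual code.

```python
# Sketch of an RLMT-style step: sample thinking+response, score only the
# response with a reward model, compute group-relative advantages.
# All model/reward calls here are stand-in stubs (assumptions), not real APIs.

import random
from dataclasses import dataclass


@dataclass
class Sample:
    thinking: str   # explicit reasoning trace, hidden from the reward model
    response: str   # final answer that the reward model scores


def generate_with_thinking(prompt: str) -> Sample:
    """Hypothetical stand-in for sampling a reasoning trace plus response."""
    return Sample(thinking=f"(plan for: {prompt})",
                  response=f"(answer to: {prompt})")


def reward_model_score(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a preference reward model.
    Only the visible response is scored, never the thinking trace."""
    return random.random()


def rlmt_group_advantages(prompt: str, group_size: int = 8) -> list[tuple[Sample, float]]:
    """Sample a group of thinking+response pairs for one prompt and return
    each sample with its group-relative advantage (reward minus group mean)."""
    samples = [generate_with_thinking(prompt) for _ in range(group_size)]
    rewards = [reward_model_score(prompt, s.response) for s in samples]
    baseline = sum(rewards) / len(rewards)
    return [(s, r - baseline) for s, r in zip(samples, rewards)]


if __name__ == "__main__":
    for sample, adv in rlmt_group_advantages("Write a toast for a retirement party."):
        # In actual training, the policy's log-probabilities over the full
        # thinking+response sequence would be reweighted by this advantage.
        print(f"advantage={adv:+.3f}  response={sample.response}")
```

In the warm-start route, an SFT stage would first teach the thinking-then-answer format; in the "zero" setting, this loop would be applied directly to the base model.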
Having both RLHF and RLVR: Danqi Chen's team's latest work extends reasoning ability to general intelligence
机器之心·2025-09-28 04:50