Pioneering a Mid-training Paradigm to Unlock the Secrets of RL: Llama Finally Catches Up with Qwen!
机器之心·2025-06-30 09:49

Core Insights - A recent research paper from Shanghai Chuangzhi Academy and Shanghai Jiao Tong University examines why foundational language models such as Llama and Qwen perform so differently under reinforcement learning (RL) training, and proposes a mid-training strategy that significantly improves Llama's compatibility with RL, narrowing its performance gap with Qwen [1][10][11].

Research Background - Introducing large-scale RL into language models has notably improved complex reasoning, particularly on challenging tasks such as mathematical competitions. However, only the Qwen series has shown substantial gains from RL, raising the question of which foundational characteristics determine how well a model adapts to RL scaling [9][10].

Mid-Training Strategy - The research team ran extensive mid-training experiments on the Llama-3.2-3B model, using controlled mid-training to isolate the key factors that influence downstream RL performance. They found that high-quality mathematical datasets significantly improve RL outcomes, while low-quality data can destabilize training [14][16][18].

Data Quality and Preprocessing - To support large-scale ablation studies and mid-training, the team built MegaMath-Web-Pro-Max, a dataset roughly 5.5 times larger than its predecessor, MegaMath-Web-Pro. The corpus was refined with a custom classifier to keep only high-quality documents (a hedged sketch of this kind of filtering appears below) [19][25].

Two-Stage Training Approach - The paper proposes a two-stage mid-training strategy: a first phase that builds a stable reasoning foundation, followed by specialized training that improves the model's adaptability to RL (see the configuration sketch below). This approach yielded significant performance gains across a range of mathematical reasoning benchmarks [27][30].

Performance Improvements - The resulting OctoThinker base model series shows a 10%-20% improvement on mathematical reasoning tasks over the original Llama models. On benchmarks such as GSM8K and MATH500, OctoThinker models achieved marked gains in accuracy and reasoning depth [31][32][33].

Future Directions - The research team plans to further refine mathematical pre-training datasets, design RL-friendly foundational models without relying on strong long-chain reasoning models, and expand the OctoThinker family with new branches such as tool-integrated reasoning [38].
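The article says only that MegaMath-Web-Pro-Max was refined with a "custom classifier"; it does not name the classifier or the filtering criteria. The following is a minimal sketch of how classifier-based quality filtering of a web corpus is commonly done, here using fastText; the file names, label scheme, and confidence threshold are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of classifier-based quality filtering for a math web corpus.
# fastText, the file names, labels, and threshold are assumptions for illustration.
import fasttext

# Training data: one example per line, "__label__keep <doc>" or "__label__drop <doc>".
clf = fasttext.train_supervised(
    input="quality_train.txt",  # assumed seed set of good vs. poor math documents
    epoch=5,
    lr=0.5,
    wordNgrams=2,
)

KEEP_THRESHOLD = 0.8  # assumed confidence cutoff


def keep_document(text: str) -> bool:
    """Return True if the classifier is confident the document is high quality."""
    # fastText cannot predict on text containing newlines, so flatten first.
    labels, probs = clf.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__keep" and float(probs[0]) >= KEEP_THRESHOLD


# Stream a raw web dump (assumed one document per line) and retain passing documents.
with open("raw_math_web.txt", encoding="utf-8") as src, \
     open("megamath_filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if keep_document(line):
            dst.write(line)
```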
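The two-stage strategy is described only at a high level: a stable reasoning-foundation phase followed by a specialization phase. Below is a minimal sketch of how such a schedule could be written down as configuration; the token budgets, mixture weights, and all dataset names other than MegaMath-Web-Pro-Max are hypothetical placeholders, not the paper's actual recipe.

```python
# Hypothetical configuration sketch for a two-stage mid-training schedule.
# Token budgets, mixture weights, and most dataset names are assumed values.
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str
    token_budget: float        # training tokens for this stage (assumed values)
    mixture: dict[str, float]  # dataset name -> sampling weight, must sum to 1.0


@dataclass
class MidTrainingPlan:
    base_model: str
    stages: list[Stage] = field(default_factory=list)


plan = MidTrainingPlan(
    base_model="Llama-3.2-3B",
    stages=[
        # Stage 1: build a stable reasoning foundation on high-quality math web text.
        Stage(
            name="stable-foundation",
            token_budget=100e9,  # assumed
            mixture={"MegaMath-Web-Pro-Max": 0.9, "general-web": 0.1},  # assumed
        ),
        # Stage 2: specialize the model (e.g., toward chain-of-thought-style data)
        # so that it adapts better to subsequent RL training.
        Stage(
            name="specialization",
            token_budget=20e9,  # assumed
            mixture={"MegaMath-Web-Pro-Max": 0.6, "cot-style-data": 0.4},  # assumed
        ),
    ],
)

# Sanity-check and print the plan.
for stage in plan.stages:
    assert abs(sum(stage.mixture.values()) - 1.0) < 1e-6, "mixture weights must sum to 1"
    print(f"{stage.name}: {stage.token_budget / 1e9:.0f}B tokens, mixture={stage.mixture}")
```

Expressing the schedule as data like this makes it easy to branch the second stage into several variants (the OctoThinker "family" mentioned above) while holding the first stage fixed.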