Core Viewpoint
- The article introduces CPGD (Clipped Policy Gradient Optimization with Policy Drift), a new reinforcement learning algorithm that significantly improves training stability and performance on multi-modal reasoning tasks, outperforming established algorithms such as GRPO and RLOO [1][6][11].

Group 1: Algorithm Development
- CPGD alleviates training instability and improves performance, achieving an average gain of 11% over models trained with GRPO [1][14].
- The MM-Eureka-CPGD-7B model improves by 21.8% on the MMK12 test set over the base model Qwen2.5-VL-7B, demonstrating strong generalization [1][14].
- The algorithm applies a logarithmic treatment to the policy ratio and adds a policy drift term to stabilize training and bound policy updates, proving more effective than existing methods [8][11].

Group 2: Model Performance
- The MM-Eureka-CPGD-32B model surpasses the o1 model across multiple subjects, despite being trained solely on mathematical datasets [2][14].
- The MM-Eureka series has attracted significant attention, with over 10,000 downloads and nearly 100 citations since release [3][14].
- Performance metrics show MM-Eureka-CPGD-7B outperforming leading models such as OpenAI-o1 and GPT-4o across multiple datasets [13][15].

Group 3: Data and Framework
- The MMK12 dataset, containing over 15,000 multi-modal math reasoning questions, addresses the problems of homogeneous question types and inaccurate answers, and has become a key benchmark for multi-modal reasoning tasks [16][17].
- The multi-modal reinforcement learning framework built on OpenRLHF supports a variety of models and algorithms, improving scalability and stability for large-scale training [4][5].
- MM-PRM (Multi-modal Process Reward Model) focuses on the reasoning process itself, providing a structured way to evaluate and guide model inference [18][21].
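The core mechanism described above, clipping a logarithmically treated policy ratio and adding a policy drift penalty, can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the authors' exact formulation: the function name `cpgd_loss`, the clipping range `eps`, the drift weight `beta`, and the squared-log-ratio form of the drift term are all assumptions for illustration.

```python
import numpy as np

def cpgd_loss(logp_new, logp_old, advantages, eps=0.2, beps=0.01):
    """Illustrative CPGD-style objective (assumed form, not the paper's exact loss).

    Clips the *log* of the policy ratio (rather than the raw ratio, as in
    PPO/GRPO) and adds a drift penalty that discourages the new policy
    from moving far from the old one.
    """
    beta = beps
    # log pi_new(a|s) - log pi_old(a|s): the log of the importance ratio
    log_ratio = logp_new - logp_old
    # clip in log space, so extreme ratios cannot blow up the gradient
    clipped = np.clip(log_ratio, -eps, eps)
    # pessimistic surrogate: take the lower objective per sample
    surrogate = np.minimum(log_ratio * advantages, clipped * advantages)
    # policy drift: a KL-like penalty on how far the policy has moved
    drift = np.mean(log_ratio ** 2)
    # minimize negative surrogate plus weighted drift
    return -np.mean(surrogate) + beta * drift
```

When the new and old log-probabilities coincide, both the surrogate and the drift term are zero, so the loss is zero; as the log-ratio grows past `eps`, the clipped branch caps the surrogate while the drift term keeps growing, which is the stabilizing behavior the article attributes to CPGD.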
Group 4: Future Directions
- The combination of PRM and reinforcement learning is seen as a promising area for further exploration, aiming to enhance model robustness and interpretability in complex reasoning tasks [22][24].
- The team plans to continue advancing multi-modal reasoning training and systematic optimization, inviting community participation in the development [25].
Trained only on math, yet beats o1 in physics, chemistry, and biology! A new reinforcement learning algorithm delivers significant performance gains and alleviates training collapse
量子位 (QbitAI) · 2025-06-23 04:45