GRPO Training

Major speedup for DeepSeek-style GRPO training! ModelScope open-sources a full-pipeline solution supporting multi-modal training, training acceleration, and end-to-end evaluation
量子位 (QbitAI) · 2025-03-09 04:45
Core Viewpoint

- The article covers ModelScope's GRPO training tooling, centered on the SWIFT framework and the optimizations it introduces to improve training efficiency and stability in reinforcement learning [1][10].

Group 1: GRPO Training Enhancements

- GRPO builds on the PPO algorithm but removes the separate value model, estimating advantages from groups of sampled responses instead, which improves training stability and maintainability [1] (a minimal sketch of the group-normalized advantage follows this summary).
- The SWIFT framework has been optimized for GRPO training, addressing challenges such as low training speed and complex cluster configuration [3][10].
- Asynchronous sampling lets sampling and training proceed simultaneously, significantly reducing training time compared to synchronous methods [4][5] (see the async-sampling sketch below).

Group 2: Sampling Efficiency

- Sampling time is a critical bottleneck in GRPO training, and a single inference instance is often insufficient for larger models [3].
- By allowing multiple instances to sample in a data-parallel fashion, the SWIFT framework can allocate resources more effectively and improve sampling throughput [3] (a prompt-sharding sketch appears below).
- Experiments show that asynchronous sampling reduces training time to about two-thirds of that with synchronous sampling [5].

Group 3: Multi-Round Updates

- Multi-round updates reuse each batch of sampled data across several gradient iterations, balancing resources between sampling and training [11][12] (a sketch of the clipped multi-iteration update follows).
- Choosing an appropriate number of update iterations can significantly raise training speed without hurting model performance [11][14].

Group 4: Performance Comparison

- In comparative tests, the SWIFT framework completed a training step in roughly 120 seconds, outperforming frameworks such as veRL and TRL [18].
- The acceleration techniques integrated into SWIFT yield significant GRPO training-efficiency gains on small and medium clusters [18].

Group 5: Multi-Modal GRPO Training

- The SWIFT framework supports multi-modal GRPO training, accommodating data types such as images, video, and audio [20].
- Tested on the CLEVR-70k-Counting dataset, the framework achieved high accuracy on multi-modal counting tasks [20][22] (a sketch of a counting-accuracy reward appears below).

Group 6: Evaluation Framework

- EvalScope is introduced as a comprehensive evaluation tool for large models, providing performance assessment and visualization capabilities [23] (a hedged usage sketch follows the other examples).
- It also diagnoses underthinking and overthinking in reasoning models, helping them reach correct answers more efficiently [23][27].

Group 7: Conclusion and Future Directions

- SWIFT aims to give developers a differentiated technical path for RL training, with ongoing support across training domains [26][27].
- Future work will focus on the thinking efficiency of reasoning models and the emerging paradigm of multi-modal reasoning [27].
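The sketches below illustrate, in minimal Python, the mechanisms summarized above; none of them is SWIFT's actual implementation, and all function and variable names are chosen for exposition. First, the core simplification behind GRPO (Group 1): instead of a learned value model as in PPO, each completion's advantage is its reward normalized against the statistics of its own sampling group.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages in the style of GRPO (illustrative sketch).

    `rewards` has shape (num_prompts, group_size): one scalar reward per
    sampled completion. Normalizing against the group's own mean and std
    replaces PPO's learned value model.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.5, 0.5, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

Because the baseline comes from the group itself, the only extra cost over plain sampling is generating several completions per prompt, which is why sampling throughput dominates GRPO training time.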
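Next, the asynchronous sampling idea from Group 1: overlap rollout generation with training so neither phase idles. The producer-consumer sketch below uses a plain thread and queue as stand-ins; SWIFT's real mechanism (e.g., dedicated inference instances) is more elaborate.

```python
import queue
import threading

def sampler(rollout_queue: queue.Queue, num_steps: int) -> None:
    """Producer: generate rollouts with the (slightly stale) policy weights."""
    for step in range(num_steps):
        batch = f"rollout-batch-{step}"  # stand-in for model.generate(...)
        rollout_queue.put(batch)         # blocks if the trainer falls behind
    rollout_queue.put(None)              # sentinel: no more data

def trainer(rollout_queue: queue.Queue) -> None:
    """Consumer: update the policy on each batch while the next is sampled."""
    while (batch := rollout_queue.get()) is not None:
        print(f"training on {batch}")    # stand-in for the GRPO update

q = queue.Queue(maxsize=1)  # keep at most one batch in flight
t = threading.Thread(target=sampler, args=(q, 3))
t.start()
trainer(q)
t.join()
```

With `maxsize=1`, the sampler runs at most one batch ahead of the trainer: the one-step staleness that asynchronous sampling accepts in exchange for overlapping the two phases, consistent with the roughly one-third time savings reported in Group 2.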
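For Group 2's data-parallel sampling, the essential operation is splitting a prompt batch across several inference instances; a trivial round-robin shard (again, purely illustrative) looks like this:

```python
def shard_prompts(prompts: list[str], num_instances: int) -> list[list[str]]:
    """Round-robin split of a prompt batch across sampling instances."""
    return [prompts[i::num_instances] for i in range(num_instances)]

print(shard_prompts(["p0", "p1", "p2", "p3", "p4"], 2))
# [['p0', 'p2', 'p4'], ['p1', 'p3']]
```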
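Group 3's multi-round updates reuse one batch of rollouts for several gradient steps; a PPO-style clipped surrogate keeps the increasingly off-policy data from pushing the policy too far. The toy example below uses a one-parameter "policy" so it runs end to end; it shows the update structure, not SWIFT's training loop.

```python
import torch

# Toy policy: the log-prob of each sampled completion is param * feature.
param = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)

features = torch.tensor([0.5, -1.0, 2.0])    # one per sampled completion
advantages = torch.tensor([1.2, -0.8, 0.3])  # e.g., from group normalization
logprobs_old = (param * features).detach()   # recorded at sampling time

clip_eps, num_iterations = 0.2, 4            # num_iterations > 1 => multi-round
for _ in range(num_iterations):
    logprobs_new = param * features          # recomputed under current policy
    ratio = torch.exp(logprobs_new - logprobs_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Clipped surrogate: reused data cannot over-update the policy.
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(param.item())
```

The trade-off named in Group 3 falls out directly: more iterations per batch amortize expensive sampling over more gradient steps, while the clip bounds how far each reused batch can move the policy.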
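For the multi-modal counting task in Group 5, a verifiable outcome reward is a natural fit: compare the number the model states against the ground-truth object count. The binary reward below is a plausible shape for such a task, not necessarily the exact reward used in the CLEVR-70k-Counting experiments.

```python
import re

def counting_reward(completion: str, ground_truth: int) -> float:
    """Binary accuracy reward for a counting task (illustrative sketch).

    Extracts the last integer in the model's completion and compares it
    to the ground-truth object count; 1.0 on a match, else 0.0.
    """
    numbers = re.findall(r"-?\d+", completion)
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == ground_truth else 0.0

print(counting_reward("I count 3 cubes and 2 spheres, so 5 objects.", 5))  # 1.0
print(counting_reward("There appear to be four objects.", 4))              # 0.0
```

Binary verifiable rewards like this pair well with GRPO's group normalization, since the group baseline turns raw 0/1 outcomes into informative relative advantages.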
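Finally, a quick-start sketch for the EvalScope evaluation framework from Group 6. It assumes the `TaskConfig`/`run_task` Python entry points described in EvalScope's documentation; the import path, argument names, and the model id used here are assumptions that may vary across versions.

```python
# Hedged sketch of an EvalScope run; entry points assumed from its docs.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',  # hypothetical model id for illustration
    datasets=['gsm8k'],                # benchmark(s) to score the model on
    limit=10,                          # small subset for a quick sanity check
)
run_task(task_cfg=task_cfg)
```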