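Making Reinforcement Learning Lightning-Fast: FlashRL Delivers Ultra-Fast Rollouts with a Single Command, Now Fully Open Source
机器之心 (Jiqizhixin) · 2025-08-12 09:51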

Core Viewpoint
- The article presents FlashRL, an open-source reinforcement learning solution that uses quantized rollouts without sacrificing downstream performance. It addresses the rollout-training mismatch by introducing Truncated Importance Sampling (TIS) [4][16][37].

Group 1: DAPO and Rollout Challenges
- DAPO, developed by Tsinghua AIR and ByteDance, is an open-source SOTA system for large-scale LLM reinforcement learning. It scores 50 on the AIME 2024 benchmark with the Qwen2.5-32B model [1].
- The research team identified rollout generation as a major bottleneck in reinforcement learning training, consuming approximately 70% of total training time [3].
- Applying 8-bit quantization during rollout generation, combined with TIS, significantly accelerates rollouts while maintaining downstream performance [3][4].

Group 2: FlashRL Implementation
- FlashRL is the first open-source reinforcement learning implementation to apply INT8/FP8 during the rollout phase while matching BF16 performance [4][15].
- TIS mitigates the rollout-training mismatch, allowing quantized-rollout training to match BF16 rollout training and even to surpass naive BF16 rollout training [16][37].
- FlashRL supports online quantization and integrates with existing inference engines such as vLLM, extending them to serve models whose parameters are updated during training [22].

Group 3: Performance and Acceleration
- FlashRL's INT8 rollout provides up to a 1.7x throughput improvement while retaining the advantages of reinforcement learning [23].
- In standard environments, the speedup from 8-bit quantization is more pronounced for larger models, reaching up to 1.75x over BF16 for the 32B model [29].
- In memory-constrained environments, INT8 quantization yields a generation speedup of more than 3x, highlighting its potential for larger models [34].

Group 4: Validation and Usage
- FlashRL was validated by training the DAPO-32B model: INT8 rollout significantly improves training speed without compromising accuracy on the AIME benchmark [36][37].
- FlashRL can be enabled with a single command, so users can integrate it into their RL training without code modifications [41].
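
To make the TIS idea from Group 2 concrete, here is a minimal sketch in plain Python. It assumes the usual reading of truncated importance sampling: each sampled token is reweighted by the ratio of its probability under the training policy to its probability under the (quantized) rollout policy, and that ratio is capped at a constant. The function names, the cap value of 2.0, and the simple REINFORCE-style surrogate are illustrative assumptions, not FlashRL's actual code or API.

```python
import math


def truncated_importance_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token weights min(pi_train / pi_rollout, cap).

    Inputs are log-probabilities of the sampled tokens under the training
    policy and under the rollout policy. `cap` is a hypothetical value.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [
        min(math.exp(lp_train - lp_rollout), cap)
        for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs)
    ]


def tis_surrogate_loss(train_logprobs, rollout_logprobs, advantages, cap=2.0):
    """Mean of -(weight * advantage * log pi_train) over tokens.

    In an autodiff framework the weight would be treated as a constant
    (detached), so gradients flow only through log pi_train.
    """
    if len(advantages) != len(train_logprobs):
        raise ValueError("advantages must match the number of tokens")
    weights = truncated_importance_weights(train_logprobs, rollout_logprobs, cap)
    terms = [w * a * lp for w, a, lp in zip(weights, advantages, train_logprobs)]
    return -sum(terms) / len(terms)


if __name__ == "__main__":
    # Toy token probabilities under each policy (made-up numbers).
    train = [math.log(0.5), math.log(0.9), math.log(0.2)]
    rollout = [math.log(0.5), math.log(0.3), math.log(0.4)]
    # Ratios are about 1.0, 3.0 and 0.5; the second is truncated to the cap.
    print(truncated_importance_weights(train, rollout, cap=2.0))
    print(tis_surrogate_loss(train, rollout, [1.0, 1.0, -1.0], cap=2.0))
```

The cap is what makes the estimator "truncated": without it, a token that the quantized rollout policy rates as very unlikely could receive an arbitrarily large weight and dominate the update.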
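
The figures in Groups 1 and 3 can be combined into a rough end-to-end estimate. The sketch below applies Amdahl's law under two assumptions that the summary itself does not state: that the reported 1.75x speedup applies to the entire rollout phase, and that rollout is exactly 70% of training time with the remaining 30% unchanged. The function name is hypothetical and the result is an illustration, not a measured number.

```python
def overall_speedup(rollout_fraction, rollout_speedup):
    """Amdahl's-law estimate of end-to-end training speedup.

    rollout_fraction: share of total training time spent in rollout (0..1).
    rollout_speedup:  factor by which the rollout phase alone gets faster.
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be between 0 and 1")
    if rollout_speedup <= 0:
        raise ValueError("rollout_speedup must be positive")
    remaining = (1.0 - rollout_fraction) + rollout_fraction / rollout_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Assumed inputs: 70% of time in rollout, rollout 1.75x faster.
    # 1 / (0.3 + 0.7 / 1.75) = 1 / 0.7, i.e. roughly 1.43x overall.
    print(round(overall_speedup(0.7, 1.75), 2))
```

Under these assumptions, a 1.75x faster rollout would shorten a training run by roughly 30%, which is why speeding up the rollout phase matters more than its raw multiplier suggests.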