ScaleRL

Meta ran a 400,000-GPU-hour experiment just to pin down the scaling law of reinforcement learning
机器之心· 2025-10-19 09:17
Core Insights
- The article discusses advances in Reinforcement Learning (RL) scaling, emphasizing the need for a systematic approach to understanding how to scale RL algorithms effectively and what compute they require [2][3][4].

Group 1: Research Background
- Recent progress in RL has largely come from isolated studies of specific algorithms or models; the lack of a comprehensive scaling theory limits broader research participation [3].
- The study aims to establish a scientific foundation for RL scaling by borrowing concepts from the well-developed scaling laws of pre-training [3][4].

Group 2: Proposed Framework
- A predictive framework is introduced to characterize the relationship between RL performance and compute, using a sigmoid-like saturation curve to link expected reward with training compute [5][7].
- The framework lets researchers extrapolate performance at larger scales from smaller experiments, making it possible to evaluate the scalability of RL methods without exhausting computational budgets [7].

Group 3: ScaleRL Development
- ScaleRL is based on a systematic empirical study covering more than 400,000 GPU hours, exploring a range of design choices on an 8B-parameter model [8].
- Three key principles were identified: performance ceilings vary by method; methods that perform well at small scale may underperform at larger scale; and many techniques thought to raise peak performance primarily affect computational efficiency [10][11].

Group 4: Algorithmic Choices
- ScaleRL integrates existing methods rather than introducing new algorithms, combining an asynchronous Pipeline-RL structure, length-interruption mechanisms, and various loss functions to achieve predictable scaling [11][36].
- The study validates these design choices through leave-one-out experiments, showing that ScaleRL consistently outperforms existing RL configurations in both performance and efficiency [38].
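The sigmoid-like saturation curve linking expected reward to training compute can be sketched in code. Below is a minimal illustration, assuming a common saturating form reward(C) = R_max - (R_max - R_0) / (1 + (C / C_mid)^B); the exact parameterization and all numeric values here are hypothetical, not taken from the paper. The sketch fits the curve on small-budget runs and then extrapolates to a much larger budget:

```python
def sigmoid_reward(compute, r_max, c_mid, b, r0=0.0):
    """Saturating compute-performance curve: reward rises from r0
    toward the asymptote r_max as training compute grows."""
    return r_max - (r_max - r0) / (1.0 + (compute / c_mid) ** b)

# Synthetic "small-scale" runs: compute budgets (GPU-hours) and observed
# rewards. The true parameters are invented purely for illustration.
true_params = dict(r_max=0.80, c_mid=5_000.0, b=1.2)
budgets = [500, 1_000, 2_000, 4_000, 8_000]
observed = [sigmoid_reward(c, **true_params) for c in budgets]

def fit(budgets, observed):
    """Coarse grid search over (r_max, c_mid, b); a stand-in for a
    proper least-squares curve fit."""
    best, best_err = None, float("inf")
    for r_max in [x / 100 for x in range(50, 100)]:
        for c_mid in [1_000.0, 2_000.0, 5_000.0, 10_000.0]:
            for b in [0.8, 1.0, 1.2, 1.5]:
                err = sum((sigmoid_reward(c, r_max, c_mid, b) - y) ** 2
                          for c, y in zip(budgets, observed))
                if err < best_err:
                    best, best_err = (r_max, c_mid, b), err
    return best

r_max, c_mid, b = fit(budgets, observed)
# Extrapolate far beyond any budget used for fitting.
predicted = sigmoid_reward(400_000, r_max, c_mid, b)
print(f"fitted asymptote: {r_max:.2f}, predicted reward at 400k GPU-hours: {predicted:.3f}")
```

In practice one would use a real optimizer (e.g. least-squares fitting) rather than a grid search; the point is the workflow the framework enables: fit on cheap runs, predict performance at scale before spending the full compute budget.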
Group 5: Predictive Performance Insights
- The research investigates which scaling dimensions (context length, batch size, generation count per prompt, or model size) yield the most reliable performance improvements under fixed or growing compute budgets [39].
- Results indicate that larger batch sizes stabilize performance ceilings and avoid premature stagnation, while longer generation lengths can raise performance ceilings [42][47].

Group 6: Conclusion and Recommendations
- The findings establish a rigorous, quantifiable methodology for predicting the scalability of new RL algorithms, a significant contribution to RL for large language models [11][50].