RL Post-Training Enters the Supernode Era! Huawei's Breakthrough Squeezes Out Every Bit of Compute, with One Card Doing Two Jobs
21世纪经济报道 · 2025-06-05 11:03
Core Viewpoint
- Reinforcement Learning (RL) post-training has become a crucial method for breaking through the performance ceiling of large language models (LLMs), and Huawei has introduced two key technologies to enhance its efficiency and resource utilization [1][2][26].

Group 1: RL Post-Training Technologies
- RL post-training now consumes 20% of the total computational power in the training process, projected to rise to 50%, significantly impacting model performance and costs [1].
- Huawei's "RL Fusion" technology allows a single card to handle both training and inference tasks, doubling resource utilization and throughput [4][5].
- The "StaleSync" mechanism breaks strict synchronization constraints, achieving over 90% efficiency in cluster expansion and a 50% increase in training throughput [2][10].

Group 2: Challenges in RL Post-Training
- Traditional On-Policy algorithms require alternating between training and inference tasks, leading to significant resource idling, especially in large-scale clusters [3].
- Task-scheduling complexity has increased sharply with the popularity of Mixture of Experts (MoE) models, complicating resource utilization [4].

Group 3: Performance Improvements
- RL Fusion enables dynamic switching between training and inference modes on the same card, optimizing memory usage and enhancing efficiency [5][8].
- The combination of RL Fusion and StaleSync raises single-supernode throughput by 78.5%, for an overall performance improvement of 1.5x [22][24].
- StaleSync enables near-linear cluster scaling, with throughput increasing from 35k tokens/s to 127k tokens/s as the cluster grows from one to four supernodes, a linearity of 91% [24].
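The "dynamic switching between training and inference modes" that RL Fusion performs on a single card can be pictured as time-division colocation: the same device alternates between generating rollouts and applying gradient updates, instead of leaving half the cluster idle. The article does not publish RL Fusion's API, so the class and method names below are illustrative stand-ins, a minimal sketch of the scheduling idea only:

```python
# Hypothetical sketch of time-division train/inference colocation on one card.
# All names (FusionWorker, rollout, update) are illustrative, not Huawei's API.

class FusionWorker:
    """One device alternates between rollout (inference) and update (training)."""

    def __init__(self):
        self.mode = "train"
        self.replay = []  # rollouts buffered for the next training step

    def switch(self, mode):
        # In a real system this step would re-shard weights/optimizer state
        # and resize the KV cache; here it is just a mode flag.
        self.mode = mode

    def rollout(self, prompts):
        self.switch("infer")
        samples = [f"response-to-{p}" for p in prompts]  # stand-in for generation
        self.replay.extend(samples)
        return samples

    def update(self):
        self.switch("train")
        batch, self.replay = self.replay, []
        return len(batch)  # stand-in for one gradient step over the batch


worker = FusionWorker()
worker.rollout(["q1", "q2"])
assert worker.update() == 2  # both rollouts consumed by the training step
```

Because both phases live on every card, no card sits idle waiting for the other phase to finish elsewhere, which is the source of the claimed doubling of utilization.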
Group 4: Future Implications
- These advancements in RL post-training position Huawei's CloudMatrix 384 supernodes as a "super accelerator" for training large models, significantly enhancing speed and efficiency [2][26].
- The innovations in resource utilization and task parallelism are expected to drive the next generation of AI efficiency, marking a pivotal moment in the evolution of large-model training [26].
RL Post-Training Enters the Supernode Era! Huawei's Breakthrough Squeezes Out Every Bit of Compute, with One Card Doing Two Jobs
雷峰网 · 2025-06-05 09:17
Core Viewpoint
- Reinforcement Learning (RL) post-training has become a crucial path for breaking through the performance ceiling of large language models (LLMs), and Huawei has introduced two key technologies to enhance its efficiency and resource utilization [2][3][56].

Group 1: RL Post-Training Challenges
- RL post-training currently consumes 20% of the total computational power in the training process, projected to rise to 50%, significantly impacting model performance and costs [3].
- Traditional RL post-training suffers from low resource utilization because training and inference tasks execute in alternation, leading to substantial computational waste [11][13].
- Task scheduling in large-scale clusters has grown more complex with the popularity of Mixture of Experts (MoE) models, making efficient collaboration challenging [15][16].

Group 2: Huawei's Innovations
- Huawei's "RL Fusion" technology allows a single card to handle both training and inference tasks, effectively doubling resource utilization and throughput [5][18].
- The "StaleSync" mechanism takes a quasi-asynchronous approach, allowing different RL tasks to execute in parallel within a defined "staleness threshold" and improving horizontal scaling efficiency to over 90% [29][32].
- Together, RL Fusion and StaleSync significantly enhance RL post-training efficiency, achieving a throughput increase of 1.5x [52][56].

Group 3: Performance Metrics
- RL Fusion combined with StaleSync raises throughput from 14.0k tokens/sec to 35.0k tokens/sec, a 150% improvement over the baseline configuration [54].
- In a multi-node setup, StaleSync scales near-linearly: throughput increases from 35k tokens/sec to 127k tokens/sec as the node count grows from 1 to 4, a linearity of 91% [55].
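The "staleness threshold" at the heart of StaleSync bounds how asynchronous the parallel tasks may become: a contribution computed against old parameters is still accepted, but only if the parameters have not advanced past the threshold since. The article gives no implementation details, so the sketch below is a generic bounded-staleness aggregator with illustrative names, not Huawei's code:

```python
# Hypothetical sketch of quasi-asynchronous (bounded-staleness) aggregation.
# The exact accept/reject policy is an assumption; the article only states
# that tasks run in parallel within a defined "staleness threshold".

class StaleSyncAggregator:
    def __init__(self, staleness_threshold: int):
        self.threshold = staleness_threshold
        self.version = 0   # current global parameter version
        self.pending = []  # contributions waiting for the next step

    def submit(self, produced_at_version: int, contribution: float) -> bool:
        """Accept a contribution only if it is not too stale."""
        if self.version - produced_at_version > self.threshold:
            return False  # too stale: drop it (a real system might recompute)
        self.pending.append(contribution)
        return True

    def step(self) -> float:
        """Apply pending contributions and advance the parameter version."""
        total = sum(self.pending)
        self.pending.clear()
        self.version += 1
        return total


agg = StaleSyncAggregator(staleness_threshold=1)
agg.submit(0, 1.0)            # fresh: accepted
agg.step()                    # version -> 1
assert agg.submit(0, 1.0)     # staleness 1 <= threshold: still accepted
agg.step()                    # version -> 2
assert not agg.submit(0, 1.0) # staleness 2 > threshold: rejected
```

Because workers need not wait for a global barrier each step, adding nodes mostly adds throughput: the reported 35k to 127k tokens/sec from 1 to 4 nodes works out to 127 / (4 x 35) = ~91%, matching the article's linearity figure.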