RL Post-Training Enters the Supernode Era! Huawei's Cutting-Edge Tech Squeezes Every Drop of Compute: One Card, Two Jobs
21st Century Business Herald · 2025-06-05 11:03

Core Viewpoint
- Reinforcement learning (RL) post-training has become a crucial method for breaking through the performance ceiling of large language models (LLMs); Huawei has introduced two key technologies to improve efficiency and resource utilization in this process [1][2][26]

Group 1: RL Post-Training Technologies
- RL post-training now consumes 20% of total training compute, projected to rise to 50%, significantly affecting model performance and cost [1]
- Huawei's "RL Fusion" technology lets a single card handle training and inference tasks concurrently, doubling resource utilization and throughput [4][5]
- The "StaleSync" mechanism relaxes strict synchronization constraints, achieving over 90% cluster-scaling efficiency and a 50% increase in training throughput [2][10]

Group 2: Challenges in RL Post-Training
- The traditional on-policy algorithm must alternate between training and inference tasks, leaving significant resources idle, especially in large-scale clusters [3]
- Task-scheduling complexity has grown sharply with the popularity of Mixture of Experts (MoE) models, further complicating resource utilization [4]

Group 3: Performance Improvements
- RL Fusion enables dynamic switching between training and inference modes, optimizing memory usage and improving efficiency [5][8]
- Combining RL Fusion and StaleSync raised single-supernode throughput by 78.5%, for an overall performance gain of 1.5x [22][24]
- StaleSync delivers near-linear cluster scaling: throughput rises from 35k tokens/s to 127k tokens/s as the cluster grows from one to four supernodes, a linearity of 91% [24]
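As the article describes it, the StaleSync idea amounts to letting the trainer accept rollouts generated by slightly stale policy versions instead of forcing generation and training into strict lockstep. A minimal Python sketch of that bounded-staleness pattern, assuming a simple rollout buffer (all class names, fields, and the staleness bound here are hypothetical illustrations, not Huawei's actual API):

```python
from collections import deque

MAX_STALENESS = 2  # hypothetical bound: accept rollouts at most 2 policy versions old


class Rollout:
    """A generated trajectory tagged with the policy version that produced it."""

    def __init__(self, policy_version, tokens):
        self.policy_version = policy_version
        self.tokens = tokens


class StaleSyncTrainer:
    """Consumes rollouts without a hard barrier against the inference workers."""

    def __init__(self):
        self.policy_version = 0
        self.buffer = deque()

    def submit(self, rollout):
        # Inference workers push rollouts asynchronously; no synchronization
        # barrier with the training step is required.
        self.buffer.append(rollout)

    def train_step(self):
        # Accept only rollouts whose generating policy is within the bound;
        # stale ones are dropped rather than blocking the pipeline.
        fresh = [r for r in self.buffer
                 if self.policy_version - r.policy_version <= MAX_STALENESS]
        self.buffer.clear()
        if fresh:
            # A gradient update on the accepted rollouts would happen here.
            self.policy_version += 1
        return len(fresh)
```

The design point is that generation never waits for the optimizer: slightly off-policy data is tolerated up to a bound, trading a little freshness for much higher hardware occupancy, which is consistent with the throughput gains the article reports.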
Group 4: Future Implications
- The advancements in RL post-training technologies position Huawei's CloudMatrix 384 supernodes as a "super accelerator" for training large models, significantly enhancing speed and efficiency [2][26]
- The innovations in resource utilization and task parallelism are expected to drive the next generation of AI efficiency, marking a pivotal moment in the evolution of large-model training [26]
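The scaling-linearity figure quoted in Group 3 can be checked with simple arithmetic: linearity is measured throughput divided by ideal (perfectly linear) throughput.

```python
# Checking the reported cluster-scaling numbers: 35k tokens/s on one
# supernode grows to 127k tokens/s on four; the article quotes 91% linearity.
single_node = 35_000    # tokens/s on 1 supernode
four_node = 127_000     # tokens/s on 4 supernodes

ideal = single_node * 4           # perfect linear scaling would give 140k tokens/s
linearity = four_node / ideal     # 127k / 140k
print(f"{linearity:.0%}")         # → 91%
```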