Workflow
RL post-training enters the supernode era: Huawei's breakthrough squeezes out every drop of compute, with one card doing two jobs
雷峰网·2025-06-05 09:17

Core Viewpoint

Reinforcement Learning (RL) post-training has become a crucial path for breaking through the performance ceiling of large language models (LLMs), and Huawei has introduced two key technologies to improve efficiency and resource utilization in this stage [2][3][56].

Group 1: RL Post-Training Challenges

- RL post-training currently consumes about 20% of the total compute in the training pipeline, a share projected to rise to 50%, with a significant impact on model performance and cost [3].
- Traditional RL post-training suffers from low resource utilization because training and inference tasks execute in alternation, leaving substantial compute idle [11][13].
- The popularity of Mixture of Experts (MoE) models has increased the complexity of task scheduling in large-scale clusters, making efficient collaboration difficult [15][16].

Group 2: Huawei's Innovations

- Huawei's "RL Fusion" technology allows a single card to handle both training and inference tasks, effectively doubling resource utilization and throughput [5][18]; a minimal colocation sketch follows this summary.
- The "StaleSync" mechanism takes a quasi-asynchronous approach, allowing different RL tasks to execute in parallel within a defined "staleness threshold," and raises horizontal scaling efficiency above 90% [29][32]; a bounded-staleness sketch also appears below.
- Together, RL Fusion and StaleSync significantly improve the efficiency of RL post-training, delivering a 1.5x increase in throughput [52][56].

Group 3: Performance Metrics

- With RL Fusion combined with StaleSync, throughput rises from 14.0k tokens/sec to 35.0k tokens/sec, a 150% improvement over the baseline configuration [54]; the arithmetic is checked at the end of this summary.
- In a multi-node setup, StaleSync enables near-linear scaling: throughput grows from 35k tokens/sec to 127k tokens/sec as the cluster grows from 1 to 4 nodes, a linearity of 91% [55].