One Card, Two Jobs: Huawei Sets Out to Squeeze Its Compute Dry
虎嗅APP·2025-06-05 14:24

Core Viewpoint
- The article discusses advancements in AI, focusing on Huawei's innovations in the MoE (Mixture of Experts) architecture and its RL (Reinforcement Learning) post-training techniques, which aim to raise the efficiency and performance of large language models (LLMs) in a competitive AI landscape [1][3].

Group 1: MoE Architecture and Huawei's Innovations
- The MoE model, originally proposed by Canadian scholars, has evolved significantly; Huawei's MoGE architecture addresses load-balancing inefficiencies in the traditional MoE model, cutting costs and simplifying training and deployment (a hedged routing sketch follows Group 6) [1].
- Huawei emphasizes building a collaborative ecosystem to foster the growth of the Ascend ecosystem in China [1].

Group 2: RL Post-Training Techniques
- RL post-training has emerged as a critical pathway to better LLM performance; models such as OpenAI's o1 and DeepSeek-R1 use it to improve reasoning on complex tasks [3][5].
- The RL post-training phase currently consumes 20% of total training compute, a share projected to rise to 50%, with a significant impact on model performance and cost [3].

Group 3: Challenges in RL Post-Training
- Traditional On-Policy algorithms create a "computational black hole": training and inference tasks execute in alternation, so whichever resource pool is not active sits idle and utilization suffers (illustrated in the second sketch after Group 6) [6][7].
- Task scheduling in large-scale clusters is complex, and the mix of parallel strategies in use makes efficient resource utilization harder still [8].

Group 4: Innovations in Resource Utilization
- Huawei's RL Fusion technology lets a single card handle both training and inference tasks, effectively doubling resource utilization and throughput [9][10].
- The StaleSync mechanism enables near-asynchronous execution of tasks, achieving over 90% horizontal-scaling efficiency across CloudMatrix 384 super nodes (a bounded-staleness sketch follows Group 6) [16][20].

Group 5: Performance Metrics and Results
- Combining RL Fusion and StaleSync yields a significant efficiency gain: single-node throughput improves by 78.5%, and overall performance by 1.5x [30][31].
- With StaleSync, cluster throughput scales nearly linearly from 35k tokens/s to 127k tokens/s as super nodes are added, demonstrating its effectiveness for scaling out (a sanity-check calculation follows Group 6) [32].

Group 6: Conclusion
- Huawei's advances in RL post-training represent a significant leap in AI efficiency, positioning the company as a key player in the next generation of AI technology [33].
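The article does not describe MoGE's mechanics, but the idea usually attributed to it is grouped routing: experts are partitioned into equal groups (one group per device), and the router activates the same number of experts in every group, so no device is overloaded. Below is a minimal, hypothetical sketch of that routing rule; the function name, shapes, and the group/expert counts are illustrative assumptions, not Huawei's implementation.

```python
# Hypothetical sketch of grouped expert routing in the spirit of MoGE
# (Mixture of Grouped Experts): experts split into equal groups, and the
# router picks top-k *within each group*, so every group (hence every
# device hosting one group) receives a balanced share of the load.
import numpy as np

def moge_route(router_logits: np.ndarray, n_groups: int, k_per_group: int):
    """Select top-k experts within each group instead of globally.

    router_logits: (n_experts,) router scores for one token.
    Returns indices of activated experts and their softmax weights.
    """
    n_experts = router_logits.shape[0]
    assert n_experts % n_groups == 0, "experts must split evenly into groups"
    group_size = n_experts // n_groups

    chosen = []
    for g in range(n_groups):
        group = router_logits[g * group_size:(g + 1) * group_size]
        # Top-k inside this group only -- the load-balancing trick:
        # every group activates exactly k experts, no hot spots.
        top = np.argsort(group)[-k_per_group:] + g * group_size
        chosen.extend(top.tolist())

    chosen = np.array(chosen)
    weights = np.exp(router_logits[chosen])
    weights /= weights.sum()
    return chosen, weights

# Toy usage: 16 experts in 4 groups, 2 activated per group.
rng = np.random.default_rng(0)
idx, w = moge_route(rng.normal(size=16), n_groups=4, k_per_group=2)
print(idx, w.round(3))
```

The contrast with classic top-k MoE is that a global top-k can send most tokens to a few hot experts concentrated on one device, whereas per-group top-k fixes the per-device load by construction.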
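To make the "computational black hole" of Group 3 and the RL Fusion claim of Group 4 concrete, here is a toy utilization model under invented phase durations: in a separated deployment, inference cards idle while training runs and vice versa, while a fused card that time-shares both roles stays busy for the whole step. The numbers are made up for illustration, and RL Fusion's actual mechanisms (weight resharding, memory management, etc.) are not modeled.

```python
# Back-of-the-envelope model of why alternating on-policy phases waste
# compute, and how co-locating both phases on one card ("one card, two
# jobs") recovers the idle time. Durations below are assumptions.

GEN_TIME = 6.0    # hypothetical seconds per step generating rollouts
TRAIN_TIME = 4.0  # hypothetical seconds per step on gradient updates

# Separated deployment: dedicated inference cards and training cards.
# While one pool works, the other idles -- the "computational black hole".
step = GEN_TIME + TRAIN_TIME
print(f"separated: inference cards {GEN_TIME / step:.0%} busy, "
      f"training cards {TRAIN_TIME / step:.0%} busy")

# Fused deployment: the same card switches roles within a step, so it is
# busy 100% of the time, and the second pool it frees up can run another
# replica of the pipeline.
print("fused: one card 100% busy; the freed pool runs a second replica, "
      "so effective utilization roughly doubles")
```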
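StaleSync's "near-asynchronous" execution suggests a bounded-staleness rule: workers do not wait at a global barrier, and their contributions count as long as they were computed against a recent-enough model version. The sketch below is a generic bounded-staleness aggregator, with an assumed staleness threshold and a scalar standing in for gradient tensors; it is not Huawei's code.

```python
# Minimal sketch of a bounded-staleness ("quasi-asynchronous") update rule
# in the spirit of StaleSync. Threshold and update rule are assumptions.
from dataclasses import dataclass

@dataclass
class Contribution:
    worker_id: int
    base_version: int   # model version the worker started from
    grad: float         # stand-in for a gradient tensor

class StaleSyncAggregator:
    def __init__(self, max_staleness: int = 2, lr: float = 0.1):
        self.version = 0
        self.weights = 0.0            # stand-in for model parameters
        self.max_staleness = max_staleness
        self.lr = lr

    def submit(self, c: Contribution) -> bool:
        staleness = self.version - c.base_version
        if staleness > self.max_staleness:
            # Too stale: reject so the worker refreshes and recomputes,
            # keeping the accuracy penalty of asynchrony bounded.
            return False
        self.weights -= self.lr * c.grad   # apply immediately, no barrier
        self.version += 1
        return True

agg = StaleSyncAggregator(max_staleness=2)
for c in [Contribution(0, 0, 1.0), Contribution(1, 0, 0.5),
          Contribution(2, 0, 0.8), Contribution(3, 0, 0.3)]:
    staleness = agg.version - c.base_version
    print(f"worker {c.worker_id}: staleness {staleness}, "
          f"accepted={agg.submit(c)}")
```

Rejecting only overly stale work removes the global barrier that forces fast workers to wait on slow ones, which is consistent with the >90% horizontal-scaling efficiency the article reports.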
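Finally, a quick sanity check of the Group 5 scaling figures, under the assumption (not stated in the article) that 35k tokens/s is the single-super-node baseline and 127k tokens/s is reached at 4 super nodes:

```python
# Sanity check of the reported cluster-scaling numbers. Node counts are
# an assumption; the article gives only the throughput endpoints.
base, scaled, nodes = 35_000, 127_000, 4
speedup = scaled / base          # ~3.63x
efficiency = speedup / nodes     # fraction of ideal linear scaling
print(f"speedup {speedup:.2f}x over {nodes} nodes -> "
      f"{efficiency:.0%} linear-scaling efficiency")  # ~91%, consistent
                                                      # with the >90% claim
```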