Core Insights
- Huawei has introduced new solutions for MoE training systems, achieving roughly a 20% increase in system throughput and a 70% reduction in memory usage through core operator optimizations and Ascend-Kunpeng collaboration [1][4][33]

Group 1: MoE Training System Enhancements
- MoE has become a preferred path for tech giants pursuing more powerful AI [2]
- As long as the scaling law holds, the parameter scale of large models will keep expanding, raising AI capability [3]
- Huawei's earlier Adaptive Pipe & EDPB framework improved distributed computing efficiency, and the latest work further raises training operator efficiency and memory utilization [4][5]

Group 2: Challenges in MoE Training
- MoE model training faces significant challenges, particularly in single-node efficiency [6][7]
- Low operator computation efficiency and frequent dispatch interruptions caused by the expert routing mechanism hinder overall throughput (a minimal routing sketch appears after this summary) [8][10]
- The sheer scale of model parameters strains device memory, risking out-of-memory (OOM) errors during training [11][13][14]

Group 3: Solutions Proposed by Huawei
- Huawei has proposed a comprehensive solution to these MoE training challenges [15]
- Ascend operator acceleration alone yields a 15% increase in training throughput; core operators such as FlashAttention, MatMul, and Vector account for over 75% of total computation time [16][18]
- Three optimization strategies, "Slimming," "Balancing," and "Transporting," were applied to raise computation efficiency [17]

Group 4: Specific Operator Optimizations
- FlashAttention optimization improved forward and backward performance by 50% and 30%, respectively [24]
- MatMul optimization increased Cube utilization by 10% through better data transport strategies [28]
- Vector operator performance rose by over 300% thanks to fewer data transport operations [32]

Group 5: Collaboration Between Ascend and Kunpeng
- The Ascend-Kunpeng collaboration achieved nearly zero waiting time for operator dispatch and a 70% reduction in memory usage [33]
- Operator dispatch optimization and the Selective R/S "memory surgery" were key to these improvements [33][43]
- Training throughput was raised by a further 4% through effective task binding and scheduling strategies (see the core-binding sketch below) [42]

Group 6: Selective R/S Memory Optimization
- The Selective R/S technique enables a customized approach to memory management, saving over 70% of activation memory during training [43]
- It combines fine-grained recomputation with adaptive memory management to optimize memory usage (see the recomputation sketch below) [45][51]
- The overall strategy maximizes memory efficiency while minimizing the extra computation time introduced [52]

Group 7: Conclusion
- Huawei's deep Ascend-Kunpeng collaboration, together with operator acceleration and memory optimization, provides an efficient and cost-effective solution for MoE training [53]
- These advances not only remove barriers to large-scale MoE model training but also offer a valuable reference path for the industry [54]
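The dispatch-interruption problem noted in Group 2 stems from MoE's data-dependent expert routing: which experts a token visits is only known at runtime. The minimal top-k router below is a hypothetical PyTorch sketch — the dimensions, module names, and top-k value are assumptions for illustration, not Huawei's implementation — showing why expert assignments change every step and keep the host busy re-dispatching operators.

```python
# Minimal sketch of top-k expert routing in an MoE layer (illustrative only;
# names and sizes are assumptions, not Huawei's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, hidden_dim]
        logits = self.gate(x)                           # [num_tokens, num_experts]
        probs = F.softmax(logits, dim=-1)
        weights, expert_ids = probs.topk(self.top_k, dim=-1)
        # expert_ids are data-dependent, so the per-expert token counts (and
        # hence operator shapes) vary from step to step, which is what forces
        # frequent host-side dispatch in naive implementations.
        return weights, expert_ids

router = TopKRouter(hidden_dim=1024, num_experts=8, top_k=2)
tokens = torch.randn(16, 1024)
weights, expert_ids = router(tokens)
print(expert_ids.shape)  # torch.Size([16, 2])
```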
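Group 5's task-binding idea can be illustrated with a tiny Linux-only sketch: pinning the host process that issues operator dispatch to a fixed set of Kunpeng CPU cores so it is never migrated mid-dispatch. The core ids and the one-dispatch-process setup are assumptions for illustration; the article does not spell out Huawei's actual binding and scheduling policy.

```python
# Minimal sketch of pinning the current (dispatch) process to fixed CPU cores
# (Linux-only; core ids chosen here are assumptions for illustration).
import os

def bind_to_cores(core_ids):
    """Restrict the current process to the given cores so dispatch work stays local."""
    os.sched_setaffinity(0, set(core_ids))  # pid 0 == current process

# e.g. reserve cores 0-3 of the local NUMA node for dispatch work
bind_to_cores(range(4))
print("bound to cores:", os.sched_getaffinity(0))
```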
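Group 6's fine-grained recomputation is, in spirit, selective activation checkpointing: activations that are cheap to recompute but expensive to hold are dropped in the forward pass and rebuilt in the backward pass. The PyTorch sketch below shows the mechanism; which blocks are marked "cheap" is a hard-coded assumption here, whereas the Selective R/S strategy described in the article chooses them adaptively.

```python
# Minimal sketch of selective recomputation (activation checkpointing) applied
# only to designated "cheap-to-recompute" blocks. Illustrative only.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim: int, cheap: bool):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.cheap = cheap  # cheap-to-recompute blocks are checkpointing candidates

    def forward(self, x):
        if self.cheap and self.training:
            # Discard intermediate activations now, recompute them during backward.
            return checkpoint(self.ff, x, use_reentrant=False)
        return self.ff(x)

# Alternate which blocks are checkpointed (an arbitrary choice for illustration).
model = nn.Sequential(*[Block(256, cheap=(i % 2 == 0)) for i in range(4)])
model.train()
x = torch.randn(8, 256, requires_grad=True)
model(x).sum().backward()
```

The trade-off is exactly the one the summary describes: checkpointed blocks run their forward pass twice, so the selection should target activations whose memory cost is high relative to their recomputation cost.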
Ascend + Kunpeng join forces for a big move: Huawei overhauls MoE training, throughput surges another 20%, memory usage cut by 70%
华尔街见闻·2025-06-04 11:01