Core Viewpoint
- The article discusses Huawei's advances in AI training, achieved by optimizing training for the Mixture of Experts (MoE) model architecture, which improves efficiency and reduces the cost of training large AI models [1][34].

Group 1: MoE Model and Its Challenges
- The MoE model has become a preferred path for tech giants building stronger AI systems, with its architecture addressing the computational bottlenecks of large-scale model training (a minimal routing sketch appears after this summary) [2].
- Huawei identifies two main obstacles to improving single-node training efficiency: low operator computation efficiency and insufficient NPU memory [6][7].

Group 2: Enhancements in Training Efficiency
- By pairing Ascend NPUs with Kunpeng CPUs, Huawei has significantly improved training operator computation efficiency and memory utilization, reporting a 20% increase in throughput and a 70% reduction in memory usage [3][18].
- The article highlights three optimization strategies for core MoE operators: a "Slimming Technique" for FlashAttention, a "Balancing Technique" for MatMul, and a "Transport Technique" for Vector operators, which together yield a 15% increase in overall training throughput [9][10][13].

Group 3: Operator Dispatch Optimization
- Huawei's optimizations reduce operator-dispatch waiting time to nearly zero, improving the utilization of available compute [19][25].
- The Selective R/S memory optimization technique cuts the memory needed for activation values during training by 70%, showcasing Huawei's approach to memory management (a generic recompute sketch appears after this summary) [26][34].

Group 4: Industry Implications
- Huawei's advances not only clear obstacles to large-scale MoE model training but also provide a reference path for the industry, demonstrating the company's deep technical accumulation in AI computing [34].
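To make the MoE architecture referenced in Group 1 concrete, below is a minimal sketch of a token-routed MoE layer in PyTorch. It illustrates the general technique only, not Huawei's implementation; the class name `SimpleMoELayer`, the layer sizes, and the top-2 routing are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Minimal top-k token-routed MoE layer (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router scores every token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is a small feed-forward network; only the experts a token
        # is routed to run for that token, which is why MoE grows parameter
        # count much faster than per-token compute.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gate_logits = self.router(x)                            # (tokens, experts)
        weights, expert_ids = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, k] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out

# Example: 16 tokens, 8 experts, each token sent to its top-2 experts.
tokens = torch.randn(16, 512)
layer = SimpleMoELayer(d_model=512, d_ff=2048, num_experts=8, top_k=2)
print(layer(tokens).shape)  # torch.Size([16, 512])
```

The per-expert masking loop also hints at why the article focuses on operator efficiency and dispatch: uneven token-to-expert assignment leaves some experts (and the hardware running them) underutilized, which is exactly the kind of bottleneck the summarized optimizations target.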
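The 70% activation-memory reduction attributed to Selective R/S in Group 3 is stated only at a high level. The sketch below shows the generic idea of selectively recomputing, rather than storing, activations, using PyTorch's `torch.utils.checkpoint`. Reading the "R" as recomputation and the "S" as swapping activations to host memory is an assumption, as is the every-other-block policy; this is not Huawei's Selective R/S implementation.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """A toy transformer-style block; stands in for any activation-heavy layer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.ff(self.norm(x))

class SelectiveRecomputeStack(nn.Module):
    """Recompute activations only for selected blocks.

    Not storing a block's intermediate activations cuts memory at the cost of
    re-running that block in the backward pass; applying this selectively
    (instead of to every block) is the spirit of a recompute-based memory
    optimization.
    """
    def __init__(self, d_model: int, num_blocks: int, recompute_every: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(Block(d_model) for _ in range(num_blocks))
        self.recompute_every = recompute_every

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if i % self.recompute_every == 0:
                # Drop this block's intermediate activations now; rebuild them
                # during the backward pass instead.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

model = SelectiveRecomputeStack(d_model=512, num_blocks=8, recompute_every=2)
x = torch.randn(4, 128, 512, requires_grad=True)
model(x).sum().backward()   # backward re-runs the checkpointed blocks
```

A swap-based variant would instead offload stored activations to host (CPU) memory and prefetch them before backward; that path is not shown here, and how Huawei combines recompute and swap is not detailed in the summary.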
Overhauling Large Model Training: Huawei Lands an Ascend + Kunpeng Combination Punch
虎嗅APP·2025-06-04 10:35