MoE Training

Ascend + Kunpeng join forces: Huawei overhauls MoE training, throughput up another 20%, memory down 70%
华尔街见闻· 2025-06-04 11:01
Core Insights
- Huawei has introduced new solutions for MoE training systems, achieving a 20% increase in system throughput and a 70% reduction in memory usage through three core operator optimizations [1][4][33]

Group 1: MoE Training System Enhancements
- MoE has become a preferred path for tech giants toward more powerful AI [2]
- As long as the scaling law holds, the parameter scale of large models will continue to expand, raising AI intelligence levels [3]
- Huawei's previous Adaptive Pipe & EDPB framework improved distributed computing efficiency; the latest advancements further enhance training operator efficiency and memory utilization [4][5]

Group 2: Challenges in MoE Training
- MoE model training faces significant challenges, particularly in single-node efficiency [6][7]
- Low operator computation efficiency and frequent interruptions caused by the expert routing mechanism hinder overall throughput [8][10]
- The models' massive parameter counts strain device memory, risking out-of-memory (OOM) errors during training [11][13][14]

Group 3: Solutions Proposed by Huawei
- Huawei has proposed a comprehensive solution to the challenges of MoE training [15]
- Ascend operator acceleration has delivered a 15% increase in training throughput; the core operators FlashAttention, MatMul, and Vector account for over 75% of total computation time [16][18]
- Three optimization strategies ("Slimming," "Balancing," and "Transporting") have been implemented to enhance computation efficiency [17]

Group 4: Specific Operator Optimizations
- FlashAttention optimization improved forward and backward performance by 50% and 30%, respectively [24]
- MatMul optimization increased Cube utilization by 10% through improved data transport strategies [28]
- Vector operator performance rose by over 300% thanks to fewer data transport operations [32]

Group 5: Collaboration Between Ascend and Kunpeng
- The Ascend-Kunpeng collaboration achieves near-zero waiting time for operator dispatch and a 70% reduction in memory usage [33]
- Operator dispatch optimization and Selective R/S "memory surgery" have been key to these improvements [33][43]
- Effective task binding and scheduling strategies add a further 4% to training throughput [42]

Group 6: Selective R/S Memory Optimization
- Selective R/S offers a customized approach to memory management, saving over 70% of activation memory during training [43]
- The technique combines fine-grained recomputation with adaptive memory management mechanisms [45][51]
- The overall strategy maximizes memory efficiency while minimizing additional computation time [52]

Group 7: Conclusion
- The deep Ascend-Kunpeng collaboration, together with operator acceleration and memory optimization, provides an efficient and cost-effective solution for MoE training [53]
- These advancements not only remove barriers to large-scale MoE model training but also offer a valuable reference path for the industry [54]
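The "fine-grained recomputation" behind Selective R/S trades compute for memory: instead of keeping every activation resident until the backward pass, selected activations are discarded after the forward pass and rebuilt from their saved inputs when the gradient needs them. A minimal NumPy sketch of that store-vs-recompute choice for a single Linear+ReLU layer (illustrative only; the class name and structure are assumptions, not Huawei's Selective R/S implementation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class CheckpointedLayer:
    """Linear + ReLU layer that can either store its activation
    (fast backward, high memory) or recompute it in backward
    (save memory, pay extra forward FLOPs).

    Hypothetical illustration of the store-vs-recompute tradeoff;
    not Huawei's Selective R/S code."""

    def __init__(self, w, recompute=False):
        self.w = w
        self.recompute = recompute
        self.x = None       # input is kept either way (needed for grad_w)
        self.act = None     # stored activation, only if not recomputing

    def forward(self, x):
        self.x = x
        act = relu(x @ self.w)
        if not self.recompute:
            self.act = act  # "store" choice: activation stays resident
        return act

    def backward(self, grad_out):
        # "recompute" choice: rebuild the activation from the saved input
        act = relu(self.x @ self.w) if self.recompute else self.act
        grad_pre = grad_out * (act > 0)   # ReLU gradient mask
        grad_w = self.x.T @ grad_pre
        grad_x = grad_pre @ self.w.T
        return grad_x, grad_w
```

Selective R/S reportedly extends this per-tensor choice with swapping (offloading activations to host memory) under an adaptive policy; whichever choice is made for a given tensor, the computed gradients are identical, and only memory footprint and time differ.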
Ascend + Kunpeng dual-core strike: Huawei clears MoE training's bottlenecks for another 20% speedup and 70% memory savings
雷峰网· 2025-06-04 09:31
Remarkably, the results show that MoE training throughput has improved by another 20% over the previous baseline, while memory usage has dropped by 70%. This is not just a technical breakthrough; it sets the direction for MoE training. "Every breakthrough in Pangu Ultra MoE reflects Huawei's leading strength in foundational AI technology and its engineering deployment."

By Li Xi

Recently, Huawei unveiled new operator and memory optimizations for its MoE training system: all three core operators have been accelerated across the board, system throughput is up another 20%, and Selective R/S delivers 70% memory savings.

On the road to more powerful AI, MoE has become another preferred path for tech giants. As long as the scaling law holds, the parameter scale of large models will keep expanding, and only then can AI's level of intelligence keep climbing.

Thanks to its distinctive architecture, MoE is reaching unprecedented parameter scales and has become one of the key paths for breaking through the compute bottleneck of large-scale model training. Yet turning MoE's potential into efficient training practice has long been a hard problem the industry has been exploring.

Previously, Huawei's Adaptive Pipe & EDPB framework achieved efficient cluster-level distributed computing, letting communication and computation run fully in parallel and raising training-cluster efficiency. This time, through deep co-design of Ascend and Kunpeng compute, Huawei has further improved training operator computation ...
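The "distinctive architecture" referenced above is sparse expert routing: a gating network dispatches each token to only its top-k experts, so parameter count can grow far faster than per-token compute. A minimal top-k gating sketch (a generic illustration with assumed function name and shapes, not the Pangu Ultra MoE router):

```python
import numpy as np

def topk_gate(logits, k=2):
    """Sparse MoE top-k router sketch (hypothetical, for illustration).

    For router logits of shape (tokens, experts), each token is
    dispatched to its k highest-scoring experts; gate weights are
    the softmax over just those k logits, so they sum to 1 per token."""
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]            # (tokens, k)
    topk_logits = np.take_along_axis(logits, topk_idx, -1)    # (tokens, k)
    exp = np.exp(topk_logits - topk_logits.max(-1, keepdims=True))
    gates = exp / exp.sum(-1, keepdims=True)                  # (tokens, k)
    return topk_idx, gates
```

Unselected experts contribute nothing to a token's output, which is the source of MoE's compute savings; it is also why routing causes the load-imbalance and dispatch-interruption problems that the optimizations described above target.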