Pangu Ultra MoE 718B Model
Ascend + Kunpeng team up for a big move! Huawei overhauls MoE training: throughput up another 20%, memory usage down 70%
华尔街见闻· 2025-06-04 11:01
Core Insights
- Huawei has introduced new solutions for MoE training systems, achieving a 20% increase in system throughput and a 70% reduction in memory usage through three core operator optimizations [1][4][33]

Group 1: MoE Training System Enhancements
- MoE has become a preferred path for tech giants toward more powerful AI [2]
- The scaling law indicates that, as long as it holds, the parameter scale of large models will continue to expand, enhancing AI intelligence levels [3]
- Huawei's previous Adaptive Pipe & EDPB framework improved distributed computing efficiency, and the latest advancements further enhance training operator efficiency and memory utilization [4][5]

Group 2: Challenges in MoE Training
- MoE model training faces significant challenges, particularly in single-node efficiency [6][7]
- Low operator computation efficiency and frequent interruptions caused by the expert routing mechanism hinder overall throughput [8][10]
- The need for extensive model parameters leads to memory constraints, risking out-of-memory (OOM) errors during training [11][13][14]

Group 3: Solutions Proposed by Huawei
- Huawei has proposed a comprehensive solution to address the challenges in MoE training [15]
- Ascend operator acceleration has led to a 15% increase in training throughput, with core operators such as FlashAttention, MatMul, and Vector accounting for over 75% of total computation time [16][18]
- Three optimization strategies, "Slimming," "Balancing," and "Transporting," have been implemented to enhance computation efficiency [17]

Group 4: Specific Operator Optimizations
- FlashAttention optimization has improved forward and backward performance by 50% and 30%, respectively [24]
- MatMul optimization has increased Cube utilization by 10% through enhanced data transport strategies [28]
- Vector operator performance has surged by over 300% due to reduced data transport times [32]

Group 5: Collaboration Between Ascend and Kunpeng
- The collaboration between Ascend and Kunpeng has achieved nearly zero waiting time for operator dispatch and a 70% reduction in memory usage [33]
- Innovations in operator dispatch optimization and Selective R/S "memory surgery" have been key to these improvements [33][43]
- Training throughput has been further enhanced by 4% through effective task binding and scheduling strategies [42]

Group 6: Selective R/S Memory Optimization
- The Selective R/S memory optimization technique allows for a customized approach to memory management, saving over 70% of activation memory during training (a minimal recomputation sketch follows this summary) [43]
- The technique combines fine-grained recomputation with adaptive memory management mechanisms to optimize memory usage [45][51]
- The overall strategy aims to maximize memory efficiency while minimizing additional computation time [52]

Group 7: Conclusion
- Huawei's deep collaboration between Ascend and Kunpeng, along with operator acceleration and memory optimization technologies, provides an efficient and cost-effective solution for MoE training [53]
- These advancements not only remove barriers to large-scale MoE model training but also offer valuable reference paths for the industry [54]
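The summary above attributes most of the memory saving to Selective R/S, which combines fine-grained recomputation with activation swapping. As a rough illustration of the recomputation half only, here is a minimal PyTorch sketch in which a per-block flag decides whether that block's activations are dropped and rebuilt during the backward pass; the block structure, flag, and module names are hypothetical, not Huawei's implementation.

```python
# Illustrative sketch only: selective recomputation ("R") of blocks whose activations
# are cheap to rebuild. Which blocks to recompute is a policy decision; here it is a flag.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class SelectiveRecomputeBlock(nn.Module):
    def __init__(self, dim: int, recompute: bool):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.recompute = recompute  # True: drop intermediate activations, rebuild them in backward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.recompute and self.training:
            # Activations inside self.ffn are not kept; they are recomputed during the
            # backward pass, trading a little extra compute for activation memory.
            return checkpoint(self.ffn, x, use_reentrant=False)
        return self.ffn(x)

x = torch.randn(8, 1024, requires_grad=True)
block = SelectiveRecomputeBlock(1024, recompute=True)
block.train()
block(x).sum().backward()  # runs with the FFN's intermediate activations discarded
```

A real policy would weigh each operator's recomputation cost against its activation size, which is the "customized" part the summary refers to; the sketch only shows the mechanism.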
A god's-eye-view "intelligent traffic system" for Ascend MoE training: Adaptive Pipe & EDPB lifts training efficiency by 70%
华尔街见闻· 2025-06-03 13:05
Core Viewpoint
- The rapid development of large models has made the Mixture of Experts (MoE) model a significant direction for expanding model capabilities due to its architectural advantages; however, training efficiency in distributed cluster environments remains a critical challenge [1][2].

Group 1: MoE Model Challenges
- The training efficiency of MoE models faces two main challenges: (1) expert parallelism introduces computation and communication waiting, especially at large model sizes, leaving computational units idle while waiting for communication [2][3]; (2) load imbalance means some experts are called frequently while others remain underutilized, causing further waiting among computational units [2].

Group 2: Optimization Solutions
- Huawei has developed an optimization solution called Adaptive Pipe & EDPB, which aims to eliminate waiting time in MoE training systems by improving communication and load balancing [3][10].
- The AutoDeploy simulation platform enables rapid analysis of diverse training loads and automatically identifies optimal strategies matched to cluster hardware specifications, with a 90% accuracy rate in modeling training performance [4].

Group 3: Communication and Load Balancing Innovations
- The Adaptive Pipe communication framework achieves over 98% communication masking, allowing computation to proceed without waiting for communication (the overlap pattern is sketched after this summary) [6][7].
- EDPB global load balancing enhances training efficiency by 25.5% by keeping expert scheduling balanced throughout training [10].

Group 4: Dynamic Load Balancing Techniques
- The team introduced expert dynamic migration technology, which intelligently moves experts between distributed devices based on predicted load trends, addressing load imbalance [12][14].
- A dynamic data rearrangement scheme minimizes computation time without sacrificing training accuracy, achieving load balance during pre-training [14].

Group 5: Overall System Benefits
- The combination of Adaptive Pipe & EDPB has delivered a 72.6% increase in end-to-end training throughput for the Pangu Ultra MoE 718B model, demonstrating significant improvements in training efficiency [17].
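To make "communication masking" concrete, the sketch below overlaps an asynchronous expert-parallel All-to-All with independent local computation using plain PyTorch distributed primitives. It illustrates the general overlap pattern under assumed tensor shapes and a torchrun/NCCL launch; it is not the Adaptive Pipe framework itself.

```python
# Minimal overlap sketch: launch the All-to-All asynchronously, do unrelated local work
# while it is in flight, and only wait once the overlap window is exhausted.
# Assumed launch: `torchrun --nproc_per_node=2 overlap_sketch.py` on a multi-GPU host.
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    tokens = torch.randn(world * 1024, 256, device=device)  # tokens to exchange with other "expert" ranks
    recv = torch.empty_like(tokens)
    local = torch.randn(2048, 256, device=device)            # independent local work to hide the comm behind

    handle = dist.all_to_all_single(recv, tokens, async_op=True)
    local_out = local @ local.T                               # computation that "masks" the communication
    handle.wait()

    print(f"rank {rank}: received {tuple(recv.shape)}, local result {tuple(local_out.shape)}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Reaching the reported 98%+ masking in practice requires scheduling many such overlaps across the pipeline, which is the part the framework automates.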
Experts slacking off half the time? Adaptive Pipe & EDPB boosts Ascend MoE training efficiency by 70%
雷峰网· 2025-06-03 07:17
Core Viewpoint
- The article discusses the challenges and solutions related to MoE training efficiency, highlighting that over half of the training time is wasted on waiting due to communication and load-imbalance issues [2][3][4].

Group 1: MoE Model Training Challenges
- The efficiency of MoE training clusters faces two main challenges: communication waiting caused by expert parallelism, and load imbalance leading to computation waiting [4].
- The communication waiting arises from the All-to-All communication needed when experts are split across devices, leaving computation units idle [4].
- Load imbalance occurs because some experts are called frequently while others remain underutilized, exacerbated by varying training-data lengths and differences in computational load across model layers [4].

Group 2: Solutions Implemented
- Huawei developed the Adaptive Pipe and EDPB optimization solutions to enhance MoE training efficiency, likening the system to a smart traffic hub that eliminates waiting [5][22].
- The AutoDeploy simulation platform enables rapid analysis and optimization of training loads, finding optimal strategies for given hardware specifications with 90% accuracy [8][22].
- The Adaptive Pipe communication framework achieves over 98% communication masking, allowing computation to proceed without waiting for communication [10][11].

Group 3: Performance Improvements
- The EDPB global load-balancing technique improves throughput by 25.5% by keeping expert scheduling balanced during training (a toy rebalancing heuristic is sketched after this summary) [14].
- End-to-end training throughput increased by 72.6% in Pangu Ultra MoE 718B training, demonstrating significant performance gains [22][23].
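To give a flavor of what "balanced expert scheduling" involves, here is a toy greedy rebalancing sketch (a longest-processing-time heuristic over per-expert token counts). The expert names, loads, and cost model are hypothetical; EDPB's actual migration algorithm and its prediction of load trends are not specified in this summary.

```python
# Toy rebalancing sketch: place the hottest experts first, always onto the currently
# lightest device, so no single device ends up hosting all the hot experts.
from collections import defaultdict

def rebalance(expert_load: dict[str, int], n_devices: int) -> dict[int, list[str]]:
    """Assign experts to devices so per-device load is as even as possible (LPT heuristic)."""
    placement: dict[int, list[str]] = defaultdict(list)
    device_load = [0] * n_devices
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        target = min(range(n_devices), key=device_load.__getitem__)
        placement[target].append(expert)
        device_load[target] += load
    return dict(placement)

# Hypothetical per-expert token counts from one training window:
load = {"e0": 900, "e1": 120, "e2": 880, "e3": 150, "e4": 500, "e5": 450}
print(rebalance(load, n_devices=3))  # hot experts e0 and e2 land on different devices
```

A production system would additionally weigh the cost of moving expert weights between devices against the expected gain, which this sketch ignores.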
MoE training sped up by a full 70%! Huawei needed only three moves
量子位· 2025-06-03 06:21
Under the scaling law, MoE (Mixture of Experts) has become the winning formula for major model vendors to expand model capability. But while MoE makes it efficient to scale up parameter counts, its training difficulties have become increasingly prominent: training efficiency is lacking, with more than half of the training time wasted on "waiting."

Now, to break through the MoE training bottleneck, Huawei has stepped in: it built an optimization scheme named Adaptive Pipe & EDPB that takes a "god's-eye view" of the "traffic-jammed" training cluster and lets MoE run smoothly with no waiting.

The hard problem of large-scale MoE training: more than half of the training time spent waiting?

Practice has shown that the efficiency of MoE training clusters faces challenges on two fronts.

First, expert parallelism introduces compute-waiting-for-communication. When the model is large, experts must be split across devices to form expert parallelism (EP), which introduces extra All-to-All communication. At the same time, most EP communication in the MoE layer has sequential dependencies with computation, so the usual serial execution pattern leaves large numbers of compute units idle, waiting on communication.

Second, load imbalance introduces compute-waiting-for-compute. The core of the MoE algorithm is "the most capable expert takes the work": during training, some hot experts get called frequently while cold experts see little use. Meanwhile, real training data varies in length, and different model layers ...
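The "most capable expert takes the work" routing described above is exactly what produces hot and cold experts. A minimal top-k gating sketch, with hypothetical sizes and a random gate rather than a trained one, shows how per-expert token counts end up uneven:

```python
# Rough sketch of top-k MoE routing: each token goes to the experts its gate scores
# highest, so per-expert token counts (and hence per-device compute) can be very uneven.
# Purely illustrative; this is not the Pangu Ultra MoE router.
import torch

def route(tokens: torch.Tensor, gate_w: torch.Tensor, k: int = 2):
    logits = tokens @ gate_w                           # [num_tokens, num_experts]
    scores, experts = logits.topk(k, dim=-1)           # each token picks its top-k experts
    weights = scores.softmax(dim=-1)                   # combine weights for the chosen experts
    counts = torch.bincount(experts.flatten(), minlength=gate_w.shape[1])
    return experts, weights, counts

tokens = torch.randn(4096, 512)
gate_w = torch.randn(512, 8)                           # 8 experts
_, _, counts = route(tokens, gate_w)
print(counts)  # uneven counts mean uneven work for the devices hosting each expert
```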
Huawei's AI strength! No GPUs needed: a large model digests an advanced-math problem every 2 seconds
第一财经· 2025-05-30 09:32
Core Viewpoint
- Huawei has achieved significant advances in large-model training through its "Ascend + Pangu Ultra MoE" combination, enabling a fully self-controlled training process without GPUs and showcasing industry-leading cluster training performance [2][3].

Group 1: Technical Innovations
- Huawei's training system has significantly improved training efficiency, with a pre-training model FLOPs utilization (MFU) of 41% and a post-training throughput of 35K tokens/s on the CloudMatrix 384 super node (a back-of-the-envelope MFU formula follows this summary) [3][34].
- The company has introduced a series of innovative solutions to address challenges in MoE pre-training and reinforcement learning (RL) post-training, including intelligent parallel-strategy selection and global dynamic load balancing [11][17].
- The training system uses a hierarchical All-to-All communication architecture that reduces communication overhead to nearly zero, enhancing expert-parallel communication efficiency [14][15].

Group 2: Training Process Optimization
- Cluster utilization has been optimized through a simulation-driven intelligent parallel-optimization framework that automates the selection of optimal deployment configurations [12][13].
- A memory optimization framework achieves over 70% savings in activation memory, ensuring reliable long-duration training even under increased memory pressure [25].
- RL Fusion technology allows flexible deployment modes, significantly improving resource scheduling during the inference phase and doubling utilization in RL post-training [27][28].

Group 3: Model Specifications
- The Pangu Ultra MoE model features 718 billion parameters and a 61-layer Transformer architecture designed for high sparsity and performance [32].
- Training used a cluster of 6K-10K Ascend 800T A2 cards, achieving a high model utilization rate during the pre-training phase [32].
- The architecture supports efficient scaling to larger parameter counts and clusters, with an MFU above 50% expected in future iterations [32].
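For readers who want to sanity-check a figure like 41% MFU, the usual back-of-the-envelope definition is achieved training FLOPs divided by the cluster's theoretical peak. The sketch below only demonstrates that formula; the active-parameter count, throughput, chip count, and per-chip peak FLOPs are placeholders, not published Ascend or Pangu Ultra MoE numbers.

```python
# Back-of-the-envelope MFU sketch. Roughly 6 FLOPs per active parameter per token for a
# forward+backward pass is the standard approximation; all inputs below are hypothetical.
def mfu(tokens_per_sec: float, active_params: float, num_chips: int, peak_flops_per_chip: float) -> float:
    achieved = 6.0 * active_params * tokens_per_sec   # training FLOPs actually executed per second
    available = num_chips * peak_flops_per_chip       # theoretical cluster peak
    return achieved / available

# Placeholder inputs: 30B active parameters, 2M tokens/s across 8,000 chips,
# 300 TFLOPS peak per chip. The printed value reflects these assumptions only.
print(f"MFU ~ {mfu(2.0e6, 30e9, 8000, 3.0e14):.1%}")
```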