Pangu Ultra MoE 718B model
Ascend and Kunpeng join forces: Huawei overhauls MoE training, with throughput up another 20% and memory use down 70%
Wallstreetcn · 2025-06-04 11:01
Core Insights
- Huawei has introduced new solutions for MoE training systems, achieving a 20% increase in system throughput and a 70% reduction in memory usage through three core operator optimizations [1][4][33]

Group 1: MoE Training System Enhancements
- MoE has become a preferred path for tech giants towards more powerful AI [2]
- The scaling law indicates that as long as it holds, the parameter scale of large models will continue to expand, enhancing AI intelligence levels [3]
- Huawei's previous Adaptive Pipe & EDPB framework improved distributed computing efficiency, and the latest advancements further enhance training operator efficiency and memory utilization [4][5]

Group 2: Challenges in MoE Training
- MoE model training faces significant challenges, particularly in single-node efficiency [6][7]
- Low operator computation efficiency and frequent interruptions caused by the expert routing mechanism hinder overall throughput [8][10]
- The model's enormous parameter count leads to memory constraints, risking out-of-memory (OOM) errors during training [11][13][14]

Group 3: Solutions Proposed by Huawei
- Huawei has proposed a comprehensive solution to address the challenges in MoE training [15]
- Ascend operator acceleration has led to a 15% increase in training throughput, with core operators such as FlashAttention, MatMul, and Vector accounting for over 75% of total computation time [16][18]
- Three optimization strategies ("Slimming," "Balancing," and "Transporting") have been implemented to enhance computation efficiency [17]

Group 4: Specific Operator Optimizations
- FlashAttention optimization has improved forward and backward performance by 50% and 30%, respectively [24]
- MatMul optimization has increased Cube utilization by 10% through enhanced data-transport strategies [28]
- Vector operator performance has surged by over 300% due to reduced data-transport time [32]

Group 5: Collaboration Between Ascend and Kunpeng
- The collaboration between Ascend and Kunpeng has achieved nearly zero waiting time for operator dispatch and a 70% reduction in memory usage [33]
- Innovations in operator-dispatch optimization and Selective R/S "memory surgery" have been key to these improvements [33][43]
- Training throughput has been further enhanced by 4% through effective task binding and scheduling strategies [42]

Group 6: Selective R/S Memory Optimization
- The Selective R/S memory-optimization technique allows a customized approach to memory management, saving over 70% of activation memory during training [43]
- The technique combines fine-grained recomputation with adaptive memory-management mechanisms to optimize memory usage [45][51]
- The overall strategy aims to maximize memory efficiency while minimizing additional computation time [52]

Group 7: Conclusion
- Huawei's deep collaboration between Ascend and Kunpeng, along with operator acceleration and memory-optimization technologies, provides an efficient and cost-effective solution for MoE training [53]
- These advancements not only remove barriers to large-scale MoE model training but also offer valuable reference paths for the industry [54]
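The Selective R/S idea in Group 6 amounts to a cost-based selection problem: for each activation tensor, either keep it resident, Recompute it during the backward pass, or Swap it to host memory, trading extra time for freed device memory. Below is a minimal illustrative sketch of such a planner; the tensor names, sizes, and costs are hypothetical placeholders, not Huawei's actual mechanism, which the article does not detail.

```python
# Sketch of a Selective R/S-style choice: for each activation tensor, either
# keep it in memory, Recompute it in backward, or Swap it to host memory.
# Greedily free the cheapest-per-MB tensors until the activation-memory
# budget is met. All numbers are hypothetical.

def select_rs(tensors, budget_mb):
    """tensors: list of dicts with 'name', 'mem_mb', 'recompute_ms', 'swap_ms'.
    Returns (plan, resident_mb): plan maps name -> 'keep'/'recompute'/'swap'."""
    plan = {t["name"]: "keep" for t in tensors}
    resident = sum(t["mem_mb"] for t in tensors)
    # Consider tensors with the lowest overhead per MB freed first;
    # recompute and swap compete per tensor.
    candidates = sorted(
        tensors,
        key=lambda t: min(t["recompute_ms"], t["swap_ms"]) / t["mem_mb"],
    )
    for t in candidates:
        if resident <= budget_mb:
            break
        plan[t["name"]] = (
            "recompute" if t["recompute_ms"] <= t["swap_ms"] else "swap"
        )
        resident -= t["mem_mb"]
    return plan, resident

acts = [
    {"name": "attn_probs", "mem_mb": 512, "recompute_ms": 4.0, "swap_ms": 9.0},
    {"name": "ffn_hidden", "mem_mb": 1024, "recompute_ms": 12.0, "swap_ms": 6.0},
    {"name": "layernorm", "mem_mb": 128, "recompute_ms": 0.5, "swap_ms": 2.0},
]
plan, resident = select_rs(acts, budget_mb=600)
print(plan, resident)
```

The per-tensor choice between recomputation and swapping is what makes the approach "fine-grained": cheap-to-recompute tensors are recomputed, while expensive ones are offloaded instead.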
A bird's-eye "smart traffic system" for Ascend MoE training: Adaptive Pipe & EDPB raise training efficiency by 70%
Wallstreetcn · 2025-06-03 13:05
Core Viewpoint
- The rapid development of large models has made the Mixture of Experts (MoE) model a significant direction for expanding model capabilities due to its unique architectural advantages. However, training efficiency in distributed cluster environments remains a critical challenge [1][2].

Group 1: MoE Model Challenges
- The training efficiency of MoE models faces two main challenges: (1) Expert parallelism introduces computational and communication waiting times, especially when the model is large, leaving computational units idle while they wait for communication [2][3]. (2) Load imbalance results in some experts being frequently called while others remain underutilized, causing further waiting among computational units [2].

Group 2: Optimization Solutions
- Huawei has developed an optimization solution called Adaptive Pipe & EDPB, which aims to eliminate waiting times in MoE training systems by improving communication and load balancing [3][10].
- The AutoDeploy simulation platform allows rapid analysis of diverse training loads and automatically identifies optimal strategies matched to cluster hardware specifications, modeling training performance with 90% accuracy [4].

Group 3: Communication and Load Balancing Innovations
- The Adaptive Pipe communication framework achieves over 98% communication masking, allowing computations to proceed without waiting for communication [6][7].
- EDPB global load balancing enhances training efficiency by 25.5% by ensuring balanced expert scheduling during the training process [10].

Group 4: Dynamic Load Balancing Techniques
- The team introduced expert dynamic migration technology, which intelligently moves experts between distributed devices based on predicted load trends, addressing load-imbalance issues [12][14].
- A dynamic data-rearrangement scheme was proposed to minimize computation time without sacrificing training accuracy, achieving load balancing during pre-training [14].

Group 5: Overall System Benefits
- The combination of Adaptive Pipe & EDPB has led to a 72.6% increase in end-to-end training throughput for the Pangu Ultra MoE 718B model, demonstrating significant improvements in training efficiency [17].
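The expert dynamic migration described in Group 4 can be pictured as a greedy rebalancing loop: given predicted per-expert loads, repeatedly move a light expert off the busiest device whenever that lowers the peak load. The following sketch uses hypothetical loads and a two-device placement; Huawei's actual migration policy (and its cost model) is not described at this level of detail in the article.

```python
# Greedy sketch of predicted-load expert migration: move the lightest expert
# off the busiest device while doing so strictly reduces the maximum device
# load. Expert loads and the initial placement are hypothetical.

def rebalance(loads, placement, devices, max_moves=10):
    """loads: expert -> predicted token count; placement: expert -> device id."""
    for _ in range(max_moves):
        per_dev = {d: 0 for d in range(devices)}
        for e, d in placement.items():
            per_dev[d] += loads[e]
        hot = max(per_dev, key=per_dev.get)
        cold = min(per_dev, key=per_dev.get)
        movable = [e for e, d in placement.items() if d == hot]
        if not movable:
            break
        expert = min(movable, key=lambda e: loads[e])
        # Only migrate if it lowers the hottest device's load without
        # making the cold device the new bottleneck.
        if per_dev[cold] + loads[expert] >= per_dev[hot]:
            break
        placement[expert] = cold
    return placement

loads = {"e0": 90, "e1": 10, "e2": 30, "e3": 20}
placement = {"e0": 0, "e1": 0, "e2": 1, "e3": 1}
placement = rebalance(loads, placement, devices=2)
print(placement)
```

In practice such a scheme would also weigh the cost of transferring expert weights against the predicted benefit, which is why the articles describe migration as triggered only when the pre-evaluated benefit is positive.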
Are the experts slacking off half the time? Adaptive Pipe & EDPB boost Ascend MoE training efficiency by 70%
Leiphone · 2025-06-03 07:17
Core Viewpoint
- The article discusses the challenges and solutions related to the training efficiency of Mixture of Experts (MoE) models, highlighting that over half of the training time is wasted on waiting due to communication and load-imbalance issues [2][3][4].

Group 1: MoE Model Training Challenges
- The efficiency of MoE model training clusters faces two main challenges: communication waiting due to expert parallelism, and load imbalance leading to computation waiting [4].
- The communication waiting arises from the All-to-All communication required when experts are split across devices, leaving computation units idle [4].
- Load imbalance occurs as some experts are frequently called while others remain underutilized, exacerbated by varying lengths of training data and differences in computational loads across model layers [4].

Group 2: Solutions Implemented
- Huawei developed the Adaptive Pipe and EDPB optimization solutions to enhance MoE training efficiency, likening the system to a smart traffic hub that eliminates waiting [5][22].
- The AutoDeploy simulation platform allows rapid analysis and optimization of training loads, achieving 90% accuracy in finding optimal strategies for the given hardware specifications [8][22].
- The Adaptive Pipe communication framework achieves over 98% communication masking, allowing computations to proceed without waiting for communication [10][11].

Group 3: Performance Improvements
- The EDPB global load-balancing technique improves throughput by 25.5% by ensuring balanced expert scheduling during training [14].
- End-to-end training throughput increased by 72.6% in Pangu Ultra MoE 718B training, demonstrating significant performance gains [22][23].
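The ">98% communication masking" claim can be sanity-checked with back-of-envelope arithmetic: if the All-to-All transfer for one micro-batch is issued asynchronously and overlapped with the next micro-batch's computation, only the portion of each transfer that outlasts its overlapping compute (plus the final, un-overlappable transfer) sits on the critical path. The sketch below uses hypothetical timings, not measured Ascend numbers.

```python
# Back-of-envelope model of communication masking: each of the first n-1
# All-to-All transfers hides behind one micro-batch's compute; only the
# excess (and the final transfer) is exposed. Times in ms are hypothetical.

def masked_fraction(compute_ms, comm_ms, n_microbatches):
    """Fraction of total communication hidden behind compute."""
    total_comm = comm_ms * n_microbatches
    # First n-1 transfers expose only what outlasts the overlapped compute;
    # the last transfer has no following compute to hide behind.
    exposed = max(0.0, comm_ms - compute_ms) * (n_microbatches - 1) + comm_ms
    return 1.0 - exposed / total_comm

frac = masked_fraction(compute_ms=50.0, comm_ms=10.0, n_microbatches=64)
print(f"{frac:.1%}")  # → 98.4%
```

With per-transfer communication shorter than per-micro-batch compute, the masked fraction approaches 100% as the number of micro-batches grows, which is consistent with the ">98%" figure reported for long pipelines.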
MoE training sped up by a full 70%: Huawei needed only three tricks
QbitAI · 2025-06-03 06:21
Core Viewpoint
- The article discusses how the MoE (Mixture of Experts) model has become a key tool for model vendors to scale model capabilities under the Scaling Law, while also highlighting the training challenges associated with MoE, particularly inefficiencies that leave over half of the training time wasted on waiting [1][2].

Group 1: MoE Training Challenges
- The efficiency of MoE model training clusters faces two main challenges: communication and computation waiting due to expert parallelism, and load imbalance leading to additional waiting [4][7].
- As model size increases, experts must be split across different devices for parallel processing, which introduces extra All-to-All communication and leaves many computing units idle while waiting for it [5].
- The core of the MoE routing algorithm is winner-takes-all: hot experts are frequently called upon while cold experts sit underutilized, and varying computation loads across model layers cause further waiting [8].

Group 2: Huawei's Solutions
- Huawei has developed an optimization solution named Adaptive Pipe & EDPB to address the training bottlenecks of MoE, enabling smooth operation without waiting [3].
- The solution includes a communication-masking technology that decouples computation from communication, allowing calculations to proceed without waiting for data transfer [9].
- The dynamic expert routing feature adjusts the load dynamically based on real-time data distribution, achieving load balancing and eliminating communication bottlenecks [9].

Group 3: DeployMind Simulation Platform
- Huawei has created the DeployMind simulation platform, which can simulate millions of training scenarios in just one hour, allowing rapid analysis of diverse training loads and optimal strategy selection [10].
- This modeling framework has achieved 90% accuracy, enabling efficient parallel-strategy selection that balances computation, communication, and memory for the Pangu Ultra MoE 718B model [11].

Group 4: Communication Optimization
- The Adaptive Pipe framework masks over 98% of communication, allowing computations to proceed without waiting [12][19].
- Huawei's innovative two-step communication process reduces inter-machine communication, effectively doubling communication speed compared with traditional methods [15][16].

Group 5: Load Balancing and Throughput Improvement
- The EDPB (Expert Dynamic Prediction Balancing) technology addresses load imbalance in MoE training, achieving a 25.5% increase in throughput [21][22].
- EDPB features include predictive load-trend modeling, dual-layer optimization of computation and communication, and intelligent triggering of expert migration based on pre-evaluated benefits [23][24][25].
- The overall system achieved a 72.6% increase in end-to-end training throughput for the Pangu Ultra MoE 718B model, demonstrating the effectiveness of Huawei's optimization strategies [29][30].
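The "two-step communication" in Group 4 is commonly realized as a hierarchical All-to-All: data is first regrouped inside each node so that everything bound for the same remote node travels as one combined message, replacing many small cross-machine transfers with fewer large ones. The sketch below counts inter-node messages for a hypothetical topology; the article does not specify Huawei's exact scheme, so this is an assumption about the general technique.

```python
# Rough count of why a two-step (intra-node, then inter-node) All-to-All
# cuts cross-machine traffic: naively every rank messages every remote rank,
# whereas hierarchically each rank sends one combined message per remote
# node after an intra-node regroup. Topology numbers are hypothetical.

def inter_node_messages(nodes, ranks_per_node):
    ranks = nodes * ranks_per_node
    naive = ranks * (ranks - ranks_per_node)   # one message per remote rank
    hierarchical = ranks * (nodes - 1)         # one message per remote node
    return naive, hierarchical

naive, hier = inter_node_messages(nodes=8, ranks_per_node=8)
print(naive, hier, naive // hier)  # → 3584 448 8
```

Fewer, larger inter-node messages amortize per-message latency and make better use of the slower cross-machine links, which is the usual source of the reported speedup.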
Huawei's AI strength: no GPUs, yet the large model digests an advanced-math problem every 2 seconds
Yicai · 2025-05-30 09:32
Core Viewpoint
- Huawei has achieved significant advancements in training large models through its "Ascend + Pangu Ultra MoE" combination, enabling a fully controllable training process without GPUs and showcasing industry-leading cluster-training performance [2][3].

Group 1: Technical Innovations
- Huawei's training system has significantly improved model-training efficiency, with a pre-training model FLOPs utilization (MFU) of 41% and a post-training throughput of 35K tokens/s on the CloudMatrix 384 super node [3][34].
- The company has introduced a series of innovative solutions to address challenges in MoE pre-training and reinforcement learning (RL) post-training, including intelligent parallel-strategy selection and global dynamic load balancing [11][17].
- The training system uses a hierarchical All-to-All communication architecture to reduce communication overhead to nearly zero, enhancing the efficiency of expert-parallel communication [14][15].

Group 2: Training Process Optimization
- Cluster utilization has been optimized through a simulation-driven intelligent parallel-optimization framework that automates the selection of optimal deployment configurations [12][13].
- The team has implemented a memory-optimization framework that saves over 70% of activation memory, ensuring reliable long-term training even under increased memory pressure [25].
- RL Fusion technology allows flexible deployment modes, significantly improving resource scheduling during the inference phase and doubling utilization in RL post-training [27][28].

Group 3: Model Specifications
- The Pangu Ultra MoE model has 718 billion parameters in a 61-layer Transformer architecture designed for high sparsity and performance [32].
- Training used a cluster of 6K - 10K Ascend 800T A2 cards, achieving a high model utilization rate during the pre-training phase [32].
- The architecture supports efficient scaling to larger parameter models and clusters, with an MFU above 50% expected in future iterations [32].
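The MFU figure cited above can be reproduced with standard back-of-envelope arithmetic: training costs roughly 6 FLOPs per active parameter per token (forward plus backward), and MFU is that achieved rate divided by the cluster's peak. The active-parameter count, token throughput, and per-card peak FLOPs below are hypothetical placeholders chosen only to land near the reported 41%; they are not Huawei's published figures.

```python
# Sanity-check sketch of Model FLOPs Utilization (MFU) using the common
# ~6 FLOPs per active parameter per token estimate for training.
# All inputs are hypothetical placeholders, not published figures.

def mfu(active_params, tokens_per_s, cards, peak_flops_per_card):
    achieved = 6 * active_params * tokens_per_s   # training FLOPs/s actually done
    peak = cards * peak_flops_per_card            # cluster's theoretical maximum
    return achieved / peak

# e.g. 39e9 active params, 4M tokens/s, 6000 cards at 376 TFLOPs each
print(f"{mfu(39e9, 4e6, 6000, 376e12):.1%}")  # → 41.5%
```

Only the activated (routed) parameters count here: for a sparse MoE, MFU is computed against the per-token active compute, not the full 718B parameter store.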