Mixture of Experts (MoE) Models
Born on Ascend, One Step Ahead: Inside the Pangu Pro MoE Full-Stack Optimized Inference System
雷峰网· 2025-06-06 09:26
Core Viewpoint
- Huawei's Pangu Pro MoE 72B model significantly improves inference efficiency through system-level optimization and innovative parallelism strategies, setting a benchmark in the MoE inference landscape [2][25].

Group 1: Model and Performance Enhancements
- The Pangu Pro MoE model reduces computational overhead and ranks first domestically on the SuperCLUE benchmark among models with under 100 billion parameters [2].
- Inference performance improves by 6-8x, reaching 321 tokens/s on the Ascend 300I Duo and up to 1528 tokens/s on the Ascend 800I A2 [2][26].

Group 2: Optimization Strategies
- The Hierarchical & Hybrid Parallelism (H2P) strategy raises efficiency by giving each module its own specialized communication scheme, avoiding the inefficiency of a single uniform parallelization strategy [4][5].
- The TopoComm optimization reduces static overhead and improves data-transmission efficiency, delivering a 21% gain in effective bandwidth and a 39% reduction in AllGather communication time [6][12].
- The DuoStream strategy overlaps computation and communication so the two run concurrently, significantly boosting overall efficiency [8][10].

Group 3: Operator Fusion
- Huawei developed two specialized fused operators, MulAttention and SwiftGMM, to optimize memory access and computation scheduling, yielding substantial performance gains in inference tasks [13][14].
- The MulAttention operator accelerates attention computation by 4.5x, while the SwiftGMM operator cuts decoding latency by 48.7% [15][18].

Group 4: Algorithmic Innovations
- The PreMoE algorithm dynamically prunes experts in the MoE model, lifting throughput by over 10% while maintaining accuracy (a toy pruning sketch follows this summary) [22].
- The TrimR and SpecReason algorithms streamline the reasoning process, removing unnecessary computation and improving throughput by 14% and 30%, respectively [23][21].

Group 5: Overall System Performance
- The Ascend 300I Duo platform delivers low latency and high throughput, reaching 321 tokens/s under optimal conditions, making it a cost-effective choice for a wide range of inference applications [29][30].
- The comprehensive optimization of the Pangu inference system lays a solid foundation for large-scale deployment and efficient serving of general-purpose large models [31].
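The figures above come from the article itself. As a rough illustration of the expert-pruning idea behind PreMoE, the minimal sketch below scores experts on a batch, keeps only the highest-scoring fraction, and routes tokens among the retained experts. The function name `prune_and_route`, the relevance heuristic, and the shapes are assumptions for illustration, not Huawei's algorithm or API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prune_and_route(token_states, gate_weights, keep_ratio=0.5, top_k=2):
    """Toy task-aware expert pruning: score experts on the current batch,
    keep only the highest-scoring fraction, then do top-k routing among
    the retained experts (illustrative of the pruning idea, not PreMoE itself)."""
    logits = token_states @ gate_weights             # [tokens, num_experts]
    probs = softmax(logits, axis=-1)
    expert_relevance = probs.mean(axis=0)            # aggregate relevance per expert
    num_keep = max(top_k, int(len(expert_relevance) * keep_ratio))
    kept = np.argsort(expert_relevance)[-num_keep:]  # ids of retained experts
    kept_probs = probs[:, kept]                      # route only among kept experts
    topk_idx = np.argsort(kept_probs, axis=-1)[:, -top_k:]
    return kept, kept[topk_idx]                      # retained experts, per-token routing

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))                # 8 tokens, hidden size 16
gate = rng.standard_normal((16, 8))                  # 8 experts
kept, routes = prune_and_route(tokens, gate)
print("retained experts:", kept)
print("per-token expert ids:\n", routes)
```

Pruning before routing is what would let a deployment skip loading and computing the cold experts at all, which is where the reported throughput headroom plausibly comes from.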
A God's-Eye View of Ascend MoE Training as a Smart Traffic System: Adaptive Pipe & EDPB Boost Training Efficiency by 70%
华尔街见闻· 2025-06-03 13:05
Core Viewpoint
- The rapid development of large models has made the Mixture of Experts (MoE) architecture a major direction for scaling model capability thanks to its architectural advantages, but training efficiency in distributed cluster environments remains a critical bottleneck [1][2].

Group 1: MoE Model Challenges
- MoE training efficiency faces two main challenges: (1) expert parallelism introduces computation and communication waiting, especially at large model scale, leaving compute units idle while communication completes [2][3]; (2) load imbalance leaves some experts heavily called while others sit underutilized, causing further waiting across compute units [2].

Group 2: Optimization Solutions
- Huawei developed the Adaptive Pipe & EDPB optimization suite, which aims to eliminate waiting in MoE training systems through better communication scheduling and load balancing [3][10].
- The AutoDeploy simulation platform rapidly analyzes diverse training workloads and automatically identifies parallelization strategies matched to the cluster's hardware, predicting training performance with 90% accuracy [4].

Group 3: Communication and Load Balancing Innovations
- The Adaptive Pipe communication framework masks over 98% of communication, letting computation proceed without waiting on communication [6][7].
- EDPB global load balancing raises training efficiency by 25.5% by keeping expert scheduling balanced throughout training [10].

Group 4: Dynamic Load Balancing Techniques
- The team introduced dynamic expert migration, which moves experts intelligently between distributed devices based on predicted load trends to correct imbalance (a toy rebalancing sketch follows this summary) [12][14].
- A dynamic data rearrangement scheme minimizes computation time without sacrificing training accuracy, achieving load balance during pre-training [14].

Group 5: Overall System Benefits
- Together, Adaptive Pipe & EDPB raise end-to-end training throughput by 72.6% on the Pangu Ultra MoE 718B model, a substantial gain in training efficiency [17].
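To make the expert-migration idea in Group 4 concrete, here is a minimal sketch that greedily moves experts from the most loaded device to the least loaded one whenever the move shrinks the gap. The device names, per-expert loads, and greedy rule are illustrative assumptions, not the EDPB algorithm or its predictive load model.

```python
def rebalance_experts(placement, expert_load, max_moves=3):
    """Greedy expert-migration sketch: repeatedly move the largest expert whose
    relocation still narrows the gap between the hottest and coldest device.
    Illustrative of the dynamic-migration idea only, not Huawei's EDPB."""
    placement = {d: list(exps) for d, exps in placement.items()}
    for _ in range(max_moves):
        loads = {d: sum(expert_load[e] for e in exps) for d, exps in placement.items()}
        src = max(loads, key=loads.get)   # most loaded device
        dst = min(loads, key=loads.get)   # least loaded device
        # experts whose move keeps dst below src's current load (so the max shrinks)
        candidates = [e for e in placement[src]
                      if loads[dst] + expert_load[e] < loads[src]]
        if not candidates:
            break
        mover = max(candidates, key=lambda e: expert_load[e])
        placement[src].remove(mover)
        placement[dst].append(mover)
    return placement

# Hypothetical loads: e0 and e3 are "hot" experts, the rest are mostly idle.
expert_load = {"e0": 90, "e1": 10, "e2": 15, "e3": 70, "e4": 8, "e5": 12}
devices = {"npu0": ["e0", "e3"], "npu1": ["e1", "e4"], "npu2": ["e2", "e5"]}
print(rebalance_experts(devices, expert_load))
```

The real system reportedly migrates experts based on predicted load trends rather than a static snapshot, but the payoff is the same: the slowest device stops dictating the step time.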
Are the Experts Idle Half the Time? Adaptive Pipe & EDPB Boost Ascend MoE Training Efficiency by 70%
雷峰网· 2025-06-03 07:17
Core Viewpoint
- The article examines the challenges and remedies for MoE training efficiency, noting that more than half of training time can be wasted on waiting caused by communication and load-imbalance issues [2][3][4].

Group 1: MoE Model Training Challenges
- MoE training clusters face two main challenges: communication waiting introduced by expert parallelism, and load imbalance that leaves compute units waiting [4].
- Communication waiting arises because splitting experts across devices requires All-to-All communication, leaving compute units idle while it completes [4].
- Load imbalance occurs because some experts are called frequently while others sit underutilized, made worse by varying training-sequence lengths and differing computational loads across model layers [4].

Group 2: Solutions Implemented
- Huawei developed the Adaptive Pipe and EDPB optimizations to improve MoE training efficiency, likening the resulting system to a smart traffic hub that eliminates waiting [5][22].
- The AutoDeploy simulation platform rapidly analyzes and optimizes training workloads, finding strategies matched to the hardware specification with 90% accuracy [8][22].
- The Adaptive Pipe communication framework masks over 98% of communication, letting computation proceed without waiting on communication (a toy overlap sketch follows this summary) [10][11].

Group 3: Performance Improvements
- The EDPB global load-balancing technique improves throughput by 25.5% by keeping expert scheduling balanced during training [14].
- End-to-end training throughput increased by 72.6% in Pangu Ultra MoE 718B training, a significant performance gain [22][23].
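To make the communication-masking idea concrete, the toy timing below overlaps a simulated All-to-All with an independent computation and compares it against running the two serially. The sleep-based stand-ins and the thread-pool structure are assumptions for illustration only; the actual framework schedules Ascend communication and compute streams across pipeline stages, not Python threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def all_to_all(batch):
    """Stand-in for expert-parallel All-to-All: just waits, as real comm would."""
    time.sleep(0.2)
    return batch

def attention_compute(batch):
    """Stand-in for computation that does not depend on the in-flight All-to-All."""
    time.sleep(0.2)
    return batch

batch_a, batch_b = "micro-batch A", "micro-batch B"

# Serial schedule: computation waits for communication to finish.
start = time.perf_counter()
all_to_all(batch_a)
attention_compute(batch_b)
serial = time.perf_counter() - start

# Overlapped schedule: launch communication, run independent compute meanwhile.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    comm = pool.submit(all_to_all, batch_a)   # communication in flight
    attention_compute(batch_b)                # independent compute proceeds
    comm.result()                             # sync before the dependent expert FFN
overlapped = time.perf_counter() - start

print(f"serial: {serial:.2f}s, overlapped: {overlapped:.2f}s")
```

When nearly all communication can be hidden behind independent computation in this way, the step time is set by compute alone, which is what the reported 98% masking figure describes.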