Pangu Pro MoE 72B Model
The "killer combo" for MoE inference: Ascend × Pangu sends inference performance soaring 6-8x
机器之心 · 2025-06-06 09:36
Core Viewpoint
- The article highlights the advances in Huawei's Pangu Pro MoE 72B model, whose system-level optimizations for large-model inference deliver substantial performance gains in AI applications [2][23].

Group 1: Model Performance and Optimization
- The Pangu Pro MoE model achieves a 6-8x improvement in inference performance through system-level optimizations, including high-performance operator fusion and model-native speculative algorithms [3][23].
- Throughput reaches 321 tokens/s on the Ascend 300I Duo and up to 1528 tokens/s on the Ascend 800I A2, demonstrating that the model fully exploits the hardware's potential [3][24].

Group 2: Hierarchical and Hybrid Parallelism
- Huawei introduces a Hierarchical & Hybrid Parallelism (H2P) strategy that assigns each module its own communication and computation pattern instead of forcing all components into one global parallel scheme (see the process-group sketch after this summary) [6][7].
- This strategy yields a 33.1% increase in decode throughput over traditional parallel processing [7].

Group 3: Communication Optimization
- The TopoComm optimization scheme reduces static overhead and improves data transmission efficiency, cutting synchronization operations by 35% and raising effective bandwidth by 21% [9][12].
- A mixed-quantization communication strategy shrinks communication volume by 25% and cuts AllGather communication time by 39% (a quantized-AllGather sketch follows below) [9].

Group 4: Operator Fusion and Efficiency
- Fused operators such as MulAttention and SwiftGMM address the inefficiencies of conventional operators, markedly improving memory access and computation scheduling (a grouped-GEMM sketch follows below) [15][18].
- MulAttention accelerates attention computation by 4.5x, while SwiftGMM reduces inference latency by 48.7% [16][18].

Group 5: Dynamic Pruning and Collaborative Optimization
- The PreMoE dynamic pruning algorithm raises inference throughput by more than 10% by activating only the experts relevant to a given task (see the pruning sketch below) [21].
- The TrimR and SpecReason algorithms streamline the reasoning process, eliminating unnecessary computation and improving throughput by 30% [20][22].

Group 6: Overall System Optimization
- The end-to-end optimization of the Ascend Pangu inference system lays a solid foundation for high-performance, large-scale, cost-effective deployment of AI models [28].
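Neither article spells out how H2P is implemented. The following is a minimal sketch of the underlying idea, namely giving each module its own communication group rather than one global scheme, using torch.distributed; the 2x4 tensor-parallel/expert-parallel layout, the world size, and all function names are illustrative assumptions, not Huawei's configuration.

```python
# Sketch: per-module communication groups in the spirit of H2P.
# Assumes launch via `torchrun --nproc_per_node=8 h2p_groups.py`;
# the 2x4 TP/EP layout is illustrative, not Huawei's actual setup.
import torch.distributed as dist

def build_module_groups(world_size: int, tp_size: int = 2):
    """Create separate groups so attention (tensor-parallel) and the MoE
    FFN (expert-parallel) each communicate only within their own group."""
    tp_groups, ep_groups = [], []
    # Tensor-parallel groups: consecutive ranks share attention shards.
    for start in range(0, world_size, tp_size):
        ranks = list(range(start, start + tp_size))
        tp_groups.append(dist.new_group(ranks=ranks))
    # Expert-parallel groups: strided ranks each host different experts.
    for offset in range(tp_size):
        ranks = list(range(offset, world_size, tp_size))
        ep_groups.append(dist.new_group(ranks=ranks))
    return tp_groups, ep_groups

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")  # "hccl" on Ascend, "nccl" on GPUs
    rank, world = dist.get_rank(), dist.get_world_size()
    tp_groups, ep_groups = build_module_groups(world)
    # Each rank belongs to exactly one TP group and one EP group, so
    # attention all-reduces and expert all-to-alls never block each other.
    print(f"rank {rank}: tp_group={rank // 2}, ep_group={rank % 2}")
    dist.destroy_process_group()
```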
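The article does not disclose the exact mixed-quantization scheme behind the 25% traffic reduction. The sketch below shows the generic pattern (quantize activations to INT8 before AllGather, dequantize after) as a single-process NumPy simulation; the per-row symmetric quantizer and all function names are assumptions.

```python
# Sketch: quantize-before-AllGather, the generic pattern behind
# communication-volume reduction. Per-row symmetric INT8 quantization
# is an assumption; the article does not specify Huawei's exact scheme.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Per-row symmetric quantization: fp16 activations -> int8 + fp16 scales."""
    scale = np.abs(x.astype(np.float32)).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale.astype(np.float32)).astype(np.float16)

# Simulated AllGather across 4 "ranks": each rank contributes one shard.
rng = np.random.default_rng(0)
shards = [rng.standard_normal((32, 1024)).astype(np.float16) for _ in range(4)]

# Wire format: int8 payload + fp16 scales instead of raw fp16 activations.
wire = [quantize_int8(s) for s in shards]
gathered = np.concatenate([dequantize_int8(q, s) for q, s in wire], axis=0)

raw_bytes = sum(s.nbytes for s in shards)
comm_bytes = sum(q.nbytes + s.nbytes for q, s in wire)
err = np.abs(gathered - np.concatenate(shards, axis=0)).max()
print(f"traffic: {comm_bytes / raw_bytes:.0%} of fp16, max abs error {err:.4f}")
```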
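MulAttention and SwiftGMM are Ascend-specific fused kernels and cannot be reproduced here. The sketch below only illustrates the workload SwiftGMM targets: replacing a loop of small per-expert matmuls with one grouped, batched computation. It is plain PyTorch, and the shapes, token counts, and pad-to-batch trick are illustrative assumptions.

```python
# Sketch: the workload SwiftGMM targets, MoE expert FFNs as many small,
# unevenly sized GEMMs. Baseline loops per expert; the "grouped" variant
# pads token groups to a common length and issues one batched matmul.
import torch

E, d, h = 8, 256, 1024              # experts, model dim, hidden dim
W = torch.randn(E, d, h)            # one weight matrix per expert
counts = [3, 17, 6, 42, 9, 1, 28, 5]  # tokens routed to each expert
tokens = [torch.randn(n, d) for n in counts]

def loop_gemm(tokens, W):
    # Baseline: E separate small matmuls, each its own kernel launch.
    return [t @ W[e] for e, t in enumerate(tokens)]

def grouped_gemm(tokens, W):
    # Grouped: pad every expert's tokens to max_n, run a single bmm,
    # then slice padding away. Trades some wasted FLOPs for one launch.
    max_n = max(t.shape[0] for t in tokens)
    padded = torch.stack(
        [torch.nn.functional.pad(t, (0, 0, 0, max_n - t.shape[0])) for t in tokens]
    )                                # (E, max_n, d)
    out = torch.bmm(padded, W)       # (E, max_n, h) in one call
    return [out[e, : t.shape[0]] for e, t in enumerate(tokens)]

ref, fused = loop_gemm(tokens, W), grouped_gemm(tokens, W)
print(max((a - b).abs().max().item() for a, b in zip(ref, fused)))
```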
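The article does not describe PreMoE's expert-relevance criterion. The sketch below shows one plausible reading: score experts by the router probability mass they receive on probe prompts from the target task, then keep only the top scorers. The scoring rule, keep ratio, and all names are assumptions.

```python
# Sketch: task-conditioned expert pruning in the spirit of PreMoE.
# Relevance = average router probability an expert receives on probe
# tokens from the target task; only the top-k experts stay loaded.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def select_task_experts(router_w, probe_acts, keep: int):
    """router_w: (d, E) router weights; probe_acts: (n, d) activations
    from prompts of the target task. Returns indices of experts to keep."""
    probs = softmax(probe_acts @ router_w)        # (n, E) routing probabilities
    relevance = probs.mean(axis=0)                # average mass per expert
    return np.argsort(relevance)[-keep:]          # top-k most-used experts

rng = np.random.default_rng(0)
d, E = 128, 64
router_w = rng.standard_normal((d, E))
probe_acts = rng.standard_normal((256, d))        # stand-in for task activations

kept = select_task_experts(router_w, probe_acts, keep=16)
print(f"loading {len(kept)}/{E} experts:", sorted(kept.tolist()))
# At inference, routing is restricted to `kept`, cutting memory and
# compute for experts the task never activates.
```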
Born on Ascend, one step ahead: unveiling the Pangu Pro MoE end-to-end optimized inference system
雷峰网 · 2025-06-06 09:26
Core Viewpoint
- Huawei's Pangu Pro MoE 72B model markedly improves inference efficiency through system-level optimizations and novel parallel processing strategies, setting a benchmark for MoE inference [2][25].

Group 1: Model and Performance Enhancements
- The Pangu Pro MoE model reduces computational overhead and ranks first domestically on the SuperCLUE benchmark among models within the 100-billion-parameter class [2].
- Its inference performance improves by 6-8x, reaching a throughput of 321 tokens/s on the Ascend 300I Duo and up to 1528 tokens/s on the Ascend 800I A2 [2][26].

Group 2: Optimization Strategies
- The Hierarchical & Hybrid Parallelism (H2P) strategy improves efficiency by confining specialized communication within modules, avoiding the inefficiencies of traditional one-size-fits-all parallelism [4][5].
- The TopoComm optimization reduces static overhead and improves data transmission efficiency, delivering a 21% increase in effective bandwidth and a 39% reduction in AllGather communication time [6][12].
- The DuoStream strategy interleaves computation and communication so the two proceed concurrently, significantly raising overall efficiency (see the overlap sketch after this summary) [8][10].

Group 3: Operator Fusion
- Huawei developed two specialized fused operators, MulAttention and SwiftGMM, to optimize memory access and computation scheduling, yielding substantial performance gains in inference tasks [13][14].
- The MulAttention operator accelerates attention computation by 4.5x, while the SwiftGMM operator reduces decoding latency by 48.7% [15][18].

Group 4: Algorithmic Innovations
- The PreMoE algorithm dynamically prunes experts in the MoE model, improving throughput by more than 10% while preserving accuracy [22].
- The TrimR and SpecReason algorithms streamline the reasoning process, cutting unnecessary computation and improving throughput by 14% and 30%, respectively (a speculative-reasoning sketch follows below) [23][21].

Group 5: Overall System Performance
- The Ascend 300I Duo platform combines low latency with high throughput, reaching 321 tokens/s under optimal conditions and offering a cost-effective option for diverse inference workloads [29][30].
- The comprehensive optimization of the Pangu inference system lays a robust foundation for large-scale deployment and efficient adoption of general-purpose large models [31].
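DuoStream is implemented at the Ascend operator level; the sketch below shows only the general compute/communication overlap pattern it relies on, using torch.distributed's async collectives. The launch setup and tensor shapes are assumptions.

```python
# Sketch: overlapping communication with computation, the general
# pattern DuoStream applies at the operator level on Ascend.
# Assumes launch via `torchrun --nproc_per_node=2 overlap.py`.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # "hccl" on Ascend NPUs
rank, world = dist.get_rank(), dist.get_world_size()

shard = torch.randn(1024, 1024)
bufs = [torch.empty_like(shard) for _ in range(world)]

# Kick off the AllGather without blocking...
handle = dist.all_gather(bufs, shard, async_op=True)

# ...and do useful local work (e.g., attention for the next token)
# while the interconnect moves expert activations in the background.
local = shard @ shard.T

handle.wait()                      # communication finishes here
gathered = torch.cat(bufs, dim=0)  # now safe to consume gathered data
print(f"rank {rank}: overlap done, gathered {tuple(gathered.shape)}")
dist.destroy_process_group()
```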
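The article gives no details of SpecReason's mechanics. The sketch below shows the generic propose-then-verify loop that speculative approaches share: a cheap draft model writes reasoning steps and the large model only verifies or rewrites them. The function names and toy models are hypothetical.

```python
# Sketch: the propose-then-verify loop common to speculative approaches
# such as SpecReason: a cheap draft model writes reasoning steps, and the
# large model only verifies (or rewrites) them. Toy models stand in here.
from typing import Callable, List

def speculative_reason(
    draft_step: Callable[[List[str]], str],    # small model: propose next step
    verify: Callable[[List[str], str], bool],  # large model: accept proposal?
    big_step: Callable[[List[str]], str],      # large model: fallback generation
    max_steps: int = 8,
) -> List[str]:
    steps: List[str] = []
    for _ in range(max_steps):
        proposal = draft_step(steps)           # cheap
        if verify(steps, proposal):            # one verification pass
            steps.append(proposal)             # accepted: big model never ran
        else:
            steps.append(big_step(steps))      # rejected: pay full cost once
        if steps[-1].endswith("DONE"):
            break
    return steps

# Toy stand-ins: the draft is accepted 3 steps out of every 4.
draft = lambda s: f"step {len(s)}" + (" DONE" if len(s) == 5 else "")
check = lambda s, p: len(s) % 4 != 3
strong = lambda s: f"step {len(s)} (rewritten)" + (" DONE" if len(s) == 5 else "")

print(speculative_reason(draft, check, strong))
```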