Born on Ascend, a Step Ahead: Inside the Pangu Pro MoE Full-Stack Optimized Inference System
Leiphone (雷峰网) · 2025-06-06 09:26

Core Viewpoint
- Huawei's Pangu Pro MoE 72B model significantly enhances inference efficiency through system-level optimizations and innovative parallel processing strategies, establishing a benchmark in the MoE inference landscape [2][25].

Group 1: Model and Performance Enhancements
- The Pangu Pro MoE model reduces computational overhead and ranks first domestically in the SuperCLUE benchmark among models with under 100 billion parameters [2].
- Inference performance improves by 6-8x, reaching a throughput of 321 tokens/s on the Ascend 300I Duo and up to 1528 tokens/s on the Ascend 800I A2 [2][26].

Group 2: Optimization Strategies
- The Hierarchical & Hybrid Parallelism (H2P) strategy gives each module its own parallel scheme and communication domain, avoiding the inefficiencies of a single uniform parallel layout (a sketch of the idea follows the digest) [4][5].
- The TopoComm optimization reduces static communication overhead and improves data transmission efficiency, yielding a 21% increase in effective bandwidth and a 39% reduction in AllGather communication time (see the topology-aware AllGather sketch below) [6][12].
- The DuoStream strategy overlaps computation with communication, executing both concurrently so that neither waits on the other, which significantly boosts overall efficiency (see the overlap sketch below) [8][10].

Group 3: Operator Fusion
- Huawei developed two specialized fused operators, MulAttention and SwiftGMM, to optimize memory access and computation scheduling, yielding substantial performance gains in inference tasks (the grouped-matmul sketch below shows what SwiftGMM computes) [13][14].
- The MulAttention operator accelerates attention computation by 4.5x, while the SwiftGMM operator reduces decoding latency by 48.7% [15][18].

Group 4: Algorithmic Innovations
- The PreMoE algorithm dynamically prunes experts in the MoE model, raising throughput by over 10% while maintaining accuracy (sketched below) [22].
- The TrimR and SpecReason algorithms streamline the reasoning process, cutting unnecessary computation and improving throughput by 14% and 30%, respectively (a TrimR-style sketch closes the digest) [23][21].

Group 5: Overall System Performance
- The Ascend 300I Duo platform combines low latency with high throughput, reaching 321 tokens/s under optimal conditions, making it a cost-effective option for a wide range of inference workloads [29][30].
- The end-to-end optimization of the Pangu inference system lays a robust foundation for large-scale deployment and efficient serving of general-purpose large models [31].
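The digest above compresses several mechanisms; the sketches that follow unpack them one at a time. First, H2P: the article describes giving each module its own parallel scheme and communication group rather than one global layout. The minimal Python sketch below illustrates that idea only; the module names, scheme labels, and group sizes are invented for illustration and are not Huawei's actual configuration.

```python
# Hypothetical sketch of the H2P idea: each module gets its own parallel
# scheme and communication group instead of one uniform layout.
# Module names, schemes, and group sizes are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class ParallelPlan:
    module: str        # which part of the model this plan covers
    scheme: str        # e.g. "data", "tensor", "expert"
    group_size: int    # ranks that communicate together for this module

def build_groups(world_size: int, plan: ParallelPlan) -> list[list[int]]:
    """Partition ranks into communication groups of plan.group_size."""
    assert world_size % plan.group_size == 0
    return [list(range(s, s + plan.group_size))
            for s in range(0, world_size, plan.group_size)]

WORLD = 8
plans = [
    ParallelPlan("attention", "data",   2),  # small groups: cheap sync
    ParallelPlan("experts",   "expert", 8),  # all ranks: experts spread wide
    ParallelPlan("lm_head",   "tensor", 4),
]
for p in plans:
    print(f"{p.module:>9} ({p.scheme:>6}):", build_groups(WORLD, p))
```

The point of the sketch: attention, expert, and output layers end up with differently shaped communication domains, so each module only synchronizes with the ranks it actually needs.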
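TopoComm's exact mechanics are not detailed in the digest. A standard way to cut AllGather cost on hierarchical topologies is to gather within a node first and only then exchange across nodes; the toy cost model below, with a made-up cluster shape, shows why that reduces traffic on the slow inter-node links. It is a plausible illustration of topology-aware collectives, not Huawei's implementation.

```python
# Toy cost model of a topology-aware AllGather. We count how many data
# shards must cross slow inter-node links; the cluster shape is assumed.

def flat_ring_crossings(ranks: int, per_node: int) -> int:
    """Ring AllGather over all ranks: one shard per message, ranks-1 steps."""
    crossings = 0
    for _ in range(ranks - 1):
        for src in range(ranks):
            dst = (src + 1) % ranks
            if src // per_node != dst // per_node:  # message leaves the node
                crossings += 1
    return crossings

def hierarchical_crossings(ranks: int, per_node: int) -> int:
    """Gather inside each node first, then ring among one leader per node.
    Each leader message now carries a whole node's shards."""
    nodes = ranks // per_node
    return (nodes - 1) * nodes * per_node  # steps * links * shards/message

R, P = 16, 8  # 16 ranks, 8 per node
print("flat ring   :", flat_ring_crossings(R, P), "shard-crossings")    # 30
print("hierarchical:", hierarchical_crossings(R, P), "shard-crossings") # 16
```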
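DuoStream's gain comes from overlapping computation with communication. Real inference engines do this with hardware streams; the sketch below imitates only the scheduling pattern, using Python threads and sleeps as stand-ins, to show where the time savings come from.

```python
# Minimal sketch of compute/communication overlap: launch communication
# for one chunk in the background while computing the next chunk.
# The sleep durations are arbitrary stand-ins for real kernel times.
import threading
import time

def compute(chunk: int) -> None:
    time.sleep(0.05)            # pretend: expert GEMMs for this chunk

def communicate(chunk: int) -> None:
    time.sleep(0.05)            # pretend: token dispatch/combine traffic

def serial(chunks: int) -> float:
    t0 = time.perf_counter()
    for c in range(chunks):
        compute(c)
        communicate(c)
    return time.perf_counter() - t0

def overlapped(chunks: int) -> float:
    t0 = time.perf_counter()
    comm = None
    for c in range(chunks):
        compute(c)                                   # compute chunk c while
        if comm:                                     # chunk c-1's comms run
            comm.join()                              # in the background
        comm = threading.Thread(target=communicate, args=(c,))
        comm.start()
    comm.join()
    return time.perf_counter() - t0

print(f"serial    : {serial(8):.2f}s")      # ~0.80s
print(f"overlapped: {overlapped(8):.2f}s")  # ~0.45s: comms hide under compute
```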
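SwiftGMM is a fused grouped-matmul (GMM) operator for the expert computation. The numpy sketch below shows the semantics such an operator implements: tokens are bucketed by expert and each bucket is multiplied by its expert's weights in one contiguous pass. All shapes and the routing are random stand-ins; the fusion itself happens in the Ascend kernel, not here.

```python
# What a grouped matmul computes in a MoE layer, shown two ways.
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_out, n_experts = 12, 16, 32, 4

x = rng.standard_normal((tokens, d_in))
W = rng.standard_normal((n_experts, d_in, d_out))  # one weight per expert
assign = rng.integers(0, n_experts, size=tokens)   # router's choice per token

# Naive version: one small matmul per expert bucket (kernel-launch heavy).
y_naive = np.empty((tokens, d_out))
for e in range(n_experts):
    idx = np.where(assign == e)[0]
    y_naive[idx] = x[idx] @ W[e]

# "Grouped" version: sort tokens by expert so each expert's rows are
# contiguous, then walk the groups once; this contiguous access pattern
# is what a fused GMM kernel exploits.
order = np.argsort(assign, kind="stable")
y_grouped = np.empty((tokens, d_out))
start = 0
for e, count in zip(*np.unique(assign, return_counts=True)):
    sel = order[start:start + count]
    y_grouped[sel] = x[sel] @ W[e]
    start += count

assert np.allclose(y_naive, y_grouped)
print("grouped matmul matches the naive per-expert matmuls")
```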
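PreMoE prunes experts dynamically per workload. One plausible reading of that idea, sketched below with invented thresholds and shapes, is to estimate each expert's routing mass on a probe batch and drop the experts the current task barely uses, then route only within the survivors.

```python
# Sketch of dynamic expert pruning in the spirit of PreMoE.
# The 5% threshold, probe size, and synthetic router are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_experts, probe_tokens, top_k = 8, 256, 2

logits = rng.standard_normal((probe_tokens, n_experts))
logits[:, :3] += 2.0                      # pretend this task favors experts 0-2
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

usage = probs.mean(axis=0)                # average routing mass per expert
keep = np.where(usage >= 0.05)[0]         # prune experts below 5% usage
print("kept experts:", keep, "-> the rest can be evicted from memory")

def route(token_logits: np.ndarray) -> np.ndarray:
    """Top-k routing restricted to the retained experts."""
    masked = token_logits[keep]
    return keep[np.argsort(masked)[-top_k:]]

print("example token routed to experts:", route(rng.standard_normal(n_experts)))
```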
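Finally, TrimR targets redundant reasoning. A minimal stand-in for that idea: track the intermediate answer a model emits after each reasoning step and truncate once it has been stable for a few steps. Everything in this sketch (the step strings, the patience window) is illustrative rather than the published algorithm; SpecReason, which pairs a small drafting model with large-model verification, is not sketched here.

```python
# Toy "stop when the reasoning stops changing" rule.
# `patience` is an invented hyperparameter for illustration.

def trim_reasoning(step_answers: list[str], patience: int = 3) -> int:
    """Return how many reasoning steps to keep before truncating."""
    streak = 1
    for i in range(1, len(step_answers)):
        streak = streak + 1 if step_answers[i] == step_answers[i - 1] else 1
        if streak >= patience:            # answer unchanged `patience` times
            return i + 1                  # keep up to here, drop the rest
    return len(step_answers)

steps = ["x=3?", "x=4", "x=4", "x=4", "x=4", "x=4"]  # later steps add nothing
cut = trim_reasoning(steps)
print(f"keep {cut} of {len(steps)} steps; skip {len(steps) - cut} redundant ones")
```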