Large-Model Inference Has to Be Cost-Effective
虎嗅APP · 2025-06-06 10:10
Core Insights

- The article traces the evolution and optimization of the Mixture of Experts (MoE) model, highlighting Huawei's MoGE architecture, which addresses inefficiencies in the original MoE design and improves both cost-effectiveness and ease of deployment [1][3].

Group 1: MoE Model Evolution

- Thanks to its dynamically sparse computation, the MoE model has become a key path to improving large-model inference efficiency [3].
- Huawei's Pangu Pro MoE 72B model sharply reduces computational cost and ranks first domestically on the SuperCLUE benchmark in the sub-100-billion-parameter class [3].
- Through system-level optimization, Pangu Pro MoE achieves a 6-8x improvement in inference performance and reaches a throughput of 321 tokens/s on the Ascend 300I Duo [3][30].

Group 2: Optimization Strategies

- Huawei's H2P (Hierarchical & Hybrid Parallelism) strategy improves inference efficiency by letting each task-specific group of devices communicate among themselves rather than holding a "full team meeting" across the whole cluster [5][6]; a grouping sketch appears after this summary.
- The TopoComm optimization targets communication overhead and data-transmission efficiency, cutting synchronization operations by 35% [8][10].
- The DuoStream optimization lets communication and computation run concurrently, significantly improving inference efficiency [11]; see the stream-overlap sketch below.

Group 3: Operator Fusion

- Huawei developed two specialized fusion operators, MulAttention and SwiftGMM, which streamline memory access and computation scheduling for substantial performance gains [15][17].
- MulAttention speeds up attention computation by 4.5x and sustains over 89% data-transfer efficiency [17].
- SwiftGMM accelerates grouped matrix multiplication (GMM) by 2.1x and cuts end-to-end inference latency by 48.7% [20]; a reference for what a GMM computes follows below.

Group 4: Inference Algorithm Acceleration

- The PreMoE algorithm dynamically prunes experts in the MoE model, improving throughput by more than 10% while maintaining accuracy [25]; see the pruning sketch below.
- The TrimR algorithm monitors and adjusts the model's reasoning process, cutting unnecessary inference steps by 14% [26]; a monitoring sketch follows below.
- The SpecReason algorithm uses smaller models to assist larger ones, yielding a 30% increase in throughput [27]; the draft-and-verify sketch below illustrates the pattern.

Group 5: Performance Breakthroughs

- The Ascend 800I A2 platform delivers a single-card throughput of 1528 tokens/s under optimal conditions [30][31].
- The Ascend 300I Duo platform offers a cost-effective option for MoE model inference, reaching a maximum throughput of 321 tokens/s [32][33].
- Taken together, these optimizations give Huawei a solid foundation for high-performance, large-scale, low-cost inference [33].
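
To make the H2P idea in Group 2 concrete, here is a minimal sketch of hierarchical device grouping in plain Python. The world size, group sizes, and the split into attention versus expert groups are illustrative assumptions, not Huawei's published configuration.

```python
# Illustrative H2P-style grouping: carve a 16-device deployment into small,
# task-specific communicator groups instead of one global group, so each
# collective involves only the devices that actually need to talk.

def build_groups(world_size: int, attn_group: int, expert_group: int):
    """Return device groups for attention layers and for expert layers."""
    assert world_size % attn_group == 0 and world_size % expert_group == 0
    attention_groups = [list(range(g * attn_group, (g + 1) * attn_group))
                        for g in range(world_size // attn_group)]
    expert_groups = [list(range(g * expert_group, (g + 1) * expert_group))
                     for g in range(world_size // expert_group)]
    return attention_groups, expert_groups

attn, experts = build_groups(world_size=16, attn_group=4, expert_group=8)
print("attention groups:", attn)    # 4 groups of 4: collectives stay local
print("expert groups:   ", experts) # 2 groups of 8: all-to-all stays in-group
```

In a real deployment these index lists would back communicator handles (e.g., via torch.distributed.new_group), so that all-reduce and all-to-all calls never cross group boundaries.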
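
The DuoStream item describes communication overlapping computation. Below is a hedged PyTorch sketch of the generic two-stream pattern: prefetch the next batch on a side stream while the current batch's matmul runs. The shapes, the pinned-memory copies standing in for "communication", and the helper names are assumptions, not Huawei's implementation.

```python
import torch

def overlapped_steps(weight, host_batches):
    """Multiply each pinned host batch by `weight`, overlapping the
    host-to-device copy of batch i+1 with the matmul on batch i."""
    copy_stream = torch.cuda.Stream()

    def prefetch(batch):
        with torch.cuda.stream(copy_stream):
            on_device = batch.to(weight.device, non_blocking=True)
            done = torch.cuda.Event()
            done.record(copy_stream)
        return on_device, done

    nxt = prefetch(host_batches[0])
    outputs = []
    for i in range(len(host_batches)):
        current, copied = nxt
        if i + 1 < len(host_batches):
            nxt = prefetch(host_batches[i + 1])  # runs alongside the matmul
        torch.cuda.current_stream().wait_event(copied)  # don't read too early
        outputs.append(current @ weight)
    return outputs

if torch.cuda.is_available():  # the sketch requires a CUDA device
    w = torch.randn(512, 512, device="cuda")
    batches = [torch.randn(256, 512).pin_memory() for _ in range(4)]
    print(len(overlapped_steps(w, batches)))
```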
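
Group 3 credits SwiftGMM with accelerating GMM (grouped matrix multiplication), the core computation of the expert layer. The snippet below is only a reference for what a grouped matmul computes, so the 2.1x claim has a concrete baseline; a fused operator produces the same result in a single kernel rather than a per-expert Python loop. Shapes and names are illustrative.

```python
import torch

def grouped_matmul_reference(tokens, expert_ids, expert_weights):
    """Reference semantics of GMM: each token row is multiplied by the
    weight of the expert it was routed to. A fused operator such as
    SwiftGMM computes the same thing in one kernel."""
    out = tokens.new_empty(tokens.shape[0], expert_weights.shape[-1])
    for e in range(expert_weights.shape[0]):
        mask = expert_ids == e          # tokens routed to expert e
        if mask.any():
            out[mask] = tokens[mask] @ expert_weights[e]
    return out

tokens = torch.randn(8, 16)             # 8 tokens, hidden size 16
expert_ids = torch.randint(0, 4, (8,))  # router assignment per token
weights = torch.randn(4, 16, 32)        # one 16x32 weight per expert
print(grouped_matmul_reference(tokens, expert_ids, weights).shape)
```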
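
The PreMoE bullet in Group 4 says experts are pruned dynamically. One plausible reading, sketched below, is: score experts by their aggregate router affinity on the current workload, keep only the strongest fraction in memory, and route top-k among the survivors. The keep ratio and the mean-logit score are our assumptions, not the published algorithm.

```python
import torch

def prune_and_route(router_logits, keep_ratio=0.5, top_k=2):
    """PreMoE-style idea (one reading of the summary): retain only the
    experts with the highest aggregate router affinity, then do top-k
    routing inside the retained set."""
    num_experts = router_logits.shape[-1]
    scores = router_logits.mean(dim=0)                  # per-expert affinity
    n_keep = max(top_k, int(num_experts * keep_ratio))
    kept = torch.topk(scores, n_keep).indices           # experts to load
    masked = torch.full_like(router_logits, float("-inf"))
    masked[:, kept] = router_logits[:, kept]            # pruned experts barred
    probs = torch.softmax(masked, dim=-1)
    weights, chosen = torch.topk(probs, top_k, dim=-1)  # routing among kept
    return kept, chosen, weights

logits = torch.randn(6, 16)                             # 6 tokens, 16 experts
kept, chosen, w = prune_and_route(logits)
print("experts kept in memory:", kept.tolist())
```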
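
TrimR is summarized only as "monitoring and adjusting the reasoning process". The sketch below shows the general shape of such a monitor: watch successive reasoning steps and stop once several in a row add nothing new. The redundancy test (token-set overlap) and the patience window are placeholders for whatever criterion TrimR actually uses.

```python
def trim_reasoning(steps, patience=3):
    """Keep reasoning steps until `patience` consecutive steps add no new
    content; the token-set novelty test is an illustrative placeholder."""
    seen, kept, stale = set(), [], 0
    for step in steps:
        tokens = set(step.lower().split())
        if tokens <= seen:          # nothing new in this step
            stale += 1
            if stale >= patience:   # persistent redundancy: cut the chain
                break
        else:
            stale = 0
            kept.append(step)
            seen |= tokens
    return kept

chain = ["compute 3*4", "3*4 is 12", "so 12", "so 12", "so 12", "unreached"]
print(trim_reasoning(chain, patience=2))  # redundant tail is trimmed
```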
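
Finally, SpecReason is described as a small model enhancing a large one. A common pattern matching that description (and only an assumption here) is draft-and-verify: the small model proposes each reasoning step cheaply, and the large model merely scores it, generating a step itself only on rejection. All function arguments are hypothetical stand-ins.

```python
def spec_reason(question, small_step, large_score, large_step,
                max_steps=8, accept_threshold=0.8):
    """Draft-and-verify loop: accept the small model's proposed step when
    the large model scores it highly; otherwise fall back to full decoding
    by the large model. Thresholds and callables are illustrative."""
    steps = []
    for _ in range(max_steps):
        draft = small_step(question, steps)        # cheap proposal
        if draft is None:                          # small model is done
            break
        if large_score(question, steps, draft) >= accept_threshold:
            steps.append(draft)                    # large model only verifies
        else:
            steps.append(large_step(question, steps))  # fallback: full decode
    return steps

# Toy usage with trivial stand-ins:
steps = spec_reason(
    "2+2?",
    small_step=lambda q, s: "step" if len(s) < 3 else None,
    large_score=lambda q, s, d: 0.9,
    large_step=lambda q, s: "careful step",
)
print(steps)
```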