Large-Model Inference Has to Be Cost-Effective
虎嗅APP · 2025-06-06 10:10
Core Insights
- The article discusses the evolution and optimization of the Mixture of Experts (MoE) model, highlighting Huawei's MoGE architecture, which addresses inefficiencies in the original MoE design and improves cost-effectiveness and ease of deployment [1][3]

Group 1: MoE Model Evolution
- The MoE model has become a key path to improving large-model inference efficiency thanks to its dynamic sparse computation [3]
- Huawei's Pangu Pro MoE 72B model significantly reduces computational cost and ranks first domestically in the SuperCLUE benchmark among models under 100 billion parameters [3]
- Through system-level optimization, the Pangu Pro MoE model achieves a 6-8x improvement in inference performance and reaches a throughput of 321 tokens/s on the Ascend 300I Duo [3][30]

Group 2: Optimization Strategies
- Huawei's H2P (Hierarchical & Hybrid Parallelism) strategy improves inference efficiency by letting task-specific groups communicate internally instead of holding a "full team meeting" [5][6]
- The TopoComm optimization reduces communication overhead and improves data-transmission efficiency, cutting synchronization operations by 35% [8][10]
- The DuoStream optimization overlaps communication with computation, executing them concurrently and significantly improving inference efficiency [11]

Group 3: Operator Fusion
- Huawei developed two specialized fusion operators, MulAttention and SwiftGMM, to optimize memory access and computation scheduling, yielding substantial performance gains [15][17]
- MulAttention speeds up attention computation by 4.5x and reaches over 89% data-transfer efficiency [17]
- SwiftGMM accelerates GMM computation by 2.1x and reduces end-to-end inference latency by 48.7% [20]

Group 4: Inference Algorithm Acceleration
- The PreMoE algorithm dynamically prunes experts in the MoE model, improving throughput by over 10% while maintaining accuracy (a minimal sketch of the pruning idea follows below) [25]
- The TrimR algorithm reduces unnecessary reasoning steps by 14% by monitoring and adjusting the model's reasoning process [26]
- The SpecReason algorithm uses smaller models to assist the larger model, increasing throughput by 30% [27]

Group 5: Performance Breakthroughs
- The Ascend 800I A2 platform delivers a single-card throughput of 1528 tokens/s under optimal conditions [30][31]
- The Ascend 300I Duo platform offers a cost-effective option for MoE inference, reaching a maximum throughput of 321 tokens/s [32][33]
- Overall, Huawei's optimizations lay a solid foundation for high-performance, large-scale, low-cost inference [33]
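The summaries do not spell out PreMoE's pruning rule beyond "dynamically prunes experts". As one way to picture task-aware expert pruning, the sketch below scores each expert by how often the router selects it on a small calibration set for the target task and keeps only the most-used experts; the function name, scoring rule, and keep_ratio parameter are illustrative assumptions, not Huawei's implementation.

```python
import torch

def prune_experts_by_task_affinity(router_logits_history: torch.Tensor,
                                   keep_ratio: float = 0.5) -> torch.Tensor:
    """Task-aware expert pruning sketch (illustrative; the real PreMoE rule
    is not described in the article).

    router_logits_history: (num_tokens, num_experts) router logits collected
    while running a few calibration prompts for the target task.
    Returns the indices of the experts to keep loaded for that task.
    """
    probs = torch.softmax(router_logits_history, dim=-1)
    affinity = probs.mean(dim=0)                        # average routing probability per expert
    num_keep = max(1, int(affinity.numel() * keep_ratio))
    keep_idx = affinity.topk(num_keep).indices          # experts this task actually relies on
    return torch.sort(keep_idx).values

# Example: 64 experts, keep the half most relevant to this task's calibration data.
calib_logits = torch.randn(2048, 64)
active_experts = prune_experts_by_task_affinity(calib_logits, keep_ratio=0.5)
```

Under a scheme like this, only the kept experts need to stay resident while serving that task, which is the kind of saving consistent with the reported >10% throughput gain at maintained accuracy.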
A "Killer Combo" for MoE Inference: Ascend × Pangu Drive a 6-8x Surge in Inference Performance
机器之心 · 2025-06-06 09:36
Core Viewpoint
- The article emphasizes the significant advances in Huawei's Pangu Pro MoE 72B model, highlighting how its inference optimizations deliver substantial performance improvements for AI applications [2][23]

Group 1: Model Performance and Optimization
- The Pangu Pro MoE model achieves a 6-8x improvement in inference performance through system-level optimization, including high-performance operator fusion and model-native speculative algorithms [3][23]
- The model's throughput reaches 321 tokens/s on the Ascend 300I Duo and up to 1528 tokens/s on the Ascend 800I A2, showing that it can fully exploit the hardware's potential [3][24]

Group 2: Hierarchical and Hybrid Parallelism
- Huawei introduces a Hierarchical & Hybrid Parallelism (H2P) strategy that improves efficiency by giving each module specialized communication and computation instead of engaging all components simultaneously [6][7]
- This strategy yields a 33.1% increase in decode throughput over conventional parallelism [7]

Group 3: Communication Optimization
- The TopoComm optimization scheme reduces static overhead and improves data-transmission efficiency, achieving a 35% reduction in synchronization operations and a 21% increase in effective bandwidth [9][12]
- A mixed-precision quantized communication strategy cuts communication data volume by 25% and shortens AllGather communication time by 39% (a minimal sketch of the quantize-before-gather idea follows below) [9]

Group 4: Operator Fusion and Efficiency
- Fusion operators such as MulAttention and SwiftGMM address the inefficiencies of conventional operators, significantly improving memory access and computation scheduling [15][18]
- MulAttention accelerates attention computation by 4.5x, while SwiftGMM reduces inference latency by 48.7% [16][18]

Group 5: Dynamic Pruning and Collaborative Optimization
- The PreMoE dynamic pruning algorithm raises inference throughput by over 10% by activating only the experts relevant to a given task [21]
- The TrimR and SpecReason algorithms streamline the reasoning process, cutting unnecessary computation and improving throughput by 30% [20][22]

Group 6: Overall System Optimization
- The end-to-end optimization of the Ascend-based Pangu inference system lays a solid foundation for high-performance, large-scale, cost-effective AI model deployment [28]
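The article reports that mixed-precision quantized communication shrinks the data exchanged during AllGather. As a rough illustration of why this helps, the sketch below compresses fp16 activations to int8 with a shared scale before the collective and dequantizes them afterward; the per-tensor int8 scheme and function names are assumptions for illustration, not the actual TopoComm codec.

```python
import torch

def quantize_for_collective(x: torch.Tensor):
    """Compress fp16 activations to int8 plus a single fp32 scale before a
    collective such as AllGather (illustrative sketch; TopoComm's real
    mixed-precision scheme is not described in the article)."""
    scale = x.abs().amax().float() / 127.0
    q = torch.clamp((x.float() / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_after_collective(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an fp16 approximation of the original tensor on the receiving side."""
    return (q.to(torch.float32) * scale).to(torch.float16)

# Each rank would send (q, scale) through AllGather instead of the raw fp16 tensor,
# roughly halving the bytes on the wire at the cost of a small quantization error.
activations = torch.randn(1024, 4096, dtype=torch.float16)
q, s = quantize_for_collective(activations)
recovered = dequantize_after_collective(q, s)
max_err = (recovered.float() - activations.float()).abs().max()
```

The collective itself is unchanged; only the payload shrinks, which is where a reduction in communication volume of the kind the article cites would come from under a scheme like this.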
Born on Ascend, a Step Ahead: Inside the End-to-End Optimized Inference System of Pangu Pro MoE
雷峰网 · 2025-06-06 09:26
Core Viewpoint
- Huawei's Pangu Pro MoE 72B model significantly improves inference efficiency through system-level optimization and innovative parallelism strategies, setting a benchmark in the MoE inference landscape [2][25]

Group 1: Model and Performance Enhancements
- The Pangu Pro MoE model reduces computational overhead and ranks first domestically in the SuperCLUE benchmark among models under 100 billion parameters [2]
- Inference performance is improved 6-8x, reaching a throughput of 321 tokens/s on the Ascend 300I Duo and up to 1528 tokens/s on the Ascend 800I A2 [2][26]

Group 2: Optimization Strategies
- The Hierarchical & Hybrid Parallelism (H2P) strategy improves efficiency by keeping communication local to each module, avoiding the overhead of conventional global parallelism [4][5]
- The TopoComm optimization reduces static overhead and improves data-transmission efficiency, delivering a 21% gain in effective bandwidth and a 39% reduction in AllGather communication time [6][12]
- The DuoStream strategy overlaps computation with communication so the two proceed simultaneously, significantly raising overall efficiency [8][10]

Group 3: Operator Fusion
- Huawei developed two specialized fused operators, MulAttention and SwiftGMM, to optimize memory access and computation scheduling, producing substantial performance gains in inference tasks [13][14]
- The MulAttention operator accelerates attention computation by 4.5x, while the SwiftGMM operator reduces decoding latency by 48.7% [15][18]

Group 4: Algorithmic Innovations
- The PreMoE algorithm dynamically prunes experts in the MoE model, improving throughput by over 10% while maintaining accuracy [22]
- The TrimR algorithm trims redundant reasoning steps by 14%, and the SpecReason algorithm raises throughput by 30% by using a smaller model to assist the larger one (a minimal sketch of this idea follows below) [23][21]

Group 5: Overall System Performance
- The Ascend 300I Duo platform combines low latency with high throughput, reaching 321 tokens/s under optimal conditions, making it a cost-effective option for a wide range of inference workloads [29][30]
- The comprehensive optimization of the Pangu inference system lays a solid foundation for large-scale deployment and efficient adoption of general-purpose large models [31]
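The summaries describe SpecReason only as using a smaller model to improve the efficiency of the larger one. One common way to realize that idea is draft-and-verify speculative reasoning, sketched below under that assumption; the callables and their signatures (small_model, large_model, verify) are hypothetical placeholders, not the actual Pangu interfaces.

```python
from typing import Callable, List

def speculative_reasoning(problem: str,
                          small_model: Callable[[str], str],
                          large_model: Callable[[str], str],
                          verify: Callable[[str, str], bool],
                          max_steps: int = 16) -> List[str]:
    """Draft-and-verify sketch of speculative reasoning (an assumption about how
    a SpecReason-style scheme could work, not Huawei's implementation).

    small_model(context)  -> cheap draft of the next reasoning step
    large_model(context)  -> expensive, authoritative next step
    verify(context, step) -> True if the large model accepts the drafted step
    """
    steps: List[str] = []
    context = problem
    for _ in range(max_steps):
        draft = small_model(context)
        # Keep the cheap draft when the large model accepts it; otherwise fall back.
        step = draft if verify(context, draft) else large_model(context)
        steps.append(step)
        context = context + "\n" + step
        if step.strip().lower().startswith("answer:"):
            break
    return steps
```

Throughput improves when most drafts are accepted, because the expensive model is then invoked only for verification or the occasional fallback rather than for generating every reasoning step itself.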
Topping the Leaderboard on Its First Entry: How Does Huawei's Pangu Beat Bigger Models with a Smaller One?
虎嗅APP · 2025-05-28 13:34
Core Viewpoint
- The article discusses Huawei's Mixture of Grouped Experts (MoGE) architecture, which refines the traditional Mixture of Experts (MoE) design to improve load balancing and computational efficiency for large AI models [1][2][6]

Summary by Sections

Introduction
- The MoE model has evolved from its academic origins into a competitive force in AI, and Huawei's MoGE architecture represents a significant advance in this line of work [1]

MoGE Architecture
- Huawei's Pangu Pro MoE model has 72 billion total parameters and 16 billion active parameters, achieving superior expert load distribution and computational efficiency [2]
- The model scored 59 on the SuperCLUE leaderboard, placing it among the top domestic models while using fewer parameters than its competitors [2]

Technical Innovations
- MoGE tackles the core load-imbalance problem of traditional MoE with a grouped balanced routing mechanism that activates an equal number of experts within each defined group (a minimal sketch of this routing idea follows below) [6][12]
- This design improves throughput and dynamic scalability, making it suitable for a wide range of applications [12]

Performance Metrics
- The Pangu Pro MoE model delivers significant inference gains, reaching up to 321 tokens/s on the Ascend 300I Duo platform and 1528 tokens/s on the Ascend 800I A2 platform [16]
- The model performs strongly across domains, including reasoning tasks and cross-language benchmarks [17][18]

Practical Applications
- Pangu Pro MoE marks a shift from chasing parameter counts to pursuing practical effectiveness, enabling enterprises to use large models efficiently in real-time scenarios [23]
- Huawei aims to redefine the value of large models and provide a solid foundation for AI applications across industries [23]
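The articles describe MoGE's grouped balanced routing only at a high level: experts are partitioned into equal groups and every token activates the same number of experts in each group, so no group is overloaded. The sketch below illustrates that idea; the tensor shapes, group count, and weight normalization are illustrative assumptions rather than the published Pangu Pro MoE routing code.

```python
import torch

def grouped_balanced_routing(router_logits: torch.Tensor,
                             num_groups: int,
                             k_per_group: int):
    """Group-balanced top-k routing sketch in the spirit of MoGE (illustrative
    assumptions, not Huawei's implementation).

    router_logits: (num_tokens, num_experts) scores from the router.
    Experts are split into num_groups equal groups; each token activates exactly
    k_per_group experts in every group, so expert load stays balanced by design.
    """
    num_tokens, num_experts = router_logits.shape
    assert num_experts % num_groups == 0
    group_size = num_experts // num_groups

    # View logits as (tokens, groups, experts_per_group) and pick top-k inside each group.
    grouped = router_logits.view(num_tokens, num_groups, group_size)
    scores, local_idx = grouped.topk(k_per_group, dim=-1)

    # Convert group-local indices back to global expert ids.
    offsets = torch.arange(num_groups, device=router_logits.device) * group_size
    expert_idx = (local_idx + offsets.view(1, num_groups, 1)).flatten(1)

    # Normalize the selected scores into routing weights for the weighted sum of expert outputs.
    weights = torch.softmax(scores.flatten(1), dim=-1)
    return expert_idx, weights

# Example: 64 experts in 8 groups, 1 expert activated per group -> 8 active experts per token.
logits = torch.randn(4, 64)
expert_idx, weights = grouped_balanced_routing(logits, num_groups=8, k_per_group=1)
```

Because each group contributes the same number of active experts for every token, per-group compute stays even when groups are placed on separate accelerator cards, which is the balanced-load property the articles credit for MoGE's throughput and deployment efficiency.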