Mixture of Grouped Experts (MoGE)
Huawei's Pangu model open-sourced for the first time! 1,148 tokens/s output on a single Ascend card; 16B activated parameters hold their own against 32B dense models
量子位 · 2025-07-02 09:33
Core Viewpoint
- Huawei's Pangu Pro MoE model has been open-sourced, featuring 72 billion total parameters and matching 32-billion-parameter dense models on both Chinese and English understanding and reasoning benchmarks [1][8].

Model Performance
- Pangu Pro MoE has 72 billion total parameters, of which 16 billion are activated per token, i.e. 22.2% of the total [8].
- Across a range of tests it performs on par with 32B dense models, posting notable scores on benchmarks such as MMLU and DROP [9][11][12].
- Specifically, it scored 82.6 on MMLU-PRO, surpassing the compared models, and 91.1 on C-Eval for Chinese tasks, outperforming Qwen3-32B [10][12].

Inference Efficiency
- With W8A8 quantization, the model reaches an average input (prefill) throughput of 4,828 tokens/s on a single card, a 203% improvement over the 72B dense baseline and 42% over the 32B dense baseline [17].
- In the decode phase it reaches an output throughput of 1,148 tokens/s, ahead of both the 72B and 32B dense models [19].

Architecture Innovations
- Pangu Pro MoE introduces a new MoE architecture optimized for Ascend chips, the Mixture of Grouped Experts (MoGE), which balances load across devices by construction [22][24].
- Its training and inference infrastructure has been adapted specifically to the Ascend cluster, improving communication efficiency and reducing overhead [30][32].

Quantization and Optimization
- The model employs expert-aware post-training quantization and KV cache compression to improve inference efficiency while preserving accuracy (illustrative sketches of both follow this summary) [37][38].
- Operator fusion techniques improve memory-bandwidth utilization, significantly accelerating attention operations [39][41].

Technical Reports and Resources
- Technical reports in both Chinese and English have been published, detailing the model's architecture and performance metrics [4][45].
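The report attributes much of the single-card throughput to W8A8 quantization, i.e. 8-bit weights and 8-bit activations. The expert-aware PTQ recipe itself is not detailed in the article, so the following is only a generic sketch of symmetric W8A8 quantization applied to one linear layer; the function names and scale granularity (per-channel weights, per-tensor activations) are our illustrative assumptions.

```python
# Generic W8A8 sketch -- NOT Pangu's published expert-aware PTQ method.
import torch

def quantize_symmetric(x: torch.Tensor, dim=None):
    """Symmetric int8 quantization; returns (int8 tensor, fp32 scale)."""
    if dim is None:
        amax = x.abs().max().clamp(min=1e-8)            # one scale per tensor
    else:
        amax = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8)
    scale = amax / 127.0
    q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x: [batch, in], w: [out, in]; approximates x @ w.T in int8."""
    qw, sw = quantize_symmetric(w, dim=1)    # per-output-channel weight scales
    qx, sx = quantize_symmetric(x)           # per-tensor activation scale
    acc = qx.to(torch.int32) @ qw.t().to(torch.int32)   # integer matmul, int32 accumulate
    return acc.float() * sx * sw.squeeze(1)             # rescale back to fp32

x, w = torch.randn(4, 64), torch.randn(128, 64)
err = (w8a8_linear(x, w) - x @ w.t()).abs().mean()
print(f"mean abs error vs fp32: {err:.4f}")
```

Halving weights and activations to 8 bits cuts both memory traffic and matmul cost, which is consistent with the prefill throughput gains the article reports.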
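The article also mentions KV cache compression without specifying the method. One common form is quantizing the cached keys and values to int8, sketched below; the class and its per-token scale scheme are purely illustrative assumptions.

```python
# Illustrative int8 KV-cache sketch -- the article does not disclose
# Pangu Pro MoE's actual compression scheme.
import torch

class Int8KVCache:
    """Stores attention keys/values as int8 with one fp scale per token."""
    def __init__(self):
        self.q, self.scales = [], []

    def append(self, kv: torch.Tensor):              # kv: [heads, head_dim]
        scale = kv.abs().max().clamp(min=1e-8) / 127.0
        self.q.append(torch.round(kv / scale).clamp(-127, 127).to(torch.int8))
        self.scales.append(scale)

    def read(self) -> torch.Tensor:                  # dequantize on use
        q = torch.stack(self.q)                      # [seq, heads, head_dim]
        s = torch.stack(self.scales).view(-1, 1, 1)
        return q.float() * s

cache = Int8KVCache()
for _ in range(4):
    cache.append(torch.randn(8, 64))
print(cache.read().shape)  # torch.Size([4, 8, 64]), ~1/4 the fp32 footprint
```

Since decode throughput is typically bound by reading the KV cache from memory, shrinking it directly raises tokens/s, in line with the 1,148 tokens/s decode figure cited above.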
Huawei Pangu's first public showing: an Ascend-native 72B MoE architecture, tied for first place in China among sub-100B models on SuperCLUE
雷峰网 · 2025-05-28 12:06
Core Insights
- The article frames the Pangu Pro MoE model as part of a transition of large models from a "parameter arms race" toward effectiveness, using a new architecture to improve efficiency and performance in large-scale applications [29].

Group 1: Model Architecture and Innovations
- Pangu Pro MoE has 72 billion total parameters and 16 billion active parameters, and scores 59 on the SuperCLUE benchmark, tying for first place in China among models under 100 billion parameters [2][3].
- The model's Mixture of Grouped Experts (MoGE) architecture groups the experts and activates a fixed number within each group for every token, balancing computational load across devices (see the routing sketch at the end of this summary) [8][14].
- MoGE addresses an inefficiency of traditional MoE models, whose uneven expert activation can bottleneck overall system throughput [7][13].

Group 2: Performance Metrics
- On the Ascend 300I Duo platform, Pangu Pro MoE reaches a throughput of 321 tokens/s; on the Ascend 800I A2 platform it reaches up to 1,528 tokens/s under high concurrency [21][22].
- Compared with other large models, it leads on complex reasoning tasks, outperforming Qwen3-32B and GLM4-Z1-32B across various benchmarks [25][26].

Group 3: Industry Impact and Applications
- The model is expected to lower cloud inference costs and support high-concurrency, real-time scenarios, making it suitable for enterprise-level applications [29].
- Its lightweight inference engine is designed for Huawei's Ascend series chips, enabling deployment of large-scale models in a range of AI applications [29].
- These advances signal a shift in the AI industry toward practical applications, emphasizing efficiency and accessibility for businesses [29].
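To make the grouping idea concrete, here is a minimal routing sketch based only on the description above: experts are partitioned into equal groups (e.g. one group per device) and each token selects the top-k experts within every group, so each device handles the same number of activated experts per token regardless of the token distribution. The sizes, function names, and softmax placement are illustrative assumptions, not Pangu's published configuration.

```python
# Minimal MoGE-style grouped top-k routing sketch (illustrative assumptions).
import torch
import torch.nn.functional as F

def moge_route(logits: torch.Tensor, n_groups: int, k_per_group: int):
    """logits: [tokens, n_experts] router scores.
    Returns (weights, indices) of the activated experts per token."""
    t, e = logits.shape
    assert e % n_groups == 0
    g = logits.view(t, n_groups, e // n_groups)     # split experts into groups
    top_v, top_i = g.topk(k_per_group, dim=-1)      # top-k inside each group
    # Map group-local indices back to global expert ids.
    offset = torch.arange(n_groups).unsqueeze(-1) * (e // n_groups)
    idx = (top_i + offset).reshape(t, -1)           # [tokens, n_groups * k]
    w = F.softmax(top_v.reshape(t, -1), dim=-1)     # normalize gate weights
    return w, idx

# 64 experts in 8 groups, 1 expert per group: every token activates exactly
# one expert on each of the 8 devices, so per-device load is balanced by
# construction rather than by an auxiliary loss.
w, idx = moge_route(torch.randn(5, 64), n_groups=8, k_per_group=1)
print(idx)  # each row contains one expert id per group
```

A plain top-k router over all 64 experts could send every selected expert to the same device for some batches; constraining the top-k to run per group removes that failure mode, which is the load-balancing property the article highlights.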