Huawei's Pangu large model open-sourced for the first time! 1148 tokens per second on a single Ascend card; 16B activated parameters rival 32B dense models
QbitAI (量子位) · 2025-07-02 09:33

Core Viewpoint
- Huawei's Pangu Pro MoE model has been open-sourced; it has 72 billion total parameters and performs on par with 32B-parameter dense models on both Chinese and English understanding and reasoning benchmarks [1][8].

Model Performance
- Pangu Pro MoE has 72 billion total parameters, of which 16 billion are activated per token (16/72 ≈ 22.2%) [8].
- Across a range of tests it performs comparably to 32B dense models, posting notable scores on benchmarks such as MMLU and DROP [9][11][12].
- It scored 82.6 on MMLU-PRO, surpassing the compared models, and 91.1 on C-Eval for Chinese tasks, outperforming Qwen3-32B [10][12].

Inference Efficiency
- With W8A8 quantization, the model reaches an average input (prefill) throughput of 4828 tokens per second on a single Ascend card, a 203% improvement over a 72B dense model and a 42% improvement over a 32B dense model [17].
- In the decode phase it reaches an output throughput of 1148 tokens per second, again ahead of both the 72B and 32B dense baselines [19].

Architecture Innovations
- Pangu Pro MoE introduces a new MoE architecture optimized for Ascend chips: Mixture of Grouped Experts (MoGE), which partitions the experts into groups and routes each token within every group, so that compute load stays balanced across the devices hosting the groups [22][24]; a routing sketch follows at the end of this summary.
- The training and inference stacks have been adapted specifically for Ascend clusters, improving communication efficiency and reducing overhead [30][32].

Quantization and Optimization
- The model uses expert-aware post-training quantization and KV cache compression to improve inference efficiency while preserving accuracy [37][38]; see the quantization and KV-cache sketches below.
- Operator fusion raises memory-bandwidth utilization, yielding significant speedups in the attention operators [39][41]; a fused-attention sketch closes this summary.

Technical Reports and Resources
- Technical reports in both Chinese and English have been published, detailing the model's architecture and performance metrics [4][45].
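To make the grouped routing concrete, here is a minimal sketch of group-wise top-k expert selection, the load-balancing idea behind MoGE described above. The group count, k, and shapes are illustrative assumptions, not Pangu Pro MoE's actual configuration, and the real router also handles auxiliary losses and device placement.

```python
import torch
import torch.nn.functional as F

def moge_route(logits: torch.Tensor, n_groups: int, k_per_group: int):
    """Grouped top-k routing sketch: experts are split into equal groups
    and each token picks k experts *within every group*, so every group
    (and hence every device hosting a group) receives the same load."""
    n_tokens, n_experts = logits.shape
    assert n_experts % n_groups == 0
    group_size = n_experts // n_groups

    # View router logits as (tokens, groups, experts-per-group).
    grouped = logits.view(n_tokens, n_groups, group_size)

    # Top-k *within each group* -> per-group balanced selection.
    weights, local_idx = grouped.topk(k_per_group, dim=-1)

    # Convert group-local indices back to global expert ids.
    offsets = torch.arange(n_groups, device=logits.device) * group_size
    global_idx = local_idx + offsets.view(1, n_groups, 1)

    # Normalize the selected weights so they sum to 1 per token.
    weights = F.softmax(weights.reshape(n_tokens, -1), dim=-1)
    return weights, global_idx.reshape(n_tokens, -1)

# Hypothetical example: 64 experts in 8 groups, 2 picked per group per token.
logits = torch.randn(4, 64)
w, idx = moge_route(logits, n_groups=8, k_per_group=2)
print(w.shape, idx.shape)  # torch.Size([4, 16]) torch.Size([4, 16])
```

Because every token draws exactly k experts from each group, no group can be over-subscribed, which is the property that lets the groups be spread evenly across Ascend devices.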
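The W8A8 figures above refer to 8-bit weights and 8-bit activations. Below is a minimal sketch of generic symmetric int8 weight/activation quantization; it does not reproduce the "expert-aware" calibration described in the report, and all shapes are illustrative.

```python
import torch

def quantize_w8(w: torch.Tensor):
    """Per-output-channel symmetric int8 weight quantization (the 'W8'
    half): scale each row so its max magnitude maps to 127."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0 + 1e-12
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def quantize_a8(x: torch.Tensor):
    """Per-tensor dynamic int8 activation quantization (the 'A8' half)."""
    scale = x.abs().amax() / 127.0 + 1e-12
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

# int8 matmul emulated in int32 on CPU, then rescaled back to float.
w = torch.randn(256, 128)
x = torch.randn(4, 128)
qw, sw = quantize_w8(w)
qx, sx = quantize_a8(x)
y_int = qx.to(torch.int32) @ qw.t().to(torch.int32)
y = y_int.to(torch.float32) * sx * sw.t()
print((y - x @ w.t()).abs().max())  # small quantization error
```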
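KV cache compression can be illustrated the same way: storing keys and values in int8 with a per-block scale cuts cache memory roughly 4x versus fp32 (2x versus fp16). This is a generic quantized-cache sketch under assumed shapes, not Pangu's actual compression scheme.

```python
import torch

class Int8KVCache:
    """Minimal sketch of a compressed KV cache: each appended block of
    keys/values is stored as int8 plus one float scale."""
    def __init__(self):
        self.blocks = []  # list of (q_k, s_k, q_v, s_v)

    def append(self, k: torch.Tensor, v: torch.Tensor):
        def q(t):
            scale = t.abs().amax() / 127.0 + 1e-12
            qt = torch.clamp(torch.round(t / scale), -128, 127).to(torch.int8)
            return qt, scale
        self.blocks.append((*q(k), *q(v)))

    def materialize(self):
        # Dequantize on read; attention then runs on the float tensors.
        ks = torch.cat([qk.float() * sk for qk, sk, _, _ in self.blocks])
        vs = torch.cat([qv.float() * sv for _, _, qv, sv in self.blocks])
        return ks, vs

cache = Int8KVCache()
for _ in range(3):  # three decode steps of (seq=1, heads=8, dim=64)
    cache.append(torch.randn(1, 8, 64), torch.randn(1, 8, 64))
k, v = cache.materialize()
print(k.shape, v.shape)  # torch.Size([3, 8, 64]) twice
```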
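Finally, operator fusion cuts memory round-trips by computing a chain of operators in one kernel instead of writing each intermediate out to memory. The PyTorch sketch below contrasts naive attention, which materializes the full score matrix, with a fused kernel that computes the same result; it illustrates the principle only, not Huawei's Ascend-specific fused operators.

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64)  # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Unfused attention: scores and softmax are each written to and re-read
# from memory, so the chain is memory-bandwidth-bound.
scores = q @ k.transpose(-2, -1) / (64 ** 0.5)
out_naive = F.softmax(scores, dim=-1) @ v

# Fused attention: one kernel produces the same output without
# materializing the full score matrix, raising bandwidth utilization.
out_fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_naive, out_fused, atol=1e-5))  # True
```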