Huawei Pangu makes its debut: an Ascend-native 72B MoE architecture, tied for first place in China among sub-100B-parameter models on SuperCLUE
机器之心·2025-05-28 08:09

Core Insights
- The article discusses the Mixture of Grouped Experts (MoGE) model from Huawei's Pangu team, which addresses the load-imbalance inefficiencies of traditional Mixture of Experts (MoE) models by ensuring a balanced computational load across devices [2][6][31]
- Pangu Pro MoE, built on the MoGE architecture, has demonstrated strong performance on industry benchmarks, scoring 59 on the SuperCLUE leaderboard with only 72 billion parameters and competing with considerably larger models [3][26]

Technical Innovations
- MoGE introduces a grouping mechanism in the expert-selection phase: each token activates an equal number of experts within each predefined group, so the computational load stays balanced across devices (see the routing sketch after this summary) [2][12]
- The architecture uses a batch-level auxiliary loss to keep expert activation balanced, improving overall model efficiency (see the loss sketch after this summary) [16][18]

Performance Metrics
- Pangu Pro MoE reaches a throughput of 321 tokens/s on the Ascend 300I Duo platform and 1528 tokens/s on the Ascend 800I A2 platform, significantly outperforming other models of similar scale [24]
- The model exhibits a nearly uniform expert load distribution, with each expert handling approximately 12.5% of the total token volume, indicating efficient resource utilization [29]

Industry Impact
- The introduction of Pangu Pro MoE marks a shift from a "parameter arms race" toward practical applications, reducing cloud inference costs and supporting high-concurrency, real-time scenarios [31]
- Huawei's innovations in the AI field aim to redefine the value of large models, giving enterprises a robust foundation for deploying billion-parameter-scale models effectively [31]
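
To make the grouped routing described under Technical Innovations concrete, here is a minimal sketch of group-wise top-k expert selection. It is not Huawei's implementation: the hidden size, expert count (64), group count (8), and per-group top-k (1) are illustrative assumptions, and the class name `GroupedTopKRouter` is hypothetical.

```python
# Illustrative sketch of grouped top-k routing in the spirit of MoGE.
# Dimensions and hyperparameters are assumed values, not Pangu Pro MoE's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedTopKRouter(nn.Module):
    def __init__(self, hidden_dim=1024, num_experts=64, num_groups=8, k_per_group=1):
        super().__init__()
        assert num_experts % num_groups == 0
        self.num_groups = num_groups
        self.group_size = num_experts // num_groups
        self.k_per_group = k_per_group
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x):
        # x: (num_tokens, hidden_dim)
        logits = self.gate(x)                                        # (T, E)
        scores = logits.view(-1, self.num_groups, self.group_size)   # (T, G, E/G)
        # Each token picks exactly k experts in *every* group, so the number of
        # activated experts per group (and per device hosting a group) is identical.
        topk_scores, topk_idx = scores.topk(self.k_per_group, dim=-1)  # (T, G, k)
        # Map within-group indices back to global expert ids.
        offset = torch.arange(self.num_groups, device=x.device) * self.group_size
        expert_idx = topk_idx + offset.view(1, -1, 1)                # (T, G, k)
        # Normalize routing weights over the selected experts only.
        weights = F.softmax(topk_scores.flatten(1), dim=-1)          # (T, G*k)
        return expert_idx.flatten(1), weights

# Usage: 16 tokens routed to 8 experts each (one per group).
router = GroupedTopKRouter()
ids, w = router(torch.randn(16, 1024))   # ids: (16, 8), w: (16, 8)
```

Because every token contributes the same number of activations to every group, placing one group of experts per device yields the near-uniform per-expert load noted under Performance Metrics.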
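
The batch-level auxiliary loss mentioned above can be illustrated with a generic Switch-Transformer-style balance term computed over a batch of routing decisions. The article does not give Pangu Pro MoE's exact formulation, so the function below is a hedged sketch of the general idea, with the helper name `batch_balance_loss` chosen here for illustration.

```python
# Generic batch-level load-balancing auxiliary loss (Switch-Transformer style),
# shown only to illustrate batch-level balancing; not Pangu Pro MoE's exact loss.
import torch
import torch.nn.functional as F

def batch_balance_loss(router_logits, expert_idx, num_experts):
    """router_logits: (T, E) raw gate outputs for a batch of T tokens.
    expert_idx:     (T, K) global ids of the experts each token was routed to."""
    probs = F.softmax(router_logits, dim=-1)                  # (T, E)
    # f_e: fraction of routed token-slots assigned to each expert in this batch.
    dispatch = F.one_hot(expert_idx, num_experts).float()     # (T, K, E)
    f = dispatch.sum(dim=(0, 1)) / dispatch.sum()             # (E,)
    # P_e: mean router probability per expert over the batch.
    p = probs.mean(dim=0)                                     # (E,)
    # Minimized (value 1) when both f and p are uniform at 1/E per expert.
    return num_experts * torch.sum(f * p)
```

Computing the term over the whole batch, rather than per sequence, pushes expert usage toward uniformity without penalizing natural variation within any single sequence.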