Core Viewpoint
- The article discusses advances in Pangu Ultra MoE, a near-trillion-parameter MoE model trained on Ascend NPUs, focusing on its architecture, training methods, and performance improvements [1][3].

Group 1: Model Architecture and Training Innovations
- Pangu Ultra MoE has 718 billion total parameters, of which 39 billion are activated per token; it uses 256 routing experts, and each token activates 8 of them (see the routing sketch below) [5][6].
- The model employs Depth-Scaled Sandwich-Norm (DSSN) and TinyInit to improve training stability, reducing the rate of gradient spikes by 51% (see the sandwich-norm sketch below) [7][11].
- Training uses a dropless strategy, in which no routed tokens are discarded when experts saturate, enabling long-term stable training on more than 10 trillion tokens [1][7].

Group 2: Performance and Efficiency
- The architecture is co-designed for the Ascend NPU platform around computation, communication, and memory metrics, yielding superior training and inference throughput [3][5].
- Pangu Ultra MoE performs strongly across authoritative open-source evaluation sets, outperforming several mainstream models on multiple benchmarks [6][4].

Group 3: Load Balancing and Expert Specialization
- An EP group loss is introduced to keep the load balanced across experts while still permitting expert specialization, improving overall training efficiency (see the balance-loss sketch below) [12][15].
- The design allows flexible routing choices that promote specialization by data domain, evidenced by markedly different expert selections across languages [16][17].

Group 4: Multi-Token Prediction and Reinforcement Learning
- The Multi-Token Prediction (MTP) strategy improves inference efficiency by drafting multiple candidate tokens for the main model to verify, increasing acceptance length by 38% (see the verification sketch below) [20][22].
- The reinforcement learning system addresses training-stability and inference-performance challenges by iteratively mining difficult examples and employing a multi-capability reward system (see the reward sketch below) [24][27].
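
To make the routing numbers concrete, below is a minimal sketch of top-8 gating over 256 experts. The function name, softmax-then-top-k order, and toy shapes are illustrative assumptions, not the Pangu Ultra MoE implementation; under a dropless strategy, every routed token is then processed by its experts rather than dropped on capacity overflow.

```python
# A minimal sketch of top-k expert routing (256 experts, top-8 per token).
# All names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, k=8):
    """Pick top-k experts per token and return normalized gate weights.

    hidden:        (num_tokens, d_model) token representations
    router_weight: (d_model, num_experts) router projection
    """
    logits = hidden @ router_weight                 # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    gate_vals, expert_ids = probs.topk(k, dim=-1)   # (num_tokens, k)
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)  # renormalize
    # Dropless training: all routed tokens are processed downstream; this
    # sketch only covers the gating math.
    return gate_vals, expert_ids

tokens = torch.randn(4, 512)                 # 4 tokens, toy hidden size
w_router = torch.randn(512, 256)             # 256 routing experts
gates, ids = route_tokens(tokens, w_router)  # each token activates 8 experts
print(ids.shape)  # torch.Size([4, 8])
```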
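
The DSSN idea can be pictured as a block that normalizes both before and after each sublayer, with the output norm's gain initialized as a decreasing function of depth. The 1/sqrt(2*num_layers) gain rule, the FFN sublayer, and the TinyInit-style small-std initialization below are all assumptions for illustration; the article only states that DSSN and TinyInit cut gradient spikes by 51%.

```python
# A minimal sketch of a sandwich-norm block with a depth-scaled output-norm
# gain plus TinyInit-style small weight initialization (both assumed rules).
import math
import torch
import torch.nn as nn

class SandwichNormBlock(nn.Module):
    def __init__(self, d_model: int, num_layers: int):
        super().__init__()
        self.pre_norm = nn.LayerNorm(d_model)    # normalize sublayer input
        self.post_norm = nn.LayerNorm(d_model)   # normalize sublayer output
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Depth-scaled gain keeps each block's residual contribution small at
        # initialization (assumed scaling rule).
        nn.init.constant_(self.post_norm.weight, 1.0 / math.sqrt(2 * num_layers))
        # TinyInit-style small-std weight init (assumed rule).
        for lin in (self.ffn[0], self.ffn[2]):
            nn.init.normal_(lin.weight, std=0.02 / math.sqrt(num_layers))
            nn.init.zeros_(lin.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Sandwich" placement: norm -> sublayer -> norm -> residual add.
        return x + self.post_norm(self.ffn(self.pre_norm(x)))

block = SandwichNormBlock(d_model=512, num_layers=48)  # toy depth
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```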
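
For the EP group loss, a reasonable reading is an auxiliary balance loss evaluated only over the tokens routed within one expert-parallel (EP) group, so experts are balanced locally without forcing globally uniform routing. The article does not give the formula; the fraction-of-tokens times mean-router-probability form below is the common Switch-style auxiliary loss, used here as an assumption.

```python
# A minimal sketch of an auxiliary load-balancing loss over one EP group's
# tokens; the exact EP group loss formulation is assumed, not quoted.
import torch

def ep_group_balance_loss(router_probs: torch.Tensor,
                          expert_ids: torch.Tensor,
                          num_experts: int) -> torch.Tensor:
    """router_probs: (tokens_in_group, num_experts) softmax router outputs.
    expert_ids:   (tokens_in_group, k) experts selected for each token.
    """
    num_tokens, k = expert_ids.shape
    # f[e]: fraction of this group's routed slots assigned to expert e.
    counts = torch.zeros(num_experts)
    counts.scatter_add_(0, expert_ids.reshape(-1),
                        torch.ones(num_tokens * k))
    f = counts / (num_tokens * k)
    # p[e]: mean router probability the group assigns to expert e.
    p = router_probs.mean(dim=0)
    # Minimized by uniform routing within the group; balancing per EP group
    # rather than globally leaves room for expert specialization.
    return num_experts * (f * p).sum()

probs = torch.softmax(torch.randn(32, 256), dim=-1)  # 32 tokens in the group
ids = probs.topk(8, dim=-1).indices                  # top-8 routing
print(ep_group_balance_loss(probs, ids, 256))
```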
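
The acceptance-length metric behind the MTP result can be illustrated with the verification step of speculative decoding: the MTP head drafts several tokens, the main model scores them all in one pass, and the longest matching prefix is accepted. Greedy matching below is an illustrative simplification of whatever acceptance rule the model actually uses; a 38% longer accepted prefix means proportionally fewer sequential main-model decode steps.

```python
# A minimal sketch of draft-token verification; greedy acceptance is an
# assumed simplification, not the published acceptance rule.
import torch

def accepted_prefix_len(draft_tokens: torch.Tensor,
                        main_logits: torch.Tensor) -> int:
    """draft_tokens: (n,) candidate tokens proposed by the MTP head.
    main_logits:  (n, vocab) main-model logits at the same positions.
    """
    main_choice = main_logits.argmax(dim=-1)        # greedy main-model tokens
    matches = (draft_tokens == main_choice).long()
    # Accept drafted tokens up to (not including) the first mismatch.
    return int(matches.cumprod(dim=0).sum())

draft = torch.tensor([5, 17, 17, 3])
logits = torch.randn(4, 100)
logits[torch.arange(4), torch.tensor([5, 17, 9, 3])] = 10.0  # force argmaxes
print(accepted_prefix_len(draft, logits))  # 2: the third draft token differs
```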
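
Finally, the multi-capability reward system can be pictured as a weighted combination of capability-specific scores. The capability names and weights below are purely illustrative assumptions; the article only states that such a system is used alongside hard-example mining.

```python
# A minimal sketch of folding several capability-specific rewards into one
# scalar for RL training; names and weights are hypothetical.
def combined_reward(scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted sum of per-capability reward scores."""
    return sum(weights[k] * scores[k] for k in weights)

print(combined_reward(
    {"math": 0.8, "code": 0.6, "general": 0.9},   # per-capability scores
    {"math": 0.4, "code": 0.4, "general": 0.2},   # assumed weighting
))  # 0.74
```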
Pangu Ultra's near-trillion-parameter MoE model: industry-leading, built on Ascend-native long-term stable training
Yicai (第一财经) · 2025-05-29 10:50