Near-Trillion-Parameter MoE Model

Pangu Ultra, a near-trillion-parameter MoE model: industry-leading, built on Ascend-native long-term stable training
雷峰网 · 2025-05-29 11:44
Core Viewpoint
- Huawei's Pangu Ultra MoE model, with a parameter scale of 718 billion, marks a significant advance in training ultra-large sparse models, balancing model performance against efficiency [5][8].

Group 1: Model Architecture and Training Innovations
- Pangu Ultra MoE employs a Depth-Scaled Sandwich-Norm (DSSN) architecture and the TinyInit initialization method, enabling stable training over more than 10 trillion tokens [9][12]; a hedged sketch of both ideas follows this summary.
- The model uses an EP-group load-balancing loss to keep load balanced across experts while strengthening their specialization [15][19].
- The architecture also integrates Multi-head Latent Attention (MLA) and Multi-token Prediction (MTP) to raise training efficiency and inference speed [6][23]; see the MTP sketch below.

Group 2: Performance Metrics and Comparisons
- Pangu Ultra MoE has 718 billion total parameters, of which 39 billion are activated per token, and outperforms existing models across a range of benchmarks [8][21].
- Training stability improves through a 51% reduction in the gradient-spike rate, which speeds convergence and lifts overall performance [14][12].

Group 3: Load Balancing and Expert Specialization
- The EP-Group load-balancing loss permits more flexible routing of tokens to experts, promoting specialization without sacrificing computational efficiency [19][20]; a sketch of this loss follows the summary.
- The architecture provides 256 routed experts, with each token activating 8 of them, spreading the computational load [5][7].

Group 4: Reinforcement Learning and Multi-capability Training
- The training system combines iterative hard-example mining with a multi-capability reward system to improve performance on tasks such as mathematics and coding [28][32]; a toy sketch of both appears at the end of this section.
- The reinforcement-learning setup keeps inference efficient while balancing the growth of different capabilities [29][32].
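
The article names DSSN and TinyInit but gives no formulas. The sketch below shows one plausible reading in PyTorch: sandwich norm wraps each sublayer in RMSNorm on both input and output, with the post-norm gain and the weight-init standard deviation shrunk as depth grows. The `RMSNorm`/`DSSNBlock`/`tiny_init_` names and the exact 1/sqrt scaling rules are illustrative assumptions, not Huawei's published definitions.

```python
import math
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm with a configurable initial gain."""
    def __init__(self, dim: int, init_gain: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.full((dim,), init_gain))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class DSSNBlock(nn.Module):
    """One transformer sublayer wrapped in depth-scaled sandwich norm.

    Sandwich norm = normalize both the sublayer input (pre-norm) and
    its output (post-norm) before the residual add. "Depth-scaled" is
    read here as starting the post-norm gain near 1/sqrt(2*num_layers),
    damping each layer's early contribution (assumed scaling rule).
    """
    def __init__(self, dim: int, sublayer: nn.Module, num_layers: int):
        super().__init__()
        post_gain = 1.0 / math.sqrt(2.0 * num_layers)  # assumption
        self.pre_norm = RMSNorm(dim)
        self.post_norm = RMSNorm(dim, init_gain=post_gain)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))

def tiny_init_(linear: nn.Linear, dim: int, num_layers: int) -> None:
    """TinyInit-style init (assumed form): Gaussian with std shrunk by
    both model width and depth, keeping early activations small."""
    std = math.sqrt(2.0 / (dim * num_layers))  # assumption
    nn.init.normal_(linear.weight, mean=0.0, std=std)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
```

Damping each layer's residual contribution at initialization is a common route to fewer gradient spikes in very deep stacks, which is consistent with the 51% spike-rate reduction claimed above.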
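For the EP-Group loss, the article only states that balancing is enforced at the scope of an expert-parallel group rather than per micro-batch. A minimal sketch, adapting the standard auxiliary balance loss and pooling its statistics over an assumed EP process group; the function name, the `AVG` all-reduce, and the final normalization are illustrative choices:

```python
import torch
import torch.distributed as dist

def ep_group_balance_loss(
    router_logits: torch.Tensor,  # [num_tokens, num_experts]
    top_k: int = 8,               # Pangu Ultra MoE: 8 of 256 experts per token
    ep_group=None,                # a torch.distributed process group, or None
) -> torch.Tensor:
    """Auxiliary load-balancing loss with statistics pooled over an
    expert-parallel (EP) group (assumed reading of "EP-Group loss").

    Standard per-batch aux loss: L = E * sum_i f_i * p_i, where
      f_i = fraction of token->expert assignments going to expert i,
      p_i = mean router probability of expert i.
    Pooling f_i and p_i across the EP group balances load at the scope
    where dispatch happens while leaving per-batch routing freer, which
    the article says helps expert specialization.
    """
    num_experts = router_logits.size(-1)
    probs = router_logits.softmax(dim=-1)         # [T, E]
    top_idx = probs.topk(top_k, dim=-1).indices   # [T, k]

    # f: dispatch fraction per expert; p: mean router probability.
    dispatch = torch.zeros_like(probs).scatter_(1, top_idx, 1.0)
    f = dispatch.mean(dim=0)                      # [E]
    p = probs.mean(dim=0)                         # [E]

    # Average the statistics across the EP group instead of per batch.
    if ep_group is not None and dist.is_initialized():
        dist.all_reduce(f, op=dist.ReduceOp.AVG, group=ep_group)
        dist.all_reduce(p, op=dist.ReduceOp.AVG, group=ep_group)

    return num_experts * torch.sum(f * p) / top_k
```

Passing `ep_group=None` recovers the usual per-micro-batch loss, so the group scope is the only thing this variant changes.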
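Multi-token Prediction is likewise only named in the article. The sketch below shows the generic idea: an auxiliary head trained to predict one extra step ahead, so decoding can later be accelerated speculatively. The fusion layer, the 0.3 loss weight, and the two-step horizon are assumptions, not Pangu Ultra MoE's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHead(nn.Module):
    """Minimal multi-token-prediction (MTP) head: alongside the usual
    next-token head, a small extra head predicts the token two steps
    ahead from the hidden state fused with the (teacher-forced)
    embedding of the next token."""
    def __init__(self, dim: int, vocab: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)  # fuse state + next-token emb
        self.head = nn.Linear(dim, vocab)

    def forward(self, hidden: torch.Tensor, next_emb: torch.Tensor):
        fused = self.proj(torch.cat([hidden, next_emb], dim=-1))
        return self.head(fused)              # logits for token t+2

def mtp_loss(main_logits, mtp_logits, tokens, mtp_weight: float = 0.3):
    """Joint loss: standard next-token cross-entropy plus a weighted
    cross-entropy on the extra-depth prediction (weight is an assumed
    hyperparameter)."""
    main = F.cross_entropy(main_logits[:, :-1].flatten(0, 1),
                           tokens[:, 1:].flatten())
    extra = F.cross_entropy(mtp_logits[:, :-2].flatten(0, 1),
                            tokens[:, 2:].flatten())
    return main + mtp_weight * extra
```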
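Group 4's training loop is described only at a high level. Below is a toy sketch of what a "multi-capability reward" and "iterative hard-example mining" could look like; the domain names, thresholds, and reward values are invented purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    domain: str        # e.g. "math", "code", "general"
    pass_rate: float   # fraction of sampled rollouts judged correct

def multi_capability_reward(domain: str, correct: bool,
                            pref_score: float) -> float:
    """Toy multi-capability reward: verifiable domains (math, code) use
    rule-based correctness; open-ended ones use a preference-model
    score in [-1, 1]. Domains and values are assumptions."""
    if domain in ("math", "code"):
        return 1.0 if correct else -1.0
    return pref_score

def mine_hard_examples(batch: list[Sample],
                       lo: float = 0.1, hi: float = 0.8) -> list[Sample]:
    """Iterative hard-example mining (assumed criterion): keep prompts
    the current policy sometimes solves but often fails, drop ones that
    are trivially easy or hopelessly hard, and reuse the kept set in
    the next training round."""
    return [s for s in batch if lo < s.pass_rate < hi]
```

Gating prompts by pass rate is one common way to keep an RL curriculum focused where the gradient signal is informative, matching the article's claim of balanced capability growth.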