Pangu Ultra MoE Model
Cracking an Advanced Math Problem Every Two Seconds! Huawei Finally Unveils the Full Workflow of Its Near-Trillion-Parameter MoE Training System on Ascend
华尔街见闻· 2025-05-30 09:38
Core Viewpoint
- Huawei has achieved significant advancements in training large models through its "Ascend + Pangu Ultra MoE" system, demonstrating a fully domestic, GPU-free training process that improves computational efficiency and model performance [3][4][38].

Group 1: Technical Innovations
- Huawei's training system reached a model FLOPs utilization (MFU) of 41% during the pre-training phase on the Ascend Atlas 800T A2 cluster [4][38] (a minimal MFU-estimation sketch follows this summary).
- The Pangu Ultra MoE model has 718 billion parameters and a 61-layer architecture, 58 of which are MoE layers, designed for high performance and scalability [38][39].
- The system sustains a throughput of 35K tokens/s during the reinforcement learning (RL) post-training phase, showcasing its capability to process complex tasks rapidly [39].

Group 2: Challenges Addressed
- The report identifies six key challenges in current MoE pre-training and RL post-training, including difficulties in parallel strategy configuration, communication bottlenecks, and uneven system load distribution [7][10][12][13].
- Huawei has developed a comprehensive end-to-end solution to these challenges, focusing on optimizing training cluster utilization and improving communication efficiency [14][16][25].

Group 3: Specific Solutions
- The first strategy improves training cluster utilization through intelligent parallel strategy selection and global dynamic load balancing, significantly raising overall training efficiency [16][23].
- The second strategy releases computational power at the single-node level by optimizing training operators and improving memory management, doubling the micro-batch size [26][30].
- The third strategy introduces high-performance, scalable RL post-training technologies, allowing flexible deployment modes and doubling the utilization rate of RL post-training clusters [33][34].
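For context on what an MFU figure like 41% measures, here is a minimal sketch of the usual model-FLOPs-utilization calculation: achieved model FLOPs per second divided by the cluster's peak FLOPs. The 6 × activated-parameters FLOPs-per-token approximation, the chip count, the per-chip peak throughput, and the token throughput below are illustrative assumptions, not figures from the report.

```python
def estimate_mfu(tokens_per_second: float,
                 activated_params: float,
                 num_chips: int,
                 peak_flops_per_chip: float) -> float:
    """Rough MFU estimate: achieved model FLOPs / peak cluster FLOPs.

    Uses the common ~6 * N_activated FLOPs-per-token approximation for a
    decoder-only model (forward + backward); real accounting differs.
    """
    flops_per_token = 6 * activated_params          # rough dense-equivalent cost per token
    achieved_flops = tokens_per_second * flops_per_token
    peak_flops = num_chips * peak_flops_per_chip
    return achieved_flops / peak_flops

# Illustrative (hypothetical) numbers only -- not values from the report:
if __name__ == "__main__":
    mfu = estimate_mfu(
        tokens_per_second=2.0e6,        # assumed cluster-wide training throughput
        activated_params=39e9,          # activated parameters per token (39B)
        num_chips=4000,                 # assumed number of NPUs in the cluster
        peak_flops_per_chip=3.2e14,     # assumed peak dense BF16 FLOPs per chip
    )
    print(f"Estimated MFU: {mfu:.1%}")
```

Real MFU accounting also folds in attention FLOPs and the exact parallelism layout, so this is only a back-of-the-envelope estimate of the metric's definition.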
Pangu Ultra Near-Trillion-Parameter MoE Model: Industry-Leading, Built on Ascend-Native Long-Term Stable Training
雷峰网· 2025-05-29 11:44
Core Viewpoint
- Huawei's Pangu Ultra MoE model, with a parameter scale of 718 billion, represents a significant advancement in the training of ultra-large sparse models, achieving a balance between model performance and efficiency [5][8].

Group 1: Model Architecture and Training Innovations
- Pangu Ultra MoE employs a Depth-Scaled Sandwich-Norm (DSSN) architecture and the TinyInit initialization method, enabling stable training over more than 10 trillion tokens [9][12] (a generic sandwich-norm sketch follows this summary).
- The model uses an EP loss optimization method to ensure load balancing among experts while enhancing their specialization [15][19].
- The architecture integrates mechanisms such as Multi-head Latent Attention (MLA) and Multi-token Prediction (MTP) to improve training efficiency and inference speed [6][23].

Group 2: Performance Metrics and Comparisons
- Pangu Ultra MoE has a total parameter count of 718 billion, with 39 billion activated parameters, and demonstrates superior performance across various benchmarks compared to existing models [8][21].
- Training stability is enhanced by reducing the gradient spike rate by 51%, which contributes to faster convergence and better overall performance [14][12].

Group 3: Load Balancing and Expert Specialization
- The EP-Group load balancing loss function allows more flexible routing of tokens to experts, promoting specialization without compromising computational efficiency [19][20].
- The architecture accommodates 256 routing experts, with each token activating 8 of them, optimizing the distribution of computational load [5][7].

Group 4: Reinforcement Learning and Multi-capability Training
- The training system incorporates iterative hard example mining and a multi-capability reward system to improve performance across tasks such as mathematics and coding [28][32].
- The reinforcement learning approach maintains high inference efficiency while balancing the growth of different capabilities [29][32].
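The summaries name DSSN and TinyInit but do not spell out their formulas. A sandwich-norm transformer block generally wraps each sublayer in both a pre-norm and a post-norm, with the residual branch damped as the stack deepens. The PyTorch sketch below illustrates that general pattern under stated assumptions; the 1/sqrt(2·num_layers) scaling rule and the small-standard-deviation "TinyInit"-style initialization are assumptions for illustration, not the exact Pangu Ultra MoE formulation.

```python
import math
import torch
import torch.nn as nn

class SandwichNormBlock(nn.Module):
    """Generic sandwich-norm transformer block (illustrative, not the exact DSSN).

    Each sublayer is wrapped in a pre-norm and a post-norm, and the residual
    branch is scaled by a depth-dependent factor so deeper stacks stay
    numerically tame.
    """

    def __init__(self, d_model: int, n_heads: int, num_layers: int):
        super().__init__()
        self.pre_attn_norm = nn.LayerNorm(d_model)
        self.post_attn_norm = nn.LayerNorm(d_model)
        self.pre_ffn_norm = nn.LayerNorm(d_model)
        self.post_ffn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Assumed depth scaling: damp residual branches as the stack deepens.
        self.residual_scale = 1.0 / math.sqrt(2.0 * num_layers)
        # Assumed "TinyInit"-style small initialization of projection weights.
        for lin in (self.ffn[0], self.ffn[2]):
            nn.init.normal_(lin.weight, std=0.02 / math.sqrt(2.0 * num_layers))
            nn.init.zeros_(lin.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pre_attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.residual_scale * self.post_attn_norm(attn_out)
        h = self.pre_ffn_norm(x)
        x = x + self.residual_scale * self.post_ffn_norm(self.ffn(h))
        return x
```

The design intent of sandwich-style normalization plus small initialization is to bound the magnitude of each residual update, which is consistent with the reported reduction in gradient spikes, though the exact mechanism in Pangu Ultra MoE may differ.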
Pangu Ultra Near-Trillion-Parameter MoE Model: Industry-Leading, Built on Ascend-Native Long-Term Stable Training
第一财经· 2025-05-29 10:50
Core Viewpoint
- The article discusses advancements in the Pangu Ultra MoE model, a near-trillion-parameter MoE model trained on Ascend NPUs, focusing on its architecture, training methods, and performance improvements [1][3].

Group 1: Model Architecture and Training Innovations
- Pangu Ultra MoE has a total parameter count of 718 billion, with 39 billion activated parameters, using 256 routing experts of which each token activates 8 [5][6].
- The model employs Depth-Scaled Sandwich-Norm (DSSN) and TinyInit to enhance training stability, achieving a 51% reduction in gradient spikes [7][11].
- The training process uses a dropless training strategy, allowing long-term stable training on over 10 trillion tokens [1][7].

Group 2: Performance and Efficiency
- The architecture is optimized for the Ascend NPU platform by jointly considering computation, communication, and memory metrics, yielding superior training and inference throughput [3][5].
- Pangu Ultra MoE demonstrates robust performance across various authoritative open-source evaluation sets, outperforming several mainstream models on multiple benchmarks [6][4].

Group 3: Load Balancing and Expert Specialization
- The EP-group loss method maintains load balancing among experts while still allowing expert specialization, improving overall training efficiency [12][15] (an illustrative auxiliary-loss sketch follows this summary).
- The design allows flexible routing choices, promoting expert specialization by data domain, as evidenced by significant differences in expert selection across languages [16][17].

Group 4: Multi-Token Prediction and Reinforcement Learning
- The Multi-Token Prediction (MTP) strategy improves inference efficiency by proposing multiple candidate tokens that the main model then verifies, achieving a 38% increase in acceptance length [20][22].
- The reinforcement learning system in Pangu Ultra MoE addresses training stability and inference performance by iteratively mining difficult examples and employing a multi-capability reward system [24][27].
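The articles describe an EP-group load balancing loss but not its exact form. A common way to write a load-balancing auxiliary loss is the number of experts times the sum, over experts, of each expert's routed-token fraction multiplied by its mean gate probability; computing those statistics across an expert-parallel group rather than within a single micro-batch is the assumption illustrated below. This is a generic PyTorch sketch, not the Pangu implementation.

```python
import torch
import torch.distributed as dist

def ep_group_balance_loss(router_logits: torch.Tensor,
                          top_k: int,
                          ep_group=None) -> torch.Tensor:
    """Illustrative load-balancing auxiliary loss aggregated over an EP group.

    router_logits: [num_tokens, num_experts] raw router scores.
    Standard form: num_experts * sum_i(frac_tokens_i * mean_prob_i), with the
    per-expert statistics all-reduced across the expert-parallel group (assumed).
    """
    num_experts = router_logits.size(-1)
    probs = torch.softmax(router_logits, dim=-1)             # [T, E]
    top_idx = probs.topk(top_k, dim=-1).indices              # [T, k]

    # Fraction of routed (token, slot) assignments landing on each expert.
    counts = torch.zeros(num_experts, device=probs.device, dtype=probs.dtype)
    counts.scatter_add_(0, top_idx.reshape(-1),
                        torch.ones_like(top_idx.reshape(-1), dtype=probs.dtype))
    mean_prob = probs.mean(dim=0)                             # [E]
    total = torch.tensor(float(top_idx.numel()), device=probs.device)

    if ep_group is not None and dist.is_initialized():
        # Assumed EP-group aggregation: balance load across the whole group,
        # not within each rank's micro-batch alone.
        dist.all_reduce(counts, group=ep_group)
        dist.all_reduce(mean_prob, group=ep_group)
        dist.all_reduce(total, group=ep_group)
        mean_prob = mean_prob / dist.get_world_size(group=ep_group)

    frac_tokens = counts / total
    return num_experts * torch.sum(frac_tokens * mean_prob)
```

Widening the scope over which balance is enforced, from a single micro-batch to a larger group, is one way to leave individual batches freer to route by content, which matches the articles' claim that the method preserves expert specialization while keeping aggregate load even.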
Training Large Models Can Finally "Have It All"
虎嗅APP· 2025-05-29 10:34
Core Insights
- The article discusses advancements in the MoE (Mixture of Experts) architecture, particularly Huawei's Pangu Ultra MoE, which aims to balance model performance and efficiency while addressing the challenges of training large-scale models [1][6][33].

Group 1: MoE Model Innovations
- Huawei's Pangu Ultra MoE model has a parameter scale of 718 billion and is designed to optimize the performance and efficiency of large-scale MoE architectures [6][9].
- The model incorporates architectures such as MLA (Multi-head Latent Attention) and MTP (Multi-token Prediction), enhancing its training and inference capabilities [6][7].
- The Depth-Scaled Sandwich-Norm (DSSN) and TinyInit methods improve training stability, reducing gradient spikes by 51% and enabling long-term stable training over more than 10 trillion tokens [11][12][14].

Group 2: Load Balancing and Efficiency
- The EP (Expert Parallelism) group load balancing method ensures efficient token distribution among experts, improving training efficiency without compromising model specialization [19][20].
- Pangu Ultra MoE employs an EP-Group load balancing loss that allows flexible routing choices, promoting expert specialization while maintaining computational efficiency [20][21].

Group 3: Training Techniques and Performance
- The pre-training phase uses dropless training and supports sequence lengths up to 128k, improving learning efficiency on target data [8][14].
- MTP enables speculative inference, improving acceptance length by 38% compared to single-token prediction [24][27] (a minimal speculative-verification sketch follows this summary).
- The reinforcement learning system designed for post-training focuses on iterative hard example mining and multi-capability collaboration, ensuring comprehensive performance across various tasks [28][31].

Group 4: Future Implications
- The advancements presented in Pangu Ultra MoE provide a viable path for deploying sparse large models at scale, pushing the performance limits and engineering applicability of MoE architectures [33].
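The summaries report that MTP-based speculative inference raises the acceptance length by 38% but do not describe the verification procedure. Below is a minimal sketch of standard greedy speculative verification, where `target_argmax` is a hypothetical callable standing in for one forward pass of the main model over the draft-extended sequence; it illustrates the general technique, not Pangu's implementation.

```python
from typing import Callable, List

def count_accepted(context: List[int],
                   draft_tokens: List[int],
                   target_argmax: Callable[[List[int]], List[int]]) -> int:
    """Greedy speculative verification of MTP-style draft tokens (illustrative).

    target_argmax(seq) is assumed to return, from a single forward pass, the
    target model's argmax prediction at every position of `seq`, so that all
    draft positions are verified at once.
    """
    seq = context + draft_tokens
    preds = target_argmax(seq)                    # preds[i] predicts seq[i + 1]
    accepted = 0
    for i, tok in enumerate(draft_tokens):
        # The prediction made at the position just before this draft token.
        if preds[len(context) - 1 + i] == tok:
            accepted += 1
        else:
            break
    # Acceptance length for this step: accepted drafts + 1 token the target emits.
    return accepted + 1
```

Averaging this value over decoding steps gives the mean acceptance length; a longer acceptance length means fewer target-model forward passes per generated token, which is how the reported 38% improvement translates into faster inference.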