MoE Models

Huawei Makes a Major Release!
Securities Times · 2025-06-20 10:40
Core Viewpoint
- Huawei's Pangu model has been deployed across more than 30 industries and 500 scenarios, and the launch of Pangu model 5.5 marks a comprehensive upgrade of its five foundational models [1]

Group 1: Pangu Model Developments
- The Pangu model has been successfully implemented in sectors such as government, finance, manufacturing, healthcare, coal mining, steel, railways, autonomous driving, and meteorology, showcasing its transformative potential across industries [1]
- Huawei introduced the Pangu Ultra MoE model with a parameter scale of 718 billion, a significant leap in model training capability on the Ascend AI computing platform [1][2]
- The Pangu team has innovated in model architecture and training methods, achieving stable training of ultra-large sparse models, a notable challenge in the field [2]

Group 2: Technical Innovations
- The Depth-Scaled Sandwich-Norm (DSSN) architecture and TinyInit initialization method enabled long-term stable training on more than 18TB of data on the Ascend platform [2]
- The Pangu Ultra MoE model employs advanced components such as MLA and MTP, optimizing both the pre-training and post-training phases to balance model performance and efficiency [2][3]
- Recent upgrades to the training system improved pre-training efficiency, raising model FLOPs utilization (MFU) from 30% to 41% [3] (a worked MFU estimate follows this summary)

Group 3: Industry Impact and Ecosystem Development
- The advancements in the Pangu model demonstrate full-stack domestic AI capability, reaching international standards in ultra-large sparse model training and optimization [4]
- The launch of HarmonyOS 6 at the Huawei Developer Conference 2025 aims to enhance user experience and AI capabilities across applications [4]
- The Harmony ecosystem is entering a new phase of acceleration, with over 30,000 applications and services in development, indicating significant demand for talent in the industry [5]
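For context on the MFU figure: model FLOPs utilization compares the training FLOPs a run actually achieves against the hardware's peak. Below is a minimal sketch of the standard estimate, using the common rule of thumb of roughly 6 FLOPs per activated parameter per token for a forward+backward step; the throughput and peak-FLOPs inputs are illustrative placeholders, not Huawei's published numbers.

```python
def mfu(tokens_per_sec: float, active_params: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved training FLOPs / hardware peak FLOPs.
    Uses the common ~6 FLOPs per (activated) parameter per token estimate
    for one forward+backward training step."""
    return 6 * active_params * tokens_per_sec / peak_flops

# Illustrative placeholders only: 39e9 activated parameters (per the Pangu
# Ultra MoE description later in this digest) and made-up cluster numbers.
print(f"MFU = {mfu(tokens_per_sec=35_000, active_params=39e9, peak_flops=2e16):.0%}")
```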
2025H2 New Hardware Outlook: Viewing New Hardware from the Nodes of the Technology Tree
Shenwan Hongyuan Securities · 2025-06-09 07:39
Investment Rating
- The report does not explicitly state an investment rating for the industry

Core Insights
- The report emphasizes hardware-software innovation axes, predicting significant advancements in new hardware technologies by 2025H2, with a focus on both short-term and long-term investment opportunities [4][20]
- Key short-term opportunities include GPU+HBM, optical devices, silicon photonics, lidar, automotive chips, RoboVan, and AI glasses, while long-term innovations are deemed more critical [4][20]
- The report highlights 2B market opportunities in optical devices, silicon photonics, GPUs, and high-end products, alongside 2C market opportunities in automotive, RoboVan, wearables, and bio-electronic interactive devices [4][20]

Summary by Sections
1. Hardware-Software Innovation Axes
- The report proposes the "hardware Y-software X" axis as a framework for predicting new hardware innovation, linking technological advancements from 2022H2 to 2025H2 [4][20]
- It identifies architecture innovation and "physical-chemical-biological AI" as critical elements for future hardware development [4][20]
2. Market Opportunities
- The 2B market is characterized by opportunities in optical devices, silicon photonics, and high-end GPUs, while the 2C market includes automotive technologies, RoboVan, wearables, and bio-electronic devices [4][20]
- The report notes that the optical-device opportunity arises from the MoE architecture, which differs from simple computational upgrades under the "Scaling Law" [4][20]
3. Underestimated Factors
- The report points out two often-overlooked factors: architecture innovation and the integration of physical-chemical-biological AI, both crucial for the advancement of new hardware [4][20]
4. Representative Companies
- Optical devices: NewEase, Zhongji Xuchuang, Huagong Technology, Changguang Huaxin
- Lidar: Hesai Technology (US), Suteng Juchuang (HK)
- AR+AI glasses: Hongjing Optoelectronics, Crystal Optoelectronics, Hongsoft Technology, GoerTek, Xiaomi Group (HK)
- Advanced semiconductor processes and GPUs: SMIC, Muxi Integration, Suiyuan Technology, Haiguang Information, Cambrian [6][20]
Overhauling Large-Model Training: Huawei Throws an Ascend + Kunpeng Combination Punch
Huxiu APP · 2025-06-04 10:35
Core Viewpoint
- The article discusses Huawei's advancements in AI training, particularly its optimization of the Mixture of Experts (MoE) model architecture, which enhances efficiency and reduces costs in AI model training [1][34]

Group 1: The MoE Model and Its Challenges
- The MoE model has become a preferred path for tech giants developing stronger AI systems, as its architecture addresses the computational bottlenecks of large-scale model training [2]
- Huawei identified two main obstacles to single-node training efficiency: low operator computation efficiency and insufficient NPU memory [6][7]

Group 2: Enhancements in Training Efficiency
- The collaboration between Ascend and Kunpeng significantly improved training operator computation efficiency and memory utilization, achieving a 20% increase in throughput and a 70% reduction in memory usage [3][18]
- Three optimization strategies target core MoE operators: a "slimming technique" for FlashAttention, a "balancing technique" for MatMul, and a "transport technique" for Vector operators, together yielding a 15% increase in overall training throughput [9][10][13]

Group 3: Operator Dispatch Optimization
- Huawei's optimizations reduced operator-dispatch waiting time to nearly zero, raising the utilization of available computational power [19][25]
- The Selective R/S memory-optimization technique cuts activation memory by 70% during training, showcasing Huawei's approach to memory management [26][34] (a generic recomputation sketch follows this summary)

Group 4: Industry Implications
- Huawei's advancements not only clear obstacles to large-scale MoE model training but also offer a valuable reference path for the industry, demonstrating the company's deep technical accumulation in AI computing [34]
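Selective activation recomputation, the general family Selective R/S belongs to, trades compute for memory: activations that are cheap to rebuild are dropped during the forward pass and recomputed during backward. Here is a minimal sketch using PyTorch's built-in checkpoint utility; it illustrates the generic technique, not Huawei's actual implementation (which also swaps activations to host memory), and `BigFFN` is an invented placeholder module.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class BigFFN(nn.Module):
    """A memory-hungry feed-forward block whose intermediate activations we
    choose NOT to keep: they are recomputed during the backward pass."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # use_reentrant=False is the recommended checkpointing mode
        return checkpoint(self.net, x, use_reentrant=False)

x = torch.randn(8, 128, 1024, requires_grad=True)
y = BigFFN(1024, 4096)(x)   # forward keeps no intermediate FFN activations
y.sum().backward()          # intermediates are recomputed here on demand
```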
How Was Huawei's Near-Trillion-Parameter Large Model Trained?
Huxiu APP · 2025-05-30 10:18
Core Viewpoint
- The article discusses Huawei's advancements in AI training systems, focusing on the MoE (Mixture of Experts) architecture and its optimization through the MoGE (Mixture of Grouped Experts) framework, which enhances efficiency and reduces costs in AI model training [1][2]

Summary by Sections
Introduction to MoE and Huawei's Innovations
- The MoE model, initially proposed by Canadian scholars, has evolved significantly, and Huawei is now optimizing the architecture to address its inefficiency and cost issues [1]
- Huawei's MoGE architecture aims to create a more balanced and efficient training environment for AI models, contributing to the ongoing AI competition [1]

Performance Metrics and Achievements
- Huawei's training system, built on the "Ascend + Pangu Ultra MoE" combination, achieved 41% MFU (Model FLOPs Utilization) during pre-training and a throughput of 35K tokens/s during post-training on the CloudMatrix 384 super node [2][26][27]

Challenges in MoE Training
- Six main challenges in MoE training are identified: difficult parallel-strategy configuration, All-to-All communication bottlenecks, uneven system load distribution, excessive operator-scheduling overhead, complex training-process management, and limitations on large-scale expansion [3][4] (a generic sketch of the top-k expert routing at the heart of such systems follows this summary)

Solutions and Innovations
- **First Strategy: Enhancing Training Cluster Utilization** - Huawei implemented intelligent parallel-strategy selection and global dynamic load balancing to improve overall training efficiency [6][11]; a modeling and simulation framework automates selection of the optimal parallel configuration for the Pangu Ultra MoE model [7]
- **Second Strategy: Releasing the Computing Power of Single Nodes** - The focus shifted to operator computation efficiency, achieving a twofold increase in micro-batch size (MBS) and reducing the host-bound ratio to below 2% [15][16][17]
- **Third Strategy: High-Performance, Scalable RL Post-Training** - RL Fusion technology allows flexible deployment modes and significantly improves resource utilization during post-training [19][21]; the design raises overall training throughput by 50% while maintaining model accuracy [21]

Technical Specifications of Pangu Ultra MoE
- The Pangu Ultra MoE model has 718 billion parameters in a 61-layer Transformer architecture, achieving high performance and scalability [26]
- Training used a large-scale cluster of 6K-10K cards, demonstrating strong generalization capability and efficient scaling potential [26][27]
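For reference, the routing step these systems revolve around can be sketched generically: a learned gate sends each token to its top-k experts (8 of 256 in Pangu Ultra MoE's case, per the next article's details). This is a textbook top-k router, not Pangu's actual routing code, and `d_model=1024` is an invented placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Generic MoE router: each token is dispatched to its top-k experts
    out of n_experts (e.g. 8 of 256, as described for Pangu Ultra MoE)."""
    def __init__(self, d_model: int, n_experts: int = 256, k: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):                      # x: [tokens, d_model]
        logits = self.gate(x)                  # [tokens, n_experts]
        weights, idx = torch.topk(logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen k
        return weights, idx                    # combine weights + expert ids

router = TopKRouter(d_model=1024, n_experts=256, k=8)
w, ids = router(torch.randn(4, 1024))
print(ids.shape)  # torch.Size([4, 8]) -> 8 expert ids per token
```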
Huawei Releases the Model Architecture and Training Details of the Near-Trillion-Parameter Pangu Ultra MoE
News flash · 2025-05-30 07:33
Core Insights
- Huawei has made significant advancements in MoE model training by launching a new model, Pangu Ultra MoE, with a parameter scale of 718 billion [1]
- The model is trained on the Ascend AI computing platform and, as a near-trillion-parameter MoE model, showcases the performance leap achieved in ultra-large-scale MoE training [1]
- Huawei has released a technical report detailing the architecture and training methods of Pangu Ultra MoE, highlighting innovative designs that address the training-stability challenges of ultra-large-scale, highly sparse MoE models [1]
Pangu Ultra, a Near-Trillion-Parameter MoE Model: Industry-Leading, Born of Ascend-Native Long-Term Stable Training
Yicai · 2025-05-29 10:50
Core Viewpoint
- The article discusses the Pangu Ultra MoE model, a near-trillion-parameter MoE model trained on Ascend NPUs, focusing on its architecture, training methods, and performance improvements [1][3]

Group 1: Model Architecture and Training Innovations
- Pangu Ultra MoE has a total parameter count of 718 billion, with 39 billion activated parameters, using 256 routing experts of which each token activates 8 [5][6]
- The model employs the Depth-Scaled Sandwich-Norm (DSSN) and TinyInit methods to enhance training stability, achieving a 51% reduction in gradient spikes [7][11] (a sketch of the sandwich-norm idea follows this summary)
- The training process uses a dropless strategy, enabling long-term stable training on more than 10 trillion tokens [1][7]

Group 2: Performance and Efficiency
- The architecture is co-designed for the Ascend NPU platform, jointly optimizing computation, communication, and memory metrics, which yields superior training and inference throughput [3][5]
- Pangu Ultra MoE performs robustly across authoritative open-source evaluation sets, outperforming several mainstream models on multiple benchmarks [6][4]

Group 3: Load Balancing and Expert Specialization
- The EP group loss method maintains load balance among experts while still allowing expert specialization, improving overall training efficiency [12][15]
- The design allows flexible routing choices that promote specialization by data domain, evidenced by markedly different expert selection across languages [16][17]

Group 4: Multi-Token Prediction and Reinforcement Learning
- The Multi-Token Prediction (MTP) strategy boosts inference efficiency by proposing multiple candidate tokens for the main model to verify, achieving a 38% increase in acceptance length [20][22]
- The reinforcement-learning system implemented in Pangu Ultra MoE addresses training stability and inference performance by iteratively mining hard examples and applying a multi-capability reward system [24][27]
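Sandwich-style normalization places a norm both before and after each sublayer, and a depth-scaled variant additionally damps each layer's residual update as the network gets deeper. The sketch below shows that general idea under assumed forms: the scaling rule and the TinyInit-style standard deviation here are invented illustrations, while Huawei's exact DSSN formulation and constants are in its technical report.

```python
import math
import torch
import torch.nn as nn

class DepthScaledSandwichBlock(nn.Module):
    """Sandwich norm: normalize the sublayer input AND its output, then
    scale the residual update by a depth-dependent factor (assumed form)."""
    def __init__(self, d_model: int, sublayer: nn.Module, n_layers: int):
        super().__init__()
        self.pre_norm = nn.LayerNorm(d_model)
        self.post_norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer
        # Hypothetical depth scaling: shrink residual updates as depth grows.
        self.scale = 1.0 / math.sqrt(2 * n_layers)

    def forward(self, x):
        return x + self.scale * self.post_norm(self.sublayer(self.pre_norm(x)))

def tiny_init_(linear: nn.Linear, n_layers: int) -> None:
    """TinyInit-style initialization (assumed form): smaller-than-usual std,
    shrinking with depth, to keep early training well-conditioned."""
    std = 0.02 / math.sqrt(2 * n_layers)
    nn.init.normal_(linear.weight, mean=0.0, std=std)

block = DepthScaledSandwichBlock(512, nn.Linear(512, 512), n_layers=61)
y = block(torch.randn(2, 16, 512))  # residual update is normed, then damped
```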
From "Stacked Building Blocks" to an "Organic Life Form": Ascend Supernodes Redefine the AI Computing Architecture
Huan Qiu Wang · 2025-05-26 10:06
Core Insights
- The rapid growth of large AI models is driving a new era of computing-power demand, exposing the limits of traditional cluster architectures for efficiently training these models [1][2]
- Traditional architectures face significant challenges, including communication bottlenecks, inefficient resource allocation, and reliability issues, all of which hinder large-model training efficiency [2][3]

Summary by Sections
Challenges in Traditional Architectures
- Communication bottlenecks have worsened sharply: MoE models increase inter-node communication demands, leading to delays of over 2ms on traditional 400G networks [1][2]
- Resource allocation is static and cannot adapt to dynamic changes in model structure; uneven load distribution reduces overall training efficiency by 30% [1][2]
- Reliability suffers as the probability of node failure rises with scale, and lengthy recovery processes waste significant resources, with some companies losing over a million dollars per training interruption [2]

Emergence of the Ascend Supernode Architecture
- The Ascend supernode architecture fundamentally restructures the computing system around a "three-dimensional integration" approach [3][5]
- A breakthrough in hardware interconnect allows many NPUs to work as a single computer, increasing inter-node communication bandwidth 15-fold and reducing latency from 2ms to 0.2ms [3][5]
- Unified global memory addressing through virtualization enables direct memory access across nodes, improving the efficiency of parameter synchronization during model training [5][6]

Innovations in Resource Management and Reliability
- Intelligent resource scheduling performs fine-grained, dynamic task allocation matched to the MoE model structure, improving the compute-to-communication time ratio from 1:1 to 3:1 [5][6]
- System reliability has improved significantly, with mean uptime rising from hours to days and recovery time dropping from hours to 15 minutes [5][6]

Industry Impact and Future Prospects
- The Ascend supernode architecture delivers a threefold increase in training performance over traditional nodes, establishing a new benchmark in AI computing [8]
- The introduction of MindIE Motor enhances large-scale expert-parallel capabilities, achieving four times the throughput of traditional server stacks [8]
- Huawei frames its commitment to architecture innovation as a new form of Moore's Law, positioning the company as a leader in the AI computing landscape [9]
FlashComm, Ascend's Killer Move: Turning Model Inference from a Single Lane into Multiple Lanes
Leiphone · 2025-05-22 11:29
Core Viewpoint
- The article discusses the communication challenges MoE (Mixture of Experts) models face in large-scale inference and how Huawei has addressed these issues through innovative solutions that optimize performance.

Group 1: Communication Challenges
- The rapid growth of MoE model parameters, often exceeding hundreds of billions, poses significant storage and scheduling challenges, and the resulting communication-bandwidth demands can cause network congestion [6][10]
- Traditional communication strategies such as AllReduce have limitations, particularly under high concurrency, where they contribute a large share of end-to-end inference latency [7][11]
- Tensor parallelism (TP), while effective at reducing per-device model weight size, relies on AllReduce operations that worsen overall network latency in multi-node deployments [7][12]

Group 2: Huawei's Solutions
- Huawei introduced a multi-stream parallel technology that processes different data streams simultaneously, significantly reducing critical-path latency; for the DeepSeek model this yields a 10% speedup in the Prefill phase and a 25-30% increase in Decode throughput [12][14]
- The AllReduce operation was restructured to first reduce and scatter data across devices (ReduceScatter) and then gather only the essential information (AllGather), cutting communication volume by 35% and boosting the DeepSeek model's Prefill inference performance by 22-26% [14][15] (see the sketch after this summary)
- By adjusting the parallel dimensions of matrix multiplication, Huawei cut communication volume by 86% in the attention-to-FFN transition phase, yielding a 33% overall inference speedup [15][19]

Group 3: Future Directions
- Huawei plans further innovation in multi-stream parallelism, automatic weight prefetching, and model parallelism to keep improving the performance of large-scale MoE inference systems [19][20]
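The ReduceScatter + AllGather split is a standard collective identity: an AllReduce is equivalent to reducing shards across ranks and then gathering them back. Below is a generic torch.distributed sketch of that decomposition; FlashComm's savings come from what it does between and to the two phases, and the lower-precision hand-off mentioned in the comment is an assumption, not Huawei's published recipe.

```python
import torch
import torch.distributed as dist

def allreduce_two_phase(x: torch.Tensor) -> torch.Tensor:
    """AllReduce expressed as ReduceScatter followed by AllGather.
    Assumes dist.init_process_group() was already called (e.g. via torchrun)
    and x.numel() is divisible by the world size."""
    world = dist.get_world_size()
    flat = x.flatten()
    shard = torch.empty(flat.numel() // world, dtype=x.dtype, device=x.device)
    dist.reduce_scatter_tensor(shard, flat)   # phase 1: each rank reduces one shard
    # Between phases each rank holds only its reduced shard; a FlashComm-style
    # scheme could shrink what is re-broadcast here (e.g. a smaller dtype).
    out = torch.empty_like(flat)
    dist.all_gather_into_tensor(out, shard)   # phase 2: shards reassembled everywhere
    return out.view_as(x)
```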