Digesting a college-level advanced math problem every 2 seconds! Huawei finally reveals the full pipeline of its near-trillion-parameter MoE training system on Ascend
华尔街见闻· 2025-05-30 09:38
Core Viewpoint
- Huawei has achieved significant advances in large-model training with its "Ascend + Pangu Ultra MoE" system, demonstrating a fully domestic, GPU-free training pipeline that improves computational efficiency and model performance [3][4][38].

Group 1: Technical Innovations
- Huawei's training system reached a model FLOPs utilization (MFU) of 41% during the pre-training phase on the Ascend Atlas 800T A2 cluster [4][38]; a back-of-the-envelope MFU sketch follows this summary.
- The Pangu Ultra MoE model has 718 billion parameters in a distinctive 61-layer architecture, 58 of them MoE layers, designed for high performance and scalability [38][39].
- The system sustains a throughput of 35K tokens/s during the reinforcement learning (RL) post-training phase, showing how rapidly it can process complex tasks [39].

Group 2: Challenges Addressed
- The report identifies six key challenges in current MoE pre-training and RL post-training, including difficult parallel-strategy configuration, communication bottlenecks, and uneven system load distribution [7][10][12][13].
- Huawei has built a comprehensive end-to-end solution to these challenges, focused on raising training-cluster utilization and improving communication efficiency [14][16][25].

Group 3: Specific Solutions
- The first strategy raises training-cluster utilization through intelligent parallel-strategy selection and global dynamic load balancing, significantly improving overall training efficiency [16][23].
- The second strategy unlocks computational power at the single-node level by optimizing training operators and improving memory management, doubling the micro-batch size [26][30].
- The third strategy introduces high-performance, scalable RL post-training technology that allows flexible deployment modes and doubles the utilization of RL post-training clusters [33][34].
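As a rough illustration of what the 41% MFU figure measures, here is a minimal sketch of the standard MFU calculation. The 6 · params · tokens FLOPs estimate is a common rule of thumb for transformer training, and every concrete number below is a hypothetical placeholder, not a value from the report.

```python
# Minimal sketch of Model FLOPs Utilization (MFU): the fraction of the
# cluster's peak FLOPs that training actually achieves. The 6 * N * T
# forward+backward FLOPs estimate is a common approximation; for an MoE
# model, N should be the *activated* parameter count per token.

def mfu(tokens_per_sec: float,
        activated_params: float,
        num_devices: int,
        peak_flops_per_device: float) -> float:
    achieved = 6.0 * activated_params * tokens_per_sec  # training FLOPs/s
    peak = num_devices * peak_flops_per_device          # cluster peak FLOPs/s
    return achieved / peak

# All numbers here are illustrative placeholders, not from the article.
u = mfu(tokens_per_sec=1.0e6, activated_params=39e9,
        num_devices=4000, peak_flops_per_device=3.0e14)
print(f"MFU ~ {u:.1%}")
```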
Large-model inference no longer runs on a single track
虎嗅APP· 2025-05-22 11:41
Core Viewpoint
- The article discusses the challenges and innovations in deploying large models, focusing on Huawei's approach to improving efficiency and user experience with large language models and the Mixture of Experts (MoE) architecture [1][2].

Group 1: Challenges in Large Model Deployment
- The MoE architecture carries significant hardware costs and efficiency problems, making it hard for Chinese companies to keep pace in the competitive AI landscape [1].
- As MoE models keep scaling, the number of experts and the total parameter count grow exponentially, creating severe storage and scheduling challenges [7].
- Traditional communication strategies such as AllReduce fall short under high concurrency, making large-model inference inefficient [8].

Group 2: Innovations by Huawei
- Huawei's multi-stream parallel technology breaks the serial constraints of computation, allowing different data streams to be processed simultaneously and significantly reducing critical-path latency [12][15]; a generic overlap sketch follows this summary.
- The AllReduce operation has been innovatively restructured to improve communication efficiency, cutting data transmission volume by 35% and lifting inference performance by 22-26% [15][17].
- Huawei's FlashComm technology optimizes communication in large-model inference by exploiting low-dimensional data characteristics, improving end-to-end inference performance [21].

Group 3: Future Directions
- Huawei plans to keep innovating in areas such as multi-stream parallelism and automatic weight prefetching to further improve large-model inference systems [21].
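To make the multi-stream idea concrete, here is a minimal, generic sketch of overlapping a collective with independent computation on a side stream, written with PyTorch CUDA-stream APIs purely for illustration. It is not Huawei's implementation; on Ascend hardware the analogous primitives live in the CANN/torch_npu stack, and `overlapped_step` and its tensors are hypothetical names.

```python
# Generic sketch of multi-stream overlap (assumes torch.distributed is already
# initialized with an NCCL process group and a CUDA device is available). An
# independent collective runs on a side stream while the default stream keeps
# computing, hiding communication latency behind unrelated work.
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()

def overlapped_step(x: torch.Tensor, shared: torch.Tensor) -> torch.Tensor:
    # Launch the collective asynchronously on the side stream.
    with torch.cuda.stream(comm_stream):
        work = dist.all_reduce(shared, async_op=True)
    # Meanwhile the default stream proceeds with computation that does not
    # depend on the collective's result.
    y = torch.relu(x @ x.transpose(-1, -2))
    # Synchronize before anything consumes the reduced tensor.
    work.wait()
    return y
```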
Ascend's trump card FlashComm turns model inference from a single lane into a multi-lane highway
雷峰网· 2025-05-22 11:29
Core Viewpoint
- The article discusses the communication challenges that MoE (Mixture of Experts) models face in large-scale inference and how Huawei has addressed them with solutions that optimize performance.

Group 1: Communication Challenges
- The rapid growth of MoE model parameters, often into the hundreds of billions, creates major storage and scheduling challenges and drives communication bandwidth demands to the point of network congestion [6][10].
- Traditional communication strategies like AllReduce have limitations, particularly under high concurrency, where they account for a large share of end-to-end inference latency [7][11].
- Tensor parallelism (TP), while effective at reducing per-device weight size, depends on AllReduce operations that worsen overall network latency in multi-node deployments [7][12].

Group 2: Huawei's Solutions
- Huawei introduced a multi-stream parallel technology that processes different data streams simultaneously, significantly reducing critical-path latency and yielding a 10% speedup in the Prefill phase and a 25-30% increase in Decode throughput for the DeepSeek model [12][14].
- The AllReduce operation has been restructured to first reduce the data into per-rank shards (ReduceScatter) and then broadcast the essential results (AllGather), cutting communication volume by 35% and boosting the DeepSeek model's Prefill inference performance by 22-26% [14][15]; a decomposition sketch follows this summary.
- By adjusting the parallel dimensions of matrix multiplication, Huawei cut communication volume by 86% in the attention-to-MoE transition phase, for a 33% overall inference speedup [15][19].

Group 3: Future Directions
- Huawei plans to continue innovating in multi-stream parallelism, automatic weight prefetching, and model parallelism to further improve large-scale MoE inference systems [19][20].
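The restructuring described above rests on the standard equivalence AllReduce = ReduceScatter + AllGather. The sketch below shows that decomposition using torch.distributed names for illustration; Huawei's version targets Ascend/HCCL rather than NCCL, and the per-shard optimization that FlashComm exploits between the two phases is only indicated by a comment.

```python
# Sketch: the AllReduce = ReduceScatter + AllGather equivalence. Between the
# two phases each rank holds only a reduced 1/world_size shard, which is where
# FlashComm-style tricks (e.g. lower-precision transfer of the shard) can act
# on far less data. Assumes torch.distributed is initialized and x's first
# dimension is divisible by the world size.
import torch
import torch.distributed as dist

def allreduce_via_rs_ag(x: torch.Tensor) -> torch.Tensor:
    world = dist.get_world_size()
    shard = torch.empty_like(x.chunk(world)[0])  # buffer for this rank's shard
    # Phase 1: reduce, leaving each rank with its own fully reduced shard.
    dist.reduce_scatter_tensor(shard, x)
    # (Per-shard optimization would go here, on 1/world of the data.)
    out = torch.empty_like(x)
    # Phase 2: gather every rank's reduced shard back onto all ranks.
    dist.all_gather_into_tensor(out, shard)
    return out
```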
Speeding up large models by 80%: Huawei unveils FlashComm, its Ascend inference trump card, with three moves to crack the communication-computation bottleneck
机器之心· 2025-05-22 04:13
Core Viewpoint
- The article discusses how Huawei's FlashComm technology optimizes communication in large-model inference, addressing the challenges posed by the exponential growth of parameters and experts in large language models (LLMs) [2][6][17].

Group 1: Communication Challenges
- Growing cluster scale and inference concurrency put heavy pressure on communication in LLM serving, especially as Mixture of Experts (MoE) models expand and the number of experts and total parameters grow exponentially [6][18].
- Traditional communication strategies like AllReduce hit their limits under high concurrency, where bandwidth constraints inflate end-to-end inference latency [6][8].

Group 2: FlashComm Innovations
- FlashComm1 optimizes AllReduce communication by decomposing it into ReduceScatter and AllGather, yielding a 26% inference performance improvement [7][11].
- FlashComm2 rebalances computation and communication by reshaping three-dimensional tensors into two-dimensional matrices, for a 33% overall inference speedup [7][14]; a toy volume comparison follows this summary.
- FlashComm3 uses multi-stream parallelism to raise the efficiency of MoE inference, increasing decoding-phase throughput by 25-30% [7][15].

Group 3: Future Directions
- The Huawei team aims to innovate further in multi-stream parallelism, automatic weight prefetching, and automatic multi-stream parallelization of models to improve large-model inference systems [17][18].
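For intuition about why FlashComm2's reshaping helps, here is a toy back-of-the-envelope comparison of communication volume when a collective is moved from a full hidden-state tensor to a lower-dimensional projection of it. The shapes and the 8x projection factor are hypothetical illustrations, not the article's actual model dimensions or the source of its 86% and 33% figures.

```python
# Toy arithmetic: why collectives on lower-dimensional tensors are cheaper.
# All shapes and the 8x projection factor are hypothetical placeholders.

batch, seq, hidden = 8, 1024, 7168   # activations entering the collective
proj = hidden // 8                   # a lower-dimensional projection of them
bytes_fp16 = 2

before = batch * seq * hidden * bytes_fp16  # communicate full hidden states
after = batch * seq * proj * bytes_fp16     # communicate the projection only

print(f"before: {before / 2**20:.0f} MiB")  # 112 MiB
print(f"after:  {after / 2**20:.0f} MiB")   # 14 MiB
print(f"saved:  {1 - after / before:.0%}")  # 88% less traffic in this toy case
```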