Mixture of Experts (MoE)
Huawei Pangu makes its debut: an Ascend-native 72B MoE architecture, tied for first in China among sub-100B models on SuperCLUE
华尔街见闻· 2025-05-29 00:57
Core Insights
- The emergence of the Mixture of Grouped Experts (MoGE) model by Huawei's Pangu team addresses the inefficiencies of traditional Mixture of Experts (MoE) models, ensuring balanced computational load across devices while maintaining high performance [1][7][27]
- The Pangu Pro MoE model, with 72 billion total parameters and 16 billion active parameters, achieves competitive performance in the industry, ranking first among models with less than 100 billion parameters in China [2][22]

Group 1: Model Architecture and Efficiency
- The MoGE architecture introduces a grouping mechanism that ensures balanced expert activation, significantly improving computational efficiency and reducing system bottlenecks (see the routing sketch after this summary) [1][6][12]
- The model demonstrates superior throughput, achieving 321 tokens/s on the Ascend 300I Duo platform and 1528 tokens/s on the Ascend 800I A2 platform, outperforming similar-sized dense models [18][26]

Group 2: Performance Metrics
- In the latest SuperCLUE ranking, Pangu Pro MoE scored 58.75, showcasing its strong capabilities in various reasoning tasks and outperforming other models in complex reasoning scenarios [3][22]
- The model exhibits excellent performance across multiple benchmarks, including English and Chinese language tasks, demonstrating its versatility and adaptability in complex cognitive tasks [22][23][24]

Group 3: Industry Impact
- The introduction of Pangu Pro MoE signifies a shift in the AI industry from a focus on parameter quantity to practical application, enabling efficient cloud inference and supporting high-concurrency real-time scenarios [27]
- Huawei's innovations in the MoE architecture redefine the value of large models, providing a robust foundation for AI applications across various industries [27]
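The grouping mechanism described above can be pictured as a router that partitions the experts into fixed groups and takes a top-k inside each group, so every token activates the same number of experts per group (and thus per device). The sketch below is only an illustration of that idea under assumed sizes and function names; it is not Huawei's implementation, and the 64-expert / 8-group split is a stand-in, not a confirmed configuration.

```python
# Minimal sketch of grouped top-k routing in the spirit of MoGE: experts are
# partitioned into groups and each token picks the same number of experts from
# every group, so per-device load is balanced by construction.
# All sizes and names are illustrative assumptions, not Huawei's code.
import torch


def grouped_topk_route(router_logits: torch.Tensor, num_groups: int, k_per_group: int):
    """router_logits: [num_tokens, num_experts]; num_experts must split evenly into groups."""
    num_tokens, num_experts = router_logits.shape
    experts_per_group = num_experts // num_groups
    # View logits as [tokens, groups, experts_in_group] and take top-k inside each group.
    grouped = router_logits.view(num_tokens, num_groups, experts_per_group)
    topk_vals, topk_idx = grouped.topk(k_per_group, dim=-1)
    # Convert in-group indices back to global expert ids.
    group_offset = torch.arange(num_groups, device=router_logits.device) * experts_per_group
    expert_ids = topk_idx + group_offset.view(1, num_groups, 1)
    # Normalize the selected logits into routing weights.
    weights = torch.softmax(topk_vals.flatten(1), dim=-1)
    return expert_ids.flatten(1), weights  # each token activates num_groups * k_per_group experts


if __name__ == "__main__":
    logits = torch.randn(4, 64)            # 4 tokens, 64 routed experts (assumed sizes)
    ids, w = grouped_topk_route(logits, num_groups=8, k_per_group=1)
    print(ids.shape, w.shape)              # every token activates exactly 8 experts, one per group
```

Because the per-group quota is fixed, the number of activations per group never drifts, which is what lets the load stay balanced across the devices hosting those groups.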
Huawei Pangu makes its debut: an Ascend-native 72B MoE architecture, tied for first in China among sub-100B models on SuperCLUE
机器之心· 2025-05-28 08:09
Core Insights
- The article discusses the emergence of the Mixture of Grouped Experts (MoGE) model by Huawei's Pangu team, which addresses the inefficiencies of traditional Mixture of Experts (MoE) models by ensuring balanced computational load across devices [2][6][31]
- Pangu Pro MoE, built on the MoGE architecture, has demonstrated superior performance in industry benchmarks, achieving a score of 59 on the SuperCLUE leaderboard with only 72 billion parameters, making it competitive against larger models [3][26]

Technical Innovations
- The MoGE model introduces a grouping mechanism during the expert selection phase, which ensures that each token activates an equal number of experts within predefined groups, thus achieving load balancing across devices [2][12]
- The architecture utilizes a batch-level auxiliary loss function to maintain balanced expert activation, enhancing overall model efficiency (a sketch of such a loss follows this summary) [16][18]

Performance Metrics
- Pangu Pro MoE achieves a throughput of 321 tokens/s on the Ascend 300I Duo platform and 1528 tokens/s on the Ascend 800I A2 platform, significantly outperforming other models of similar scale [24]
- The model exhibits a nearly uniform expert load distribution, with each expert handling approximately 12.5% of the total token volume, indicating efficient resource utilization [29]

Industry Impact
- The introduction of Pangu Pro MoE signifies a shift from a "parameter arms race" to a focus on practical applications, reducing cloud inference costs and supporting high-concurrency real-time scenarios [31]
- Huawei's innovations in the AI field aim to redefine the value of large models, providing a robust foundation for enterprises to deploy billion-parameter models effectively [31]
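The article does not give the exact form of Pangu Pro MoE's batch-level auxiliary loss, so the sketch below follows the common load-balancing formulation (fraction of tokens routed to each expert multiplied by that expert's mean routing probability over the batch). Treat it as an assumption about the general technique, with illustrative names and sizes.

```python
# Sketch of a batch-level auxiliary load-balancing loss. The loss is minimized
# when both the token fraction f_i and the mean routing probability p_i are
# uniform across experts. This is a generic formulation, not Pangu Pro MoE's
# published objective.
import torch
import torch.nn.functional as F


def load_balance_loss(router_logits: torch.Tensor, expert_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """router_logits: [num_tokens, num_experts]; expert_ids: [num_tokens, k] selected experts."""
    probs = F.softmax(router_logits, dim=-1)                          # routing probabilities per token
    # f_i: fraction of routed token-slots assigned to expert i over the whole batch.
    counts = torch.bincount(expert_ids.flatten(), minlength=num_experts).float()
    f = counts / expert_ids.numel()
    # p_i: mean routing probability mass given to expert i over the batch.
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)


if __name__ == "__main__":
    logits = torch.randn(16, 64)               # 16 tokens, 64 experts (assumed sizes)
    ids = logits.topk(8, dim=-1).indices       # 8 selected experts per token
    print(load_balance_loss(logits, ids, num_experts=64))
```

Added to the task loss with a small weight, a term like this nudges the router toward the near-uniform expert load (roughly 12.5% of tokens per expert) reported above.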
Huawei + DeepSeek: finally no more "server busy"?
虎嗅APP· 2025-05-20 14:00
Core Viewpoint
- The article discusses the challenges and advancements in the development of large language models, particularly focusing on the MoE (Mixture of Experts) architecture and how Huawei has innovated to enhance its performance and efficiency in this domain [1][4]

Group 1: Challenges of MoE Models
- The MoE architecture faces significant challenges, particularly the "cold and hot expert" phenomenon, which leads to uneven load distribution and affects system performance [3][4]
- The uneven load results in increased inference latency and limited throughput due to underutilization of resources [3][4]

Group 2: Huawei's Innovations
- Huawei has introduced an efficient load balancing strategy called OmniPlacement, which significantly improves the inference performance of MoE models through expert reallocation, inter-layer redundancy deployment, and near-real-time dynamic scheduling (a placement sketch follows this summary) [6][7]
- The OmniPlacement algorithm optimizes the deployment order based on expert activation data, reducing load imbalance and enhancing system performance [6][7]

Group 3: Key Features of OmniPlacement
- The framework supports dynamic priority adjustment and communication domain optimization, which reduces communication overhead compared to traditional static allocation methods [7][9]
- It includes a near-real-time scheduling and dynamic monitoring mechanism that allows for efficient expert allocation and minimizes inference delays [9][10]

Group 4: Experimental Results
- Testing on the DeepSeek-V3 model showed that OmniPlacement reduced inference latency by approximately 10% and increased system throughput by about 10%, demonstrating significant improvements in resource utilization [14]
- The system maintained stability under dynamic input and high-load conditions, ensuring no performance fluctuations or service interruptions [14]

Group 5: Future Directions
- Future research will focus on optimizing scheduling algorithms, developing adaptive expert selection mechanisms, and expanding the OmniPlacement framework to support more types of MoE models [15]
- The release of OmniPlacement marks a significant advancement in MoE model inference performance and highlights Huawei's competitive edge in AI computing [15]
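Expert reallocation based on activation data can be thought of as a bin-packing step: observed call counts are used to spread hot and cold experts across devices so that total load per device evens out. The sketch below uses a simple greedy heuristic and invented names purely to illustrate the idea; it is not the OmniPlacement algorithm.

```python
# Illustrative sketch of load-aware expert placement: given per-expert call
# counts collected at inference time, assign experts to devices so that total
# load per device is as even as possible (greedy "hottest first onto the
# lightest device" heuristic). An assumption about the general idea behind
# expert reallocation, not Huawei's OmniPlacement code.
from typing import Dict, List


def place_experts(call_counts: Dict[int, int], num_devices: int) -> List[List[int]]:
    placement: List[List[int]] = [[] for _ in range(num_devices)]
    device_load = [0] * num_devices
    # Place the hottest experts first, always onto the currently lightest device.
    for expert, load in sorted(call_counts.items(), key=lambda kv: kv[1], reverse=True):
        target = min(range(num_devices), key=lambda d: device_load[d])
        placement[target].append(expert)
        device_load[target] += load
    return placement


if __name__ == "__main__":
    counts = {0: 9000, 1: 300, 2: 8800, 3: 450, 4: 500, 5: 7000, 6: 600, 7: 650}
    for dev, experts in enumerate(place_experts(counts, num_devices=4)):
        print(f"device {dev}: experts {experts}")
```

The payoff is that no single device ends up hosting several hot experts at once, which is the main source of the latency and throughput penalties described above.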
Huawei releases OmniPlacement, enabling optimal dynamic deployment of experts in ultra-large-scale MoE models and boosting Ascend inference system throughput by 10%
雷峰网· 2025-05-20 13:01
Core Viewpoint
- The article discusses the challenges and advancements in Mixture of Experts (MoE) technology, particularly focusing on load balancing issues and the introduction of Huawei's OmniPlacement strategy to enhance inference performance [2][4][12]

Group 1: Challenges in MoE Models
- MoE models face significant challenges, particularly the "cold and hot expert" phenomenon, where some experts are frequently called (hot experts) while others are rarely used (cold experts), leading to uneven load distribution [2][4]
- This imbalance results in increased inference latency and limited throughput, as underutilized resources restrict overall system performance [3][14]

Group 2: OmniPlacement Strategy
- Huawei's OmniPlacement strategy addresses these challenges through expert reallocation, inter-layer redundancy deployment, and near-real-time dynamic scheduling, significantly improving MoE model inference performance [4][12]
- The strategy includes a joint optimization algorithm that reduces load imbalance by analyzing expert activation data and optimizing deployment order based on call frequency and computational needs [5][14]

Group 3: Key Features of OmniPlacement
- OmniPlacement employs inter-layer redundancy deployment to alleviate the pressure on hot experts by allocating additional redundant instances, thus enhancing system throughput (a redundancy-allocation sketch follows this summary) [5][12]
- The framework supports dynamic resource allocation based on real-time resource usage and expert call frequency, allowing for predictive resource distribution to minimize performance discrepancies between hot and cold experts [6][9]

Group 4: Testing and Results
- Comprehensive testing on the DeepSeek-V3 model demonstrated that OmniPlacement reduces average inference latency by approximately 10% compared to baseline methods, primarily due to dynamic expert allocation and communication domain optimization [12][14]
- The system's throughput improved by about 10%, reflecting a significant increase in resource utilization, especially in high-concurrency scenarios [14]

Group 5: Future Directions
- Future research will focus on developing smarter scheduling algorithms and adaptive expert selection mechanisms to further enhance the system's adaptability to complex inputs [15][16]
- The OmniPlacement framework aims to expand its functionality to support more types of MoE models, increasing its versatility and applicability in various industrial settings [16]
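Redundancy deployment for hot experts amounts to spending a fixed budget of extra replicas on the most frequently called experts so their traffic can be split across instances. The proportional rule, names, and budget below are assumptions used only to make the idea concrete; the actual OmniPlacement policy is not published in this article.

```python
# Sketch of redundancy allocation for hot experts: every expert gets one
# baseline instance, then extra replicas go one at a time to whichever expert
# currently has the highest load per replica. Illustrative assumption, not the
# real OmniPlacement deployment logic.
from typing import Dict


def allocate_replicas(call_counts: Dict[int, int], extra_budget: int) -> Dict[int, int]:
    replicas = {expert: 1 for expert in call_counts}          # baseline: one instance each
    for _ in range(extra_budget):
        # Hand the next replica to the expert with the highest load per instance.
        hottest = max(call_counts, key=lambda e: call_counts[e] / replicas[e])
        replicas[hottest] += 1
    return replicas


if __name__ == "__main__":
    counts = {0: 9000, 1: 300, 2: 8800, 3: 450, 4: 7000, 5: 600}
    print(allocate_replicas(counts, extra_budget=4))
    # Hot experts 0, 2 and 4 end up with extra instances; cold experts keep one.
```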
Huawei: get DeepSeek's "experts" moving, and inference latency drops by 10%!
量子位· 2025-05-20 05:12
Core Viewpoint
- The article discusses Huawei's approach to optimizing the performance of the Mixture of Experts (MoE) model through a technique called OmniPlacement, which addresses the load balancing issues between "hot" and "cold" experts, leading to significant improvements in inference latency and throughput

Group 1: MoE Model and Its Challenges
- The MoE model allocates tasks to specialized expert networks, enhancing overall system performance [2]
- Load balancing issues arise due to the uneven call frequency of expert networks, leading to performance limitations [3][5]
- The disparity in call frequency can exceed an order of magnitude, causing inference delays and inefficient resource utilization [4][5]

Group 2: Huawei's Solution - OmniPlacement
- Huawei's OmniPlacement technique aims to optimize the deployment of experts to improve MoE model performance [8]
- The approach involves three main steps: joint optimization based on computational balance, inter-layer redundant deployment of high-frequency experts, and near-real-time scheduling with dynamic monitoring [9][14][18]

Group 3: Key Features of OmniPlacement
- The OmniPlacement algorithm dynamically adjusts expert priorities and node allocations based on real-time statistics, reducing communication overhead [12]
- The inter-layer redundant deployment strategy assigns additional instances to frequently called experts, alleviating their load and enhancing system throughput [15]
- The near-real-time scheduling mechanism allows for dynamic resource allocation and predictive distribution based on historical data, improving system responsiveness (a monitoring-loop sketch follows this summary) [19][21]

Group 4: Performance Improvements
- The implementation of OmniPlacement in the DeepSeek-V3 system theoretically reduces inference latency by approximately 10% and increases throughput by about 10% [6][31]
- The system demonstrates high adaptability across various MoE model scales and input data distributions, ensuring efficient resource utilization and stable operation [25][26]
- The dynamic monitoring mechanism ensures rapid response to sudden load changes, maintaining system stability under high-demand scenarios [32]

Group 5: Open Source Initiative
- Huawei plans to open-source the OmniPlacement optimization method, promoting wider adoption and collaboration within the AI community [28]
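A near-real-time scheduling mechanism of the kind described can be imagined as a background monitor that samples per-expert call counters and triggers a re-placement whenever imbalance crosses a threshold. The class below is a minimal single-process sketch; the threshold, sampling interval, rebalance hook, and the lack of locking are all simplifying assumptions, not OmniPlacement internals.

```python
# Sketch of a near-real-time expert-load monitor: a background thread samples
# per-expert call counters and hands a snapshot to a rebalancing callback when
# the max/mean load ratio exceeds a threshold. A production system would need
# thread-safe counters and non-disruptive weight migration; this is only an
# illustration of the mechanism.
import threading
import time
from collections import Counter
from typing import Callable


class ExpertLoadMonitor:
    def __init__(self, rebalance: Callable[[Counter], None],
                 imbalance_threshold: float = 2.0, interval_s: float = 1.0):
        self.counts: Counter = Counter()
        self.rebalance = rebalance
        self.imbalance_threshold = imbalance_threshold
        self.interval_s = interval_s
        self._stop = threading.Event()

    def record(self, expert_id: int) -> None:
        self.counts[expert_id] += 1                 # called from the inference path

    def _loop(self) -> None:
        while not self._stop.is_set():
            time.sleep(self.interval_s)
            if not self.counts:
                continue
            loads = list(self.counts.values())
            imbalance = max(loads) / (sum(loads) / len(loads))
            if imbalance > self.imbalance_threshold:
                self.rebalance(self.counts.copy())  # hand a snapshot to the placement layer
            self.counts.clear()                     # start a fresh measurement window

    def start(self) -> None:
        threading.Thread(target=self._loop, daemon=True).start()

    def stop(self) -> None:
        self._stop.set()


if __name__ == "__main__":
    monitor = ExpertLoadMonitor(rebalance=lambda snapshot: print("rebalance:", snapshot))
    monitor.start()
    for _ in range(5000):
        monitor.record(0)                           # simulate a hot expert
    monitor.record(1)                               # and a cold one
    time.sleep(1.5)
    monitor.stop()
```

Keeping the monitoring on a separate thread is what lets scheduling decisions happen without blocking the inference path, which matches the "near-real-time" framing in the article.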
DeepSeek-R1 and Grok-3: Lessons from Two Technical Routes to AI Scaling
Counterpoint Research· 2025-04-09 13:01
Since February this year, DeepSeek has drawn global attention for its open-source flagship reasoning model DeepSeek-R1, whose performance is comparable to the world's frontier reasoning models. Its distinctive value lies not only in strong performance but also in the fact that it was trained on only about 2,000 NVIDIA H800 GPUs (the H800 is a cut-down, export-compliant substitute for the H100), an achievement that stands as a model of efficiency optimization.

A few days later, Elon Musk's xAI released Grok-3, its most advanced model to date, which performs slightly better than DeepSeek-R1, OpenAI's GPT-o1, and Google's Gemini 2. Unlike DeepSeek-R1, Grok-3 is closed source, and its training used a staggering roughly 200,000 H100 GPUs on xAI's "Colossus" supercomputer, marking a huge leap in compute scale.

[Image: xAI "Colossus" data center]

Grok-3 represents scaling without compromise: roughly 200,000 NVIDIA H100 cards in pursuit of frontier performance gains. DeepSeek-R1, by contrast, achieved comparable performance with only a fraction of the compute, showing that innovative architecture design and data curation can rival brute-force compute.

Efficiency is becoming a strategic trend rather than a constraint. DeepSeek's success has redefined the discussion around how AI should scale. We ...
Llama 4 released, reclaiming the open-source top spot! Coding ability on par with DeepSeek with half the parameters, runs on a single H100, plus a two-trillion-parameter extra-large version
量子位· 2025-04-06 02:33
Core Viewpoint
- Meta has launched the Llama 4 family of models, marking a significant advancement in multimodal AI capabilities, with Llama 4 Maverick achieving high scores across various benchmarks [3][4][8]

Group 1: Model Overview
- The Llama 4 family includes three models: Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth, with the first two already released and the latter still in training [3][4]
- Llama 4 Scout features 17 billion active parameters and a context window of 10 million tokens, while Llama 4 Maverick has 17 billion active parameters with 128 experts [5][19]
- Llama 4 Behemoth is a massive model with 2 trillion total parameters, currently under training, and is expected to outperform existing models like GPT-4.5 and Claude Sonnet 3.7 [5][54]

Group 2: Performance Metrics
- Llama 4 Maverick scored 1417 in the latest model ranking, surpassing previous models and becoming the top open-source model [8][9]
- The model outperformed Meta's previous Llama-3-405B by 149 points, marking a significant improvement [8]
- In various benchmarks, Llama 4 Scout demonstrated superior performance compared to competitors like Gemini 2.0 Flash-Lite and Mistral 3.1 [21][42]

Group 3: Multimodal Capabilities
- Llama 4 models are designed for native multimodal functionality, allowing users to upload images and ask questions about them directly [30][41]
- The models are touted as the best in their class for multimodal applications, enhancing user interaction and experience [41][42]

Group 4: Cost Efficiency
- Llama 4 Maverick offers competitive pricing, with inference costs significantly lower than other models like GPT-4, making it an attractive option for developers [46][49]
- The cost per million input and output tokens for Llama 4 Maverick ranges from $0.19 to $0.495, compared to $4.38 for GPT-4 [49]

Group 5: Training Innovations
- The Llama 4 series utilizes a novel MoE (Mixture of Experts) architecture, enhancing computational efficiency by activating only a subset of parameters during inference (a sparse-MoE sketch follows this summary) [56][60]
- The training process involved over 30 trillion tokens, more than double that of Llama 3, and included diverse data types such as text, images, and videos [63][64]
- A new training technique called MetaP was developed to optimize model hyperparameters, resulting in improved performance across various tasks [62][63]
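"Activating only a subset of parameters during inference" means each token runs through a shared expert plus a routed expert chosen by the router, so only a small slice of the total weights does work per token. The layer below is a minimal sketch of that pattern; the dimensions, top-1 routing, and module names are assumptions for illustration, not Meta's implementation of Llama 4.

```python
# Minimal sketch of a sparse MoE layer in the style described for Llama 4:
# every token goes through a shared expert plus one routed expert picked by a
# learned router, so only a fraction of the total parameters is active per
# token. Sizes and structure are illustrative assumptions.
import torch
import torch.nn as nn


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, num_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.shared_expert = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [num_tokens, d_model]
        gate = torch.softmax(self.router(x), dim=-1)      # routing probabilities
        weight, expert_id = gate.max(dim=-1)              # top-1 routed expert per token
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_id == e
            if mask.any():                                # only the chosen expert's weights run
                routed[mask] = weight[mask, None] * expert(x[mask])
        return self.shared_expert(x) + routed             # shared expert sees every token


if __name__ == "__main__":
    layer = SparseMoELayer()
    tokens = torch.randn(8, 512)
    print(layer(tokens).shape)  # torch.Size([8, 512])
```

This is why a model like Maverick can keep 17 billion active parameters per token while its total parameter count, spread across 128 experts, is far larger.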