Mixture of Experts (MoE)
Flash Attention Author's Latest Podcast: Nvidia's GPU Dominance Will End Within Three Years
量子位 · 2025-09-29 04:57
Group 1
- The core argument is that Nvidia's dominance in the GPU market will face increasing competition within the next 2-3 years as specialized chips for different workloads emerge, leading to a more diversified ecosystem [6][9][23]
- Tri Dao emphasizes that the architecture for AI models, particularly the Transformer, is stabilizing, but chip design and workload adaptation still face ongoing changes and challenges [11][12][21]
- The future of AI workloads will include three main types: traditional chatbots, ultra-low latency scenarios, and large-scale batch processing, each requiring tailored optimizations from hardware vendors [24][96]

Group 2
- The cost of inference has decreased by approximately 100 times since the launch of ChatGPT, driven by improvements in model efficiency and inference optimization techniques [73][75][90]
- Techniques such as model quantization and collaborative design between model architecture and hardware have contributed significantly to this cost reduction (see the sketch after this summary) [82][84][88]
- There is still an estimated potential for a further 10-fold improvement in inference optimization, particularly through specialized hardware and model advancements [90][93][95]

Group 3
- The AI hardware landscape is expected to diversify as companies like Cerebras, Groq, and SambaNova introduce solutions that emphasize low-latency inference and high throughput for various applications [23][24][96]
- The emergence of specialized AI inference providers will lead to different trade-offs, with some focusing on broad coverage while others aim for excellence in specific scenarios [96][97]
- The evolution of AI workloads will continue to drive demand for innovative solutions, particularly in real-time video generation and agentic applications that require seamless integration with human tools [115][117][120]
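As a concrete illustration of the quantization lever mentioned in Group 2, below is a minimal sketch of symmetric per-tensor int8 weight quantization. This is a generic textbook scheme, not the specific method any vendor in the podcast ships; the function names are ours.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: 1 byte per weight plus one fp32 scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
# ~4x memory saving vs fp32, at the cost of a small reconstruction error
print((dequantize_int8(q, s) - w).abs().max().item())
```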
DeepSeek Technical Analysis (3): The Evolution of MoE
自动驾驶之心 · 2025-07-06 08:44
Core Viewpoint
- The article traces DeepSeek's evolution within the Mixture-of-Experts (MoE) line of work, highlighting the innovations and improvements from DeepSeekMoE (V1) through DeepSeek V3 while staying on the MoE technology route throughout [1]

Summary by Sections

1. Development History of MoE
- MoE was first introduced in 1991 with the paper "Adaptive Mixtures of Local Experts," and its basic framework, a gating network routing inputs among expert networks, has remained consistent ever since; a minimal sketch of this framework follows the summary [2]
- Google has been a key driver of MoE's modern development, particularly with the release of "GShard" in 2020, which scaled models to 600 billion parameters [5]

2. DeepSeek's Work

2.1. DeepSeek-MoE (V1)
- DeepSeek V1 was released in January 2024, targeting two main issues: knowledge mixing within experts and redundancy among experts [15]
- Its architecture introduced fine-grained expert segmentation and shared expert isolation to enhance specialization and reduce redundancy (sketched below) [16]

2.2. DeepSeek V2 MoE Upgrade
- V2 introduced a device-limited routing mechanism to control communication costs by ensuring each token's activated experts span only a limited number of devices (sketched below) [28]
- A communication balance loss was added to address potential congestion at the receiving end of the communication [29]

2.3. DeepSeek V3 MoE Upgrade
- V3 kept the fine-grained and shared expert designs while upgrading the gating network from Softmax to Sigmoid to improve score differentiation among experts [36][38]
- The auxiliary loss for load balancing was dropped to remove its negative impact on the main objective, replaced by a dynamic per-expert bias that steers routing toward under-loaded experts (sketched below) [40]
- A sequence-wise auxiliary loss was introduced to balance token distribution among experts at the level of each individual sequence (sketched below) [42]

3. Summary of DeepSeek's Innovations
- Across versions, DeepSeek MoE has balanced general and specialized knowledge through shared and fine-grained experts, while addressing load balancing through a succession of auxiliary losses and, ultimately, bias-based adjustment [44]
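To ground section 1, here is a minimal sketch of the classic MoE framework the 1991 paper established: a learned gate scores experts per token, and each token is processed by its top-k experts. All names, dimensions, and the top-k convention are illustrative assumptions, not code from any DeepSeek release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Classic MoE: a softmax gate routes each token to its top-k expert FFNs."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)        # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # each token keeps its top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

The DeepSeek variants sketched next all modify pieces of this skeleton: the number and granularity of experts, the scoring function, and the selection rule.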
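Building on that skeleton, a hedged sketch of V1's two ideas from section 2.1: fine-grained segmentation (route over many smaller experts) and shared expert isolation (a few experts every token always passes through, capturing common knowledge). The class reuses MoELayer from the previous sketch; the expert counts and the quarter-size split ratio are invented for illustration.

```python
class DeepSeekMoEV1Style(nn.Module):
    """Sketch of V1-style MoE: always-active shared experts + fine-grained routed experts."""
    def __init__(self, d_model: int, d_ff: int, n_routed: int = 64,
                 n_shared: int = 2, top_k: int = 6):
        super().__init__()
        # Fine-grained segmentation: many small routed experts (quarter-size FFNs here)
        self.routed = MoELayer(d_model, d_ff // 4, n_routed, top_k)
        # Shared expert isolation: these handle common knowledge for every token
        self.shared = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff // 4), nn.GELU(),
                          nn.Linear(d_ff // 4, d_model))
            for _ in range(n_shared)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.routed(x)
        for expert in self.shared:
            out = out + expert(x)  # shared experts bypass the gate entirely
        return out
```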
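Section 2.2's device-limited routing can be pictured as a two-stage top-k: first keep the M best devices for each token, then select experts only from those devices. This single-process sketch illustrates just the selection logic; the real mechanism concerns multi-device dispatch, and all names here are assumptions.

```python
import torch

def device_limited_topk(scores: torch.Tensor, experts_per_device: int,
                        m_devices: int, top_k: int):
    """scores: (n_tokens, n_experts); experts assumed laid out contiguously per device."""
    n_tokens, n_experts = scores.shape
    n_devices = n_experts // experts_per_device
    # Stage 1: score each device by its best expert, keep the top-M devices per token
    dev_best = scores.view(n_tokens, n_devices, experts_per_device).max(dim=-1).values
    keep_dev = dev_best.topk(m_devices, dim=-1).indices          # (n_tokens, m_devices)
    # Stage 2: mask out experts on non-kept devices, then take the usual top-k
    mask = torch.full((n_tokens, n_devices, experts_per_device), float("-inf"))
    mask.scatter_(1, keep_dev.unsqueeze(-1).expand(-1, -1, experts_per_device), 0.0)
    masked = scores + mask.view(n_tokens, n_experts)
    # Assumes top_k <= m_devices * experts_per_device so no -inf is ever selected
    return masked.topk(top_k, dim=-1)  # (selected scores, expert indices)
```

Capping each token's experts to M devices bounds the all-to-all communication fan-out, which is the cost V2's mechanism targets.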
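Section 2.3's two routing changes can be sketched together: sigmoid scoring in place of softmax, and an auxiliary-loss-free balancer that nudges a non-learned per-expert bias up when an expert is under-loaded and down when over-loaded. The bias influences only which experts are selected, not the combination weights. The update rule and hyperparameters below are illustrative assumptions, not DeepSeek's exact recipe.

```python
import torch
import torch.nn as nn

class SigmoidGate(nn.Module):
    """Sketch of V3-style routing: sigmoid scoring + dynamic-bias load balancing."""
    def __init__(self, d_model: int, n_experts: int, top_k: int, bias_lr: float = 1e-3):
        super().__init__()
        self.w = nn.Linear(d_model, n_experts, bias=False)
        self.register_buffer("bias", torch.zeros(n_experts))  # updated manually, not by backprop
        self.top_k, self.bias_lr = top_k, bias_lr

    def forward(self, x: torch.Tensor):
        scores = torch.sigmoid(self.w(x))                      # (n_tokens, n_experts)
        # Selection uses the biased scores; combination weights use the raw scores
        _, idx = (scores + self.bias).topk(self.top_k, dim=-1)
        weights = scores.gather(-1, idx)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # normalize over the top-k
        if self.training:
            load = torch.zeros_like(self.bias)
            load.scatter_add_(0, idx.flatten(), torch.ones(idx.numel(), device=x.device))
            # Dynamic bias: boost under-loaded experts, penalize over-loaded ones
            self.bias += self.bias_lr * torch.sign(load.mean() - load)
        return weights, idx
```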
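Finally, the complementary sequence-wise auxiliary loss penalizes uneven expert usage within each individual sequence rather than across a whole batch. A sketch of that idea under our own scaling conventions; the exact normalization in the V3 report may differ.

```python
import torch

def sequence_balance_loss(scores: torch.Tensor, idx: torch.Tensor,
                          n_experts: int, alpha: float = 1e-4) -> torch.Tensor:
    """scores: (seq_len, n_experts) gate probabilities for one sequence;
    idx: (seq_len, top_k) selected expert indices for the same sequence."""
    seq_len, top_k = idx.shape
    # f_i: fraction of this sequence's routing slots that went to expert i (scaled)
    counts = torch.zeros(n_experts, device=scores.device)
    counts.scatter_add_(0, idx.flatten(), torch.ones(idx.numel(), device=scores.device))
    f = counts * n_experts / (seq_len * top_k)
    # p_i: mean gate probability this sequence assigns to expert i
    p = scores.mean(dim=0)
    # Minimized when routing is uniform across experts within the sequence
    return alpha * (f * p).sum()
```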