Multi-Head Latent Attention (MLA)
A Hardcore Teardown of Large Models: From DeepSeek-V3 to Kimi K2, a One-Article Guide to Mainstream LLM Architectures
机器之心· 2025-08-07 09:42
Core Viewpoint - The article reviews the evolution of large language models (LLMs) over the past seven years, noting that while model capabilities have improved markedly, the overall architecture has remained largely consistent. It asks whether there have been truly disruptive innovations or only incremental refinements within the existing framework [2][5].

Group 1: Architectural Innovations
- The article details eight mainstream LLMs, including DeepSeek and Kimi, analyzing their architectural designs and innovations [5].
- DeepSeek V3, released in December 2024, introduced key architectural techniques that improved computational efficiency and set it apart from other LLMs [10][9].
- Multi-head latent attention (MLA) is a memory-saving strategy that compresses the key and value tensors into a lower-dimensional latent space, significantly reducing KV-cache memory during inference (a minimal sketch follows this summary) [18][22].

Group 2: Mixture-of-Experts (MoE)
- The MoE layers in the DeepSeek architecture replace a single feedforward block with many parallel feedforward submodules, greatly increasing parameter capacity while keeping inference cost low through sparse activation [23][30].
- Each MoE module in DeepSeek V3 contains 256 experts, for a total parameter count of 671 billion, yet only 9 experts are activated per token during inference (see the routing sketch below) [30].

Group 3: OLMo 2 and Its Design Choices
- OLMo 2 is noted for its high transparency in training data and architecture, making it a useful reference for LLM development [32][34].
- OLMo 2 uses a distinctive normalization strategy, combining RMSNorm placement with QK-norm to improve training stability (see the normalization sketch below) [38][46].

Group 4: Gemma 3 and Sliding Window Attention
- Gemma 3 uses sliding window attention to cut the memory required for key-value (KV) caching, reflecting a shift toward local attention mechanisms (see the masking sketch below) [53][60].
- Gemma 3 also uses a dual normalization strategy that combines Pre-Norm and Post-Norm placements around its sub-blocks [62][68].

Group 5: Mistral Small 3.1 and Performance
- Mistral Small 3.1, released in March 2025, outperforms Gemma 3 on several benchmarks, which the article attributes to its custom tokenizer and smaller KV cache [73][75].
- Mistral Small 3.1 adopts a standard architecture, dropping the sliding window attention mechanism used in Gemma 3 [76].

Group 6: Llama 4 and MoE Adoption
- Llama 4 adopts an MoE architecture similar to DeepSeek V3's, with notable differences in how many experts are activated and in overall design [80][84].
- MoE architectures saw significant development and adoption in 2025, pointing to a trend toward larger yet more capable models [85].

Group 7: Kimi K2 and Its Innovations
- Kimi K2, with a parameter count of 1 trillion, is among the largest LLMs and uses a Muon optimizer variant for improved training performance [112][115].
- Kimi K2's architecture is based on DeepSeek V3 but scales up its design, illustrating the ongoing evolution of LLM architectures [115].
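To make the MLA idea concrete, here is a minimal, self-contained PyTorch sketch of the compression step. It illustrates the general technique described above, not DeepSeek V3's exact implementation (which adds RoPE-specific handling and a decoupled key path); all dimensions such as `d_model` and `d_latent` are invented for illustration.

```python
# Minimal sketch of multi-head latent attention (MLA) KV compression.
# Illustrative only: dimensions are invented and the RoPE/decoupled-key
# details of the real DeepSeek V3 design are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Down-project hidden states into a small latent; only this
        # (seq_len x d_latent) tensor needs to be cached at inference.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Up-project the cached latent back to full K and V when needed.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        b, s, _ = x.shape
        latent = self.kv_down(x)               # (b, s, d_latent) <- the KV cache
        q = self.q_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(o.transpose(1, 2).reshape(b, s, -1))

x = torch.randn(2, 16, 1024)
print(LatentKVAttention()(x).shape)            # torch.Size([2, 16, 1024])
```

The memory saving comes from caching `latent` (width `d_latent`) instead of the full per-head keys and values (width 2 x `d_model`).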
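The sparse activation behind the MoE layer can be sketched similarly. The structure below (one always-on shared expert plus top-k routed experts) mirrors the "9 experts per token" description above, but the expert counts are scaled down and the per-token loop is a deliberately naive stand-in for the batched dispatch and load balancing a real implementation uses.

```python
# Minimal sketch of a sparse MoE layer: one shared expert plus top-k
# routed experts per token. Sizes are scaled down for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.shared = make_ffn()                       # always active
        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                              # x: (n_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)     # (n_tokens, n_experts)
        w, idx = scores.topk(self.top_k, dim=-1)       # keep only top-k experts
        routed = []
        for t in range(x.size(0)):                     # naive per-token dispatch
            routed.append(sum(w[t, j] * self.experts[int(idx[t, j])](x[t])
                              for j in range(self.top_k)))
        return self.shared(x) + torch.stack(routed)

tokens = torch.randn(4, 64)
print(SparseMoE()(tokens).shape)                       # torch.Size([4, 64])
```

Only `top_k` of the `n_experts` feedforward blocks run per token, which is why total parameters can grow far faster than per-token inference cost.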
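For the QK-norm in OLMo 2's design, a minimal sketch: RMSNorm is applied to the query and key vectors before the attention dot product, which bounds the scale of the attention logits. This uses `torch.nn.RMSNorm` (PyTorch 2.4+); the head count and dimensions are made up.

```python
# Minimal sketch of QK-norm: RMSNorm applied to queries and keys
# before the attention dot product, as a training-stability aid.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_head = 64
q_norm, k_norm = nn.RMSNorm(d_head), nn.RMSNorm(d_head)

q = torch.randn(2, 8, 16, d_head)   # (batch, heads, seq, d_head)
k = torch.randn(2, 8, 16, d_head)
v = torch.randn(2, 8, 16, d_head)

# Normalizing q and k bounds the logit scale q.k / sqrt(d_head),
# which is the stability benefit QK-norm targets.
out = F.scaled_dot_product_attention(q_norm(q), k_norm(k), v, is_causal=True)
print(out.shape)                    # torch.Size([2, 8, 16, 64])
```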
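Sliding window attention reduces KV-cache memory because each query only needs the most recent W keys. A minimal sketch of the banded causal mask, with an invented window size:

```python
# Minimal sketch of a sliding-window (local) causal attention mask:
# each token attends only to itself and the previous W-1 tokens.
import torch

seq_len, window = 8, 4
i = torch.arange(seq_len).unsqueeze(1)          # query positions
j = torch.arange(seq_len).unsqueeze(0)          # key positions
mask = (j <= i) & (j > i - window)              # causal AND within window
print(mask.int())
```

Passed as a boolean `attn_mask` to `F.scaled_dot_product_attention`, this restricts each query to its local window, so the KV cache can be capped at W entries per layer instead of growing with the full sequence length.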
Li Auto's VLA Can Be Compared to DeepSeek's MoE
理想TOP2· 2025-06-08 04:24
Core Viewpoint - The article discusses advances in Li Auto's VLA (Vision-Language-Action) model and compares it with DeepSeek's MoE (Mixture of Experts), highlighting the distinctive approaches and improvements in model architecture and training.

Group 1: VLA and MoE Comparison
- Both VLA and MoE are previously proposed concepts that are now being fully realized in new domains, with significant innovation and positive results [2]
- DeepSeek's MoE improves on traditional designs by increasing the number of specialized experts and raising parameter utilization through Fine-Grained Expert Segmentation and Shared Expert Isolation [2]

Group 2: Key Technical Challenges for VLA
- VLA must address six critical technical points, including base-model design and training, 3D spatial understanding, and real-time inference [4]
- The VLA base model is designed around sparsity so that parameter capacity can grow without significantly increasing inference load [6]

Group 3: Model Training and Efficiency
- The training process incorporates a large amount of 3D data and driving-related information while reducing the proportion of historical data [7]
- The model is designed to emulate human thought processes, combining fast and slow reasoning to balance parameter scale against real-time performance [8]

Group 4: Diffusion and Trajectory Generation
- Diffusion is used to decode action tokens into driving trajectories, improving the model's ability to anticipate complex traffic scenarios [9]
- An ODE sampler accelerates the diffusion generation process, producing stable trajectories in just 2-3 steps (a few-step sampler sketch follows this summary) [11]

Group 5: Reinforcement Learning and Model Training
- The system aims to surpass human driving capabilities through reinforcement learning, addressing earlier limitations in training environments and information transfer [12]
- The model is now end-to-end trainable and can generate realistic 3D environments for training [12]

Group 6: Positioning Against Competitors
- The company is no longer seen as merely following Tesla in autonomous driving, a shift that began with the introduction of V12 [13]
- The VLM (Vision Language Model) consists of a fast system and a slow system: the fast system is comparable to Tesla's capabilities, while the slow system is a distinctive approach shaped by resource constraints [14]

Group 7: Evolution of VLM to VLA
- The progression from VLM to VLA is viewed as a natural evolution, indicating the company is innovating from its own insights rather than imitating competitors [15]
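The few-step ODE sampling mentioned above can be illustrated with a generic flow-matching-style sketch: a learned velocity field is integrated deterministically with a coarse Euler scheme, so only 2-3 network evaluations are needed. This is a generic illustration of the technique, not Li Auto's actual trajectory decoder; the velocity network, trajectory shape, and step count are all invented.

```python
# Generic sketch of few-step ODE sampling (flow-matching style):
# integrate a learned velocity field from noise to a trajectory in a
# handful of Euler steps. The tiny untrained MLP is a stand-in for a
# real, trained velocity network v_theta(x, t); everything is illustrative.
import torch
import torch.nn as nn

TRAJ_DIM = 2 * 10  # a hypothetical 10-waypoint (x, y) trajectory

velocity_net = nn.Sequential(
    nn.Linear(TRAJ_DIM + 1, 128), nn.SiLU(), nn.Linear(128, TRAJ_DIM))

@torch.no_grad()
def sample_trajectory(n_steps=3):
    x = torch.randn(1, TRAJ_DIM)               # start from pure noise
    for s in range(n_steps):
        t = torch.full((1, 1), s / n_steps)    # current ODE time in [0, 1)
        v = velocity_net(torch.cat([x, t], dim=-1))
        x = x + v / n_steps                    # one deterministic Euler step
    return x.view(10, 2)                       # 10 (x, y) waypoints

print(sample_trajectory())
```

Because the ODE is deterministic, a coarse integrator can remain stable at very few steps, which is what makes real-time trajectory generation feasible.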
Semiconductors: AI Computing Chips Are the "Engine of the AI Era"; Henan Province Steps Up Its Deployment
Zhongyuan Securities· 2025-03-20 09:00
Investment Rating - The report does not explicitly state an investment rating for the semiconductor industry

Core Insights
- AI computing chips are considered the "engine of the AI era," with global computing demand growing sharply on the back of the ChatGPT wave and accelerating AI model iteration [6][12]
- Global computing scale is expected to grow from 1,397 EFLOPS in 2023 to 16 ZFLOPS by 2030, with a compound annual growth rate (CAGR) of 50% over 2023-2030 (a quick CAGR check follows this summary) [25][28]
- The AI computing chip market is dominated by GPUs, with rapid growth anticipated in custom ASIC chips as AI computing demand rises [6][42]

Summary by Sections
1. AI Computing Chips as the "Engine of the AI Era"
- The ChatGPT wave has pushed global tech companies to accelerate AI model development, with major players such as Google, Meta, and Alibaba launching and iterating on AI models [6][12]
- AI servers are the core infrastructure supporting generative AI applications; the AI server market is projected to reach $125.1 billion in 2024 and $158.7 billion in 2025, with a CAGR of 15.5% from 2024 to 2028 [29]
2. Dominance of GPUs and Growth of the Custom ASIC Market
- AI computing chips are deployed in cloud, edge, and terminal applications, with GPUs currently the mainstream choice [6][42]
- NVIDIA holds a dominant position in the global GPU market, with over 95% share in AI server acceleration chips [42]
- The custom ASIC market is expected to grow rapidly, at a projected CAGR of 45% from 2023 to 2028, as cloud vendors seek diversified supply chains and stronger bargaining power [6][42]
3. DeepSeek's Role in Accelerating Domestic AI Computing Chip Development
- DeepSeek's technological innovations are expected to raise the efficiency of domestic AI computing chips, speeding their development and expanding their market share [6][7]
4. Development of the AI Computing Chip Industry in Henan Province
- Henan Province is building out its AI computing chip industry, establishing a core hub for computing resource scheduling and attracting upstream chip enterprises [9][10]
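As a quick arithmetic check on the growth figures (assuming 16 ZFLOPS = 16,000 EFLOPS and a seven-year 2023-2030 horizon), the implied CAGR can be computed directly. Note that these endpoints imply a rate somewhat below the report's headline 50%, which may refer to a narrower computing category.

```python
# CAGR sanity check for the quoted computing-scale forecast.
# Assumes 16 ZFLOPS = 16,000 EFLOPS and a 2023 -> 2030 (7-year) horizon.
start_eflops = 1_397      # global computing scale, 2023
end_eflops = 16_000       # forecast for 2030
years = 2030 - 2023

cagr = (end_eflops / start_eflops) ** (1 / years) - 1
print(f"implied CAGR: {cagr:.1%}")   # ~41.7% per year
```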