Large Model Architecture Innovation
Behind DeepSeek's Two Back-to-Back Papers Lies an Academic Relay
36Kr· 2026-01-16 01:28
Core Insights
- The article discusses the evolution of DeepSeek's research, focusing on its two recent papers, mHC and Conditional Memory, which build upon previous work by ByteSeed and others in AI and deep learning [1][2]

Group 1: mHC and Its Innovations
- mHC builds on the Hyper-Connections (HC) framework proposed by ByteSeed, significantly improving the stability and scalability of deep learning models [4][7]
- The core innovation of mHC lies in expanding the width of the residual stream and introducing dynamic hyper-connections, which enhances model capacity without increasing computational cost [4][6]
- mHC addresses the stability issues HC encountered during large-scale training by imposing manifold constraints and optimizing infrastructure, making it suitable for industrial training runs with trillions of parameters [7][8]

Group 2: Conditional Memory and N-gram Utilization
- The Conditional Memory paper introduces an "Engram" that lets models reference a large phrase dictionary, improving efficiency when answering straightforward questions [9][12]
- This approach contrasts with previous methods by showing that integrating N-gram lookups can free up computational resources for more complex reasoning tasks [10][13]
- DeepSeek's experiments indicate that allocating a portion of parameters to the Engram yields better performance than relying solely on Mixture of Experts (MoE) models, revealing a U-shaped scaling law [13][19]

Group 3: Collaborative Research and Community Impact
- The collaboration between DeepSeek and ByteSeed exemplifies the value of open research in advancing AI, showing how shared insights can lead to significant breakthroughs [19][20]
- The article also highlights other innovations from ByteSeed, such as UltraMem and Seed Diffusion Preview, which contribute to the ongoing evolution of deep learning architectures [20]
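The width-expansion idea behind HC and mHC can be sketched in miniature: instead of a single residual stream, the network keeps several parallel streams and mixes them with small learned weights around each layer. The pure-Python sketch below illustrates only that mixing pattern, not DeepSeek's or ByteSeed's actual implementation; all function names and the weight layout are assumptions for illustration.

```python
# Toy sketch of the hyper-connections idea (assumed simplification):
# keep n parallel residual streams and mix them with small weights
# around each layer, instead of one stream with a single skip path.

def layer(x):
    # Stand-in for a transformer block: here, just element-wise squaring.
    return [v * v for v in x]

def residual_step(h, f):
    # Classic residual connection: h + f(h).
    y = f(h)
    return [a + b for a, b in zip(h, y)]

def hyper_step(streams, f, mix_in, mix_out):
    # streams: n residual streams, each a vector (list of floats).
    # mix_in[i]: weight of stream i when forming the layer input.
    # mix_out[i]: [weight of layer output, then weight of each stream]
    #             when forming the new stream i.
    n, dim = len(streams), len(streams[0])
    x = [sum(mix_in[i] * streams[i][d] for i in range(n))
         for d in range(dim)]
    y = f(x)
    new_streams = []
    for i in range(n):
        out_w, *res_w = mix_out[i]
        new_streams.append([
            out_w * y[d] + sum(res_w[j] * streams[j][d] for j in range(n))
            for d in range(dim)
        ])
    return new_streams
```

With a single stream and identity mixing (`mix_in=[1.0]`, `mix_out=[[1.0, 1.0]]`), `hyper_step` reduces exactly to `residual_step`, which is the sense in which HC generalizes the residual connection.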
Behind DeepSeek's Two Back-to-Back Papers Lies an Academic Relay
机器之心· 2026-01-16 00:42
Core Insights
- The article discusses the evolution of deep learning architectures, focusing on advances by DeepSeek and ByteSeed in residual connections and knowledge retrieval mechanisms [1][4]

Group 1: Deep Learning Architecture Evolution
- The introduction of residual connections by He et al. in 2015 addressed information loss in deep neural networks and became a foundational element of deep learning [6][15]
- ByteSeed's Hyper-Connections (HC), introduced in 2024, significantly enriched network topology without increasing computational cost, marking a shift away from traditional residual connections [8][9]
- DeepSeek's mHC builds upon HC by addressing its scalability issues, improving stability and memory-access efficiency for large-scale training [11][12]

Group 2: Knowledge Retrieval Mechanisms
- DeepSeek's "Conditional Memory" paper proposes efficient knowledge retrieval via an "Engram" system that lets models reference a large phrase dictionary for common queries, saving computational resources [18][21]
- The research highlights the importance of parameter allocation between MoE (Mixture of Experts) and static storage modules, finding that allocating 20%-25% of parameters to the Engram yields better performance [22]
- DeepSeek integrates the Engram module into the model's intermediate layers, improving the efficiency of both storage access and deep computation [22][23]

Group 3: Collaborative Research Impact
- The collaboration and knowledge sharing between DeepSeek and ByteSeed exemplify the value of open research in advancing AI, as the two teams build on each other's findings [28][29]
- The article emphasizes continued exploration and innovation in foundational technologies, which may not yield immediate commercial applications but contributes to long-term industry progress [31]
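The Engram idea of consulting a static phrase table before spending full model compute can be sketched as a longest-suffix N-gram lookup with a fallback. This is a toy illustration under assumed behavior, not DeepSeek's actual mechanism; `NGRAM_TABLE`, `next_token`, and `expensive_model` are hypothetical names.

```python
# Toy sketch of conditional memory via N-gram lookup (assumed
# simplification): a large static phrase table answers "easy"
# continuations directly, reserving deep computation for misses.

NGRAM_TABLE = {
    ("the", "capital", "of", "france"): "is",
    ("machine", "learning", "is"): "a",
}

def expensive_model(context):
    # Stand-in for a full transformer forward pass.
    return "<computed:" + context[-1] + ">"

def next_token(context, n=4):
    # Try the longest stored N-gram suffix first, then fall back
    # to the expensive path only when no entry matches.
    for k in range(min(n, len(context)), 0, -1):
        key = tuple(context[-k:])
        if key in NGRAM_TABLE:
            return NGRAM_TABLE[key]   # cheap static lookup
    return expensive_model(context)   # full deep computation
```

The point of the sketch is the asymmetry: a table hit costs one dictionary lookup, so every query it absorbs frees the expensive path for genuinely hard continuations.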
Alibaba Pulled Off Something Big Late at Night: Costs Plummet 90%
36Kr· 2025-09-12 02:45
Core Insights
- Alibaba's Tongyi Lab has officially released its next-generation foundational model architecture Qwen3-Next, including the Qwen3-Next-80B-A3B-Base model with 80 billion total parameters, of which only 3 billion are activated [1][21]
- The new architecture is designed to raise performance while sharply reducing training costs, achieving over 10 times the inference throughput of the previous Qwen3-32B model in long-context scenarios [1][8][21]

Model Performance
- The instruction model of Qwen3-Next-80B-A3B performs comparably to the much larger Qwen3-235B-A22B-Instruct-2507, while the thinking model outperforms Google's closed-source Gemini-2.5-Flash-Thinking [2][12]
- Across benchmark tests, Qwen3-Next-80B-A3B-Base performs similarly to Qwen3-32B-Base, at a training cost below 10% of Qwen3-32B-Base's [6][21]

Architectural Innovations
- Qwen3-Next introduces several architectural innovations, including a hybrid attention mechanism, a high-sparsity MoE structure, and a multi-token prediction (MTP) mechanism, which together improve inference efficiency and model stability [5][16][19]
- The hybrid attention mechanism combines Gated DeltaNet and Gated Attention to improve long-sequence context modeling, and the MoE layers reach a 1:50 activation ratio, significantly reducing FLOPs per token [18][19]

Training Efficiency
- The model is trained on a 15-trillion-token subset of the Qwen3 36T pre-training corpus, requiring only 9.3% of the GPU resources of Qwen3-32B while delivering superior performance [16][21]
- The MTP mechanism optimizes multi-step inference and raises the acceptance rate of speculative decoding in practical applications [19]

Future Developments
- Alibaba plans to continue optimizing the Qwen3-Next architecture and is developing Qwen3.5, alongside launching models across different domains, thereby increasing its technical influence in the open-source community [21]
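The 1:50 activation ratio of a high-sparsity MoE layer comes from routing each token to only a few of many experts. Below is a minimal sketch of top-k routing and the resulting active-parameter fraction; the expert counts are hypothetical round numbers chosen to reproduce a 1:50 ratio, not Qwen3-Next's actual configuration.

```python
# Toy sketch of high-sparsity MoE routing (assumed simplification):
# each token activates only k of num_experts experts, so the active
# parameter count is a small fraction of the total, in the spirit of
# an 80B-total / 3B-active configuration.

def top_k_experts(scores, k):
    # Pick the k highest-scoring experts for one token.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def activation_ratio(num_experts, k):
    # Fraction of expert parameters touched per token.
    return k / num_experts

# Hypothetical numbers: 500 experts, 10 routed per token
# -> activation_ratio(500, 10) == 0.02, i.e. a 1:50 ratio.
```

Because FLOPs per token scale with the activated experts rather than the total, growing `num_experts` at a fixed `k` adds capacity at roughly constant inference cost, which is the trade the summary's "1:50 activation ratio" describes.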
Large Model Special Topic: Research Report on Large Model Architecture Innovation
Sohu Finance· 2025-06-06 11:38
Core Insights
- The report focuses on innovations in large model architectures, addressing the limitations of the Transformer architecture and surveying the industry's pathways for improvement [1][2][7]
- As model sizes grow, the Transformer's quadratic computational complexity (O(n²) in sequence length) creates severe power-consumption and efficiency bottlenecks when processing long sequences, driving demand for innovative solutions [1][2][15]
- The industry is exploring two main paths toward architectural breakthroughs: improvements to the Transformer architecture and exploration of non-Transformer architectures [1][2][7]

Transformer Architecture Improvements
- Improvements to the Transformer focus on optimizing the attention mechanism, the feed-forward network (FFN) layers, and the normalization layers [1][2][18]
- Techniques such as sparse attention and dynamic attention are being developed to raise computational efficiency, while Mixture of Experts (MoE) improves sparse-connection efficiency in FFN layers [1][2][18]
- LongRoPE and related techniques enhance positional encoding to better model long sequences [1][2][18]

Non-Transformer Architecture Exploration
- Non-Transformer architectures include new RNN variants (e.g., RWKV, Mamba) and CNN variants (e.g., Hyena Hierarchy), as well as other innovative architectures such as RetNet and LFM [1][2][7]
- RWKV optimizes state evolution through a generalized Delta Rule, while Mamba leverages state space models to improve training efficiency [1][2][7]
- RetNet combines a state-space formulation with multi-head attention to enable parallel computation [1][2][7]

Industry Trends and Future Directions
- The industry is trending toward hybrid architectures that combine linear-attention Transformers with non-Transformer designs, balancing performance and efficiency [2][7]
- The current phase pairs a peak in the traditional Transformer paradigm with an impending wave of architectural innovation, with significant focus on new RNN/CNN theoretical breakthroughs and practical engineering optimizations [2][7]
- Companies like ByteDance and Alibaba are accelerating investments in hybrid architectures, driving large models toward higher efficiency and lower energy consumption [2][7]
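The O(n²) bottleneck the report starts from falls out of a simple count: full self-attention scores every position against every other, while linear-attention and state-space alternatives do a fixed amount of work per position. A toy count of score computations, where `state_size` is an arbitrary illustrative constant rather than any real model's dimension:

```python
# Toy operation counts (assumed simplification): full self-attention
# grows quadratically in sequence length n, while linear-attention /
# state-space variants grow linearly.

def full_attention_scores(n):
    # One query-key score per ordered pair of positions: n * n.
    return n * n

def linear_attention_scores(n, state_size=64):
    # Recurrent/state-space style: fixed-size state update per position.
    return n * state_size

# Doubling the sequence length quadruples full-attention work but only
# doubles the linear-attention work.
```

This is why long-context scenarios, not raw parameter counts, drive the search for non-Transformer and hybrid architectures described above: at large n the quadratic term dominates both compute and power budgets.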