DeepSeek v4
Just Now: A "Memory" Module Open-Sourced Under Liang Wenfeng's Name, Adding Detail to DeepSeek V4
36Kr · 2026-01-13 00:42
Core Insights
- DeepSeek, in collaboration with Peking University, has released a new paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," introducing a module called Engram to improve the efficiency of large language models [1][3]

Group 1: Research Overview
- Sparsity in current large language models relies primarily on Mixture of Experts (MoE) for conditional computation, but the Transformer architecture lacks a native knowledge-retrieval mechanism [3][8]
- DeepSeek proposes conditional memory as a dimension of sparsity complementary to MoE, introducing the Engram module to enable knowledge retrieval in O(1) time [8][9]

Group 2: Engram Module Implementation
- The Engram module has been implemented and released on GitHub, enabling community engagement and further development [4][5]
- Engram separates static memory storage from dynamic computation within the Transformer architecture, improving overall model performance [10][12]

Group 3: Performance Metrics
- Engram shows significant benchmark improvements, including gains of +3.4 points on MMLU and +4.0 points on CMMLU, along with notable gains on general reasoning tasks [9][28]
- The architecture also strengthens long-context retrieval: Multi-Query NIAH accuracy rises from 84.2 to 97.0 [9]

Group 4: Experimental Results
- DeepSeek trained four models under identical conditions: Dense-4B (4.1 billion parameters), MoE-27B (26.7 billion), Engram-27B (26.7 billion), and Engram-40B (39.5 billion) [25][27]
- The sparse architectures (MoE-27B, Engram-27B/40B) outperformed the dense Dense-4B model across all benchmarks, demonstrating superior scalability [28][30]
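The O(1) retrieval idea summarized above can be pictured with a minimal sketch: hash the most recent n-gram of token ids into a fixed-size embedding table and read the row directly, with no search over the context. The table size, n-gram order, and polynomial hash below are illustrative assumptions for this digest, not DeepSeek's actual Engram implementation.

```python
import numpy as np

# Illustrative sketch of conditional memory via hashed n-gram lookup.
# All sizes and the hashing scheme are assumptions, not taken from the paper.
TABLE_SIZE = 2 ** 16   # number of memory slots
EMBED_DIM = 64         # width of each stored vector
NGRAM = 2              # look up bigrams of token ids

rng = np.random.default_rng(0)
memory_table = rng.standard_normal((TABLE_SIZE, EMBED_DIM))

def ngram_slot(token_ids):
    """Deterministically map the last NGRAM token ids to a table slot.

    The index depends only on the raw input tokens, so computing it is O(1)
    regardless of context length.
    """
    h = 0
    for t in token_ids[-NGRAM:]:
        h = (h * 1000003 + t) % TABLE_SIZE  # simple polynomial hash
    return h

def lookup(token_ids):
    """O(1) retrieval: one hash plus one row read."""
    return memory_table[ngram_slot(token_ids)]

vec = lookup([17, 4242])
assert vec.shape == (EMBED_DIM,)
# The same trailing n-gram always maps to the same slot: retrieval is
# deterministic, unlike attention over a growing context.
assert ngram_slot([9, 17, 4242]) == ngram_slot([17, 4242])
```

Because the slot index is a pure function of the input tokens, the lookup cost stays constant as the context grows, which is the "new axis of sparsity" contrast with MoE's learned routing over hidden states.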
Group 5: Memory and Computation Decoupling
- Engram's deterministic retrieval mechanism decouples parameter storage from computational resources, enabling efficient scaling without increasing compute cost [15][17]
- The architecture supports a multi-level cache hierarchy that optimizes memory access and reduces latency [18]

Group 6: U-Shaped Scaling Law
- DeepSeek identified a U-shaped scaling law for the optimal allocation between MoE and Engram, suggesting that a balanced distribution of sparse parameters improves performance [19][24]
- The optimal allocation assigns roughly 20%-25% of the sparse parameter budget to Engram, confirming the structural complementarity of the two modules [23][24]
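One way to picture the multi-level cache hierarchy described in Group 5 is a small, fast tier (standing in for device memory) in front of a large backing table (standing in for host memory). The sizes, the LRU eviction policy, and all names below are illustrative assumptions for this sketch, not Engram's actual design.

```python
from collections import OrderedDict
import numpy as np

# Illustrative two-level memory hierarchy for deterministic lookups.
# Sizes and the LRU policy are assumptions for demonstration only.
rng = np.random.default_rng(0)
HOST_TABLE = rng.standard_normal((4096, 32))  # slow "host memory" tier

class TwoLevelMemory:
    def __init__(self, capacity=256):
        self.cache = OrderedDict()  # slot -> vector, in LRU order
        self.capacity = capacity
        self.hits = self.misses = 0

    def fetch(self, slot):
        if slot in self.cache:
            self.cache.move_to_end(slot)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[slot] = HOST_TABLE[slot]  # pull from the slow tier
        return self.cache[slot]

mem = TwoLevelMemory()
# Skewed access patterns (frequent n-grams) make the fast tier pay off.
for slot in [1, 2, 3, 1, 2, 3, 1, 2]:
    mem.fetch(slot)
print(mem.hits, mem.misses)  # prints "5 3": repeat slots hit the fast tier
```

Because retrieval is deterministic, hot slots are stable across a workload, which is what makes a small fast tier effective at hiding the latency of the large table.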
Just Now: A "Memory" Module Open-Sourced Under Liang Wenfeng's Name, Adding Detail to DeepSeek V4
机器之心· 2026-01-13 00:12
Core Insights
- DeepSeek, in collaboration with Peking University, has published "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," a research paper on improving large language models (LLMs) through a novel treatment of memory and computation [1][2]

Group 1: Research Background and Problem Statement
- Current LLMs achieve sparsity mainly through Mixture of Experts (MoE), known as "conditional computation," but lack an inherent knowledge-retrieval mechanism and must simulate retrieval behavior inefficiently [2][8]
- DeepSeek proposes "conditional memory" as a complement to MoE, introducing the Engram module to address this limitation [3][8]

Group 2: Engram Module and Its Implementation
- The Engram module has been made available on GitHub, allowing for community engagement and further development [4]
- Engram modernizes classic n-gram embeddings to achieve knowledge retrieval in O(1) time, making memory access efficient [8][10]
- The module separates static knowledge storage from dynamic computation, refining the overall Transformer architecture [12][14]

Group 3: Performance and Efficiency
- DeepSeek scaled Engram to 27 billion parameters, demonstrating significant gains over a pure-MoE baseline under matched parameter and FLOPs budgets [10][37]
- Engram shows notable gains on knowledge-retrieval tasks, such as +3.4 on MMLU and +4.0 on CMMLU, along with stronger general reasoning [10][37]
- The architecture supports prefetching from host memory at runtime, enabling efficient memory access without additional performance overhead [11][18]
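The prefetching point in Group 3 follows from determinism: since each memory slot depends only on the token ids, not on hidden states, every row a sequence will need can be computed and gathered from host memory before the forward pass runs. The hash, table size, and n-gram order below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of why deterministic lookup enables host-memory prefetching:
# slot indices are a pure function of the raw token ids, so all needed
# rows can be gathered in one batch ahead of any neural computation.
# Sizes and the hash are assumptions for demonstration only.
TABLE_SIZE, DIM, NGRAM = 2 ** 12, 16, 2
rng = np.random.default_rng(0)
host_table = rng.standard_normal((TABLE_SIZE, DIM))

def slot(ids):
    """Map a short tuple of token ids to a table slot (polynomial hash)."""
    h = 0
    for t in ids:
        h = (h * 1000003 + t) % TABLE_SIZE
    return h

def prefetch(token_ids):
    """Compute every position's slot from raw tokens, then gather the rows.

    This runs before the model's forward pass, so the (slow) host-memory
    reads overlap with other work instead of stalling computation.
    """
    slots = [slot(token_ids[max(0, i - NGRAM + 1):i + 1])
             for i in range(len(token_ids))]
    return np.stack([host_table[s] for s in slots])

tokens = [5, 9, 9, 2, 5]
rows = prefetch(tokens)
assert rows.shape == (len(tokens), DIM)  # one prefetched row per position
```

By contrast, MoE routing depends on hidden states that only exist mid-forward-pass, so its expert weights cannot be fetched this far in advance.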
Group 4: Sparsity Distribution and Optimal Allocation
- DeepSeek formalized a U-shaped scaling rule characterizing the optimal trade-off between neural computation (MoE) and static memory (Engram) [9][22]
- Allocating roughly 20%-25% of the sparse parameter budget to Engram yields optimal performance, confirming the structural complementarity of the two modules [27][29]

Group 5: Experimental Results
- Four models were trained under identical conditions: Dense-4B, MoE-27B, Engram-27B, and Engram-40B [34][35]
- The sparse architectures consistently outperformed the dense model across benchmarks, with Engram-27B achieving significant improvements over MoE-27B on multiple tasks [37]
- Engram-40B further reduced pre-training loss and improved performance on most benchmarks, indicating that memory capacity has not yet saturated [38]

Group 6: Long Context Training
- Engram's structural advantages on long-context tasks were validated, with significant performance gains in global context retention [40][41]
- Controlled experiments show Engram outperforming MoE on complex retrieval tasks, pointing to an inherent architectural advantage [45]
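The 20%-25% allocation finding in Group 4 amounts to simple budget arithmetic over the sparse parameters. The budget value below is invented for illustration; only the 20%-25% band comes from the article, and the paper's actual U-shaped scaling-law formula is not reproduced here.

```python
# Toy arithmetic for splitting a fixed sparse-parameter budget between
# Engram memory tables and MoE experts. The budget value is an assumption;
# only the 20%-25% Engram share is taken from the article's summary.
SPARSE_BUDGET = 20_000_000_000

def split_budget(engram_fraction, budget=SPARSE_BUDGET):
    """Return (engram_params, moe_params) for a given Engram share."""
    if not 0.0 <= engram_fraction <= 1.0:
        raise ValueError("engram_fraction must lie in [0, 1]")
    engram = round(budget * engram_fraction)
    return engram, budget - engram

# The reported optimum is a band rather than a single point:
for frac in (0.20, 0.25):
    engram, moe = split_budget(frac)
    print(f"{frac:.0%} to Engram -> {engram:,} memory params, {moe:,} MoE params")
```

The U-shape means both extremes underperform: 0% Engram leaves the model simulating retrieval with compute, while too large a share starves the MoE side of expressive capacity.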