Engram Module
DeepSeek open-sources a large-model memory module; new paper with Liang Wenfeng as a co-author previews the next generation of sparse models
36Kr· 2026-01-13 07:14
DeepSeek is gathering momentum ahead of the holiday! Its latest paper adds "Conditional Memory" to the Transformer, filling in the knowledge-lookup mechanism the architecture natively lacks. The conclusion states it plainly: we regard conditional memory as an indispensable modeling primitive for the next generation of sparse models. Once again Liang Wenfeng is listed as an author, in collaboration with Zhao Dongyan's and Zhang Huishuai's teams at Peking University's Wangxuan Institute.

The paper not only proposes conditional memory as a brand-new paradigm but also delivers a concrete implementation, the Engram module. In experiments it lets a 27B-parameter model crush a pure MoE model of the same size, and it even improves reasoning ability indirectly: simple tasks that used to take the Transformer 6 layers of attention can be handled in 1-2 layers, and the freed-up capacity can be spent on harder reasoning.

The principle behind conditional memory is deliberately "primitive": instead of computing, go back to table lookup, using the classic N-gram approach. Give the model a huge table dedicated to fixed entity names and two- or three-word phrases; no matter how large the table grows, retrieval stays O(1). The key questions are how DeepSeek, reviving this pre-LLM-era technique, solves the traditional N-gram model's problems of storage explosion and polysemy, and how it ties the table into a modern Transformer.

Making attention do "grunt work" is a waste. The team's core observation is that language modeling actually mixes two tasks of completely different natures: one is compositional reasoning, which requires deep, dynamic computation, ...
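The excerpt above raises the storage-explosion and polysemy question without answering it. A common way to bound storage for N-gram lookups is to hash each N-gram into a fixed-size embedding table with several independent hash functions. The sketch below is a minimal illustration of that general technique in PyTorch, not DeepSeek's Engram code; the table size, hashing scheme, and averaging over hashes are all assumptions made for the example.

```python
# Minimal sketch of a hashed N-gram embedding lookup (illustrative only;
# table size, hashing scheme, and fusion are assumptions, not the paper's code).
import torch
import torch.nn as nn


class HashedNgramMemory(nn.Module):
    def __init__(self, table_size: int, dim: int, num_hashes: int = 2, max_n: int = 3):
        super().__init__()
        # One fixed-size table shared by all N-grams: lookup cost is O(1)
        # per N-gram no matter how many distinct N-grams occur in the data.
        self.table = nn.Embedding(table_size, dim)
        self.table_size = table_size
        self.num_hashes = num_hashes
        self.max_n = max_n

    def _bucket(self, ngram: tuple, seed: int) -> int:
        # Deterministic hash of the token-id tuple; different seeds give
        # independent buckets so collisions can be averaged out.
        return hash((seed,) + ngram) % self.table_size

    def forward(self, token_ids: list) -> torch.Tensor:
        # For each position, gather embeddings of the N-grams ending there
        # (n = 2..max_n) and average over hash functions to soften collisions.
        outputs = []
        for i in range(len(token_ids)):
            vecs = []
            for n in range(2, self.max_n + 1):
                if i - n + 1 < 0:
                    continue
                ngram = tuple(token_ids[i - n + 1 : i + 1])
                idx = torch.tensor([self._bucket(ngram, s) for s in range(self.num_hashes)])
                vecs.append(self.table(idx).mean(dim=0))
            outputs.append(torch.stack(vecs).mean(dim=0) if vecs
                           else torch.zeros(self.table.embedding_dim))
        return torch.stack(outputs)  # [seq_len, dim]


mem = HashedNgramMemory(table_size=1_000_000, dim=64)
print(mem([17, 42, 42, 7]).shape)  # torch.Size([4, 64])
```

The fixed table bounds storage regardless of vocabulary growth, and the multiple hashes trade a little extra lookup work for fewer damaging collisions; how Engram actually resolves polysemy is left to the paper itself.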
Just now: a "memory" module open-sourced with Liang Wenfeng's name on the paper, and DeepSeek V4 comes into sharper focus
程序员的那些事· 2026-01-13 00:56
Just over ten hours ago, DeepSeek released a new paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," completed in collaboration with Peking University and once again listing Liang Wenfeng among the authors. Reposted from: 机器之心. Paper: https://github.com/deepseek-ai/Engram/blob/main/Engram_paper.pdf

A quick summary. The problem this research targets: today's large language models achieve sparsity mainly through Mixture of Experts (MoE), an approach known as "conditional computation." However, existing Transformers lack a native knowledge-lookup mechanism and are forced to simulate retrieval inefficiently through computation. In response, DeepSeek proposes conditional memory to complement MoE's conditional computation, implemented through a new module called Engram.

The Engram implementation has already been uploaded to GitHub. Repository: https://github.com/deepseek-ai/Engram

This has netizens marveling: "DeepSeek ...
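To make the "conditional computation vs. conditional memory" distinction concrete, here is a toy, hedged contrast: a router that picks one expert per token (computation that depends on the input) versus a keyed table lookup that returns a stored vector with no extra matrix multiplies. The expert, router, and table shapes are invented for illustration and are not taken from the Engram repository.

```python
# Toy contrast between the two axes of sparsity named above (illustrative only):
# conditional computation routes each token through a chosen expert,
# while conditional memory retrieves a stored vector by key.
import torch
import torch.nn as nn

dim, n_experts, table_size = 16, 4, 1024

# Conditional computation (MoE-style): a router picks which expert FFN runs.
experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
router = nn.Linear(dim, n_experts)

def conditional_compute(x: torch.Tensor) -> torch.Tensor:
    expert_id = router(x).argmax(dim=-1)              # choose one expert per token
    return torch.stack([experts[int(e)](t) for t, e in zip(x, expert_id)])

# Conditional memory (Engram-style idea): a deterministic key indexes a table.
memory = nn.Embedding(table_size, dim)

def conditional_memory(keys: torch.Tensor) -> torch.Tensor:
    return memory(keys % table_size)                  # O(1) lookup, no FFN compute

x = torch.randn(3, dim)
keys = torch.tensor([7, 7, 911])
print(conditional_compute(x).shape, conditional_memory(keys).shape)
# torch.Size([3, 16]) torch.Size([3, 16])
```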
Just now: a "memory" module open-sourced with Liang Wenfeng's name on the paper, and DeepSeek V4 comes into sharper focus
36Kr· 2026-01-13 00:42
Core Insights
- DeepSeek has released a new paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," in collaboration with Peking University, introducing a new module called Engram to enhance the efficiency of large language models [1][3].

Group 1: Research Overview
- The current approach to sparsity in large language models primarily relies on Mixture of Experts (MoE) for conditional computation, but existing Transformer architectures lack a native knowledge retrieval mechanism [3][8].
- DeepSeek proposes conditional memory as a complementary dimension to MoE, introducing the Engram module to facilitate efficient knowledge retrieval with O(1) time complexity [8][9].

Group 2: Engram Module Implementation
- The Engram module has been implemented and made available on GitHub, allowing for community engagement and further development [4][5].
- Engram separates static memory storage from dynamic computation processes within the Transformer architecture, enhancing overall model performance [10][12].

Group 3: Performance Metrics
- Engram has shown significant improvements in various benchmarks, including a +3.4% increase in MMLU accuracy and a +4.0% increase in CMMLU accuracy, as well as notable gains in general reasoning tasks [9][28].
- The architecture allows for better long-context retrieval capabilities, with accuracy in Multi-Query NIAH increasing from 84.2 to 97.0 [9].

Group 4: Experimental Results
- DeepSeek trained four models: Dense-4B (4.1 billion parameters), MoE-27B (26.7 billion), Engram-27B (26.7 billion), and Engram-40B (39.5 billion), all under the same training conditions [25][27].
- The sparse architectures (MoE-27B, Engram-27B/40B) outperformed the dense model (Dense-4B) across all benchmarks, demonstrating superior scalability [28][30].

Group 5: Memory and Computation Decoupling
- Engram's deterministic retrieval mechanism allows for the decoupling of parameter storage from computational resources, enabling efficient scaling without increasing computational costs [15][17].
- The architecture supports a multi-level cache hierarchy, optimizing memory access and reducing latency [18].

Group 6: U-Shaped Scaling Law
- DeepSeek identified a U-shaped scaling law for optimal allocation between MoE and Engram, suggesting that a balanced distribution of sparse parameters leads to improved performance [19][24].
- The optimal allocation ratio was found to be around 20%-25% of the sparse parameter budget for Engram, confirming the structural complementarity between the two modules [23][24].
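Group 5's point about decoupling parameter storage from compute, together with the multi-level cache hierarchy, suggests the familiar pattern of keeping a very large table in host RAM and moving only the rows a batch needs onto the GPU. The sketch below shows that generic pattern with PyTorch pinned memory and asynchronous copies; it is an assumption about how such a hierarchy could look, not the Engram implementation, and the table sizes are placeholders.

```python
# Sketch of keeping a large embedding table in pinned host RAM and prefetching
# only the rows needed for the next step onto the GPU. Illustrative only; the
# actual Engram cache hierarchy is not specified by this summary.
import torch

DIM = 64
TABLE_ROWS = 1_000_000  # stand-in for a table far too large to keep on-GPU

# Table lives in CPU memory; pinning enables asynchronous host-to-device copies.
cpu_table = torch.randn(TABLE_ROWS, DIM, pin_memory=torch.cuda.is_available())

def prefetch_rows(row_ids: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Gather the requested rows on the CPU, then copy them to the GPU
    without blocking the host thread (the copy can overlap GPU compute)."""
    rows = cpu_table.index_select(0, row_ids)          # cheap CPU gather
    if torch.cuda.is_available():
        return rows.to(device, non_blocking=True)      # async H2D copy
    return rows                                        # CPU fallback

# Because lookup indices are a deterministic function of the input tokens,
# they can be computed ahead of the forward pass and prefetched early.
ids = torch.tensor([3, 141, 592_653])
emb = prefetch_rows(ids)
print(emb.shape)  # torch.Size([3, 64])
```

The key property exploited here is determinism: unlike attention, whose access pattern depends on intermediate activations, a table lookup's addresses are known from the tokens alone, which is what makes this kind of prefetching possible at all.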
DeepSeek open-sources a large-model memory module! New paper with Liang Wenfeng as a co-author previews the next generation of sparse models
量子位· 2026-01-13 00:39
Core Insights
- The article discusses the introduction of "Conditional Memory" in Transformer models, which adds the knowledge retrieval mechanism that was previously lacking in the original architecture [1][2][9].

Group 1: Introduction of Conditional Memory
- Conditional Memory is viewed as an essential modeling primitive for the next generation of sparse models [2].
- The research team, with Liang Wenfeng as a co-author and in collaboration with Peking University, has proposed a new paradigm and a concrete implementation called the Engram module [3][5].

Group 2: Performance Improvements
- The Engram module allows a 27B parameter model to outperform a pure MoE model of the same size, compressing tasks that originally required 6 layers of attention down to 1-2 layers, thus freeing resources for more complex reasoning tasks [5][13].
- The optimal allocation of sparse parameters between MoE and Engram memory follows a U-shaped curve, with validation loss minimized when about 20% to 25% of the sparse parameters are allocated to Engram memory [34][36].

Group 3: Technical Implementation
- Engram's design incorporates a large vocabulary for static entities and phrases, enabling O(1) speed for information retrieval [7][14].
- The team addresses traditional N-gram model issues, such as semantic redundancy and storage explosion, by compressing tokens and using multiple hash functions to map N-grams to a fixed-size embedding table [22][25].

Group 4: Experimental Results
- The Engram-27B model shows significant improvements across various benchmarks, with notable gains in metrics such as BBH, ARC-Challenge, and DROP [47].
- The model's architecture allows for efficient memory management, enabling the use of a 100 billion parameter table offloaded to CPU memory without significant latency impact during inference [63][66].

Group 5: Future Developments
- The next generation of sparse models from DeepSeek is expected to be released before the Spring Festival, indicating ongoing advancements in AI model architecture [67].
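Two of the numbers above are easy to sanity-check with back-of-the-envelope arithmetic: the reported 20%-25% Engram share of the sparse parameter budget, and the host-memory footprint of a 100-billion-parameter offloaded table. The 25B total budget and bf16 storage used below are assumptions chosen only to make the arithmetic concrete.

```python
# Rough arithmetic for the two headline numbers above: the ~20-25% U-shaped
# optimum for Engram's share of the sparse budget, and the host-RAM footprint
# of a 100B-parameter offloaded table. Assumes bf16 (2 bytes per parameter).
def split_sparse_budget(total_sparse_params: float, engram_fraction: float = 0.225):
    engram = total_sparse_params * engram_fraction
    return engram, total_sparse_params - engram

def table_footprint_gb(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1e9

engram, moe = split_sparse_budget(25e9)   # hypothetical 25B sparse budget
print(f"Engram ≈ {engram/1e9:.1f}B params, MoE ≈ {moe/1e9:.1f}B params")
print(f"100B-param table ≈ {table_footprint_gb(100e9):.0f} GB of host memory")
# Engram ≈ 5.6B params, MoE ≈ 19.4B params
# 100B-param table ≈ 200 GB of host memory
```

The footprint estimate also makes the offloading choice intuitive: a table of that size is comfortable in server RAM but would dominate or exceed the memory of a single accelerator.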
Just now: a "memory" module open-sourced with Liang Wenfeng's name on the paper, and DeepSeek V4 comes into sharper focus
机器之心· 2026-01-13 00:12
Core Insights
- DeepSeek has introduced a new research paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," in collaboration with Peking University, focusing on enhancing large language models (LLMs) through a novel approach to memory and computation [1][2].

Group 1: Research Background and Problem Statement
- Current large language models primarily utilize Mixture of Experts (MoE) for sparsity, known as "conditional computation," but lack an inherent knowledge retrieval mechanism, leading to inefficient simulation of retrieval behavior [2][8].
- DeepSeek proposes "conditional memory" as a complementary approach to MoE, introducing a new module called Engram to address this limitation [3][8].

Group 2: Engram Module and Its Implementation
- The Engram module has been made available on GitHub, allowing for community engagement and further development [4].
- Engram modernizes classic n-gram embeddings to achieve knowledge retrieval in O(1) time complexity, enhancing the efficiency of memory access [8][10].
- The module separates static knowledge storage from dynamic computation processes, enhancing the overall architecture of the Transformer network [12][14].

Group 3: Performance and Efficiency
- DeepSeek has scaled Engram to 27 billion parameters, demonstrating significant performance improvements over pure MoE baseline models under equivalent parameter and FLOPs conditions [10][37].
- Engram has shown notable gains in knowledge retrieval tasks, with improvements such as +3.4 in MMLU and +4.0 in CMMLU, as well as enhanced general reasoning capabilities [10][37].
- The architecture allows for efficient memory access without additional performance overhead, supporting prefetching from host memory during runtime [11][18].

Group 4: Sparsity Distribution and Optimal Allocation
- DeepSeek formalized a U-shaped scaling rule to characterize the optimal trade-off between neural computation (MoE) and static memory (Engram) [9][22].
- The research indicates that allocating approximately 20%-25% of the sparse parameter budget to Engram yields optimal performance, confirming the structural complementarity between the two modules [27][29].

Group 5: Experimental Results
- Four models were trained: Dense-4B, MoE-27B, Engram-27B, and Engram-40B, all under identical training conditions [34][35].
- Sparse architectures consistently outperformed the dense model across various benchmarks, with Engram-27B achieving significant improvements over MoE-27B in multiple tasks [37].
- Engram-40B further reduced pre-training loss and improved performance on most benchmarks, indicating that memory capacity has not yet reached saturation [38].

Group 6: Long Context Training
- Engram's architecture has been validated for its structural advantages in long-context tasks, demonstrating significant performance gains in global context retention [40][41].
- Controlled experiments revealed that Engram outperforms MoE in complex retrieval tasks, showcasing its inherent architectural advantage [45].
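The summaries describe Engram as separating static knowledge storage from the Transformer's dynamic computation, but they do not spell out how a retrieved vector re-enters the network. One plausible, purely assumed wiring is a learned gate that decides how much of the looked-up memory to add to the hidden state; the sketch below illustrates that idea and is not the paper's integration scheme.

```python
# Hypothetical sketch of fusing a retrieved memory vector into a Transformer
# hidden state with a learned gate. The fusion rule is an assumption for
# illustration; the paper's actual integration may differ.
import torch
import torch.nn as nn


class GatedMemoryInjection(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)          # project the retrieved memory
        self.gate = nn.Linear(2 * dim, dim)      # decide how much memory to admit

    def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # hidden, memory: [batch, seq, dim]
        m = self.proj(memory)
        g = torch.sigmoid(self.gate(torch.cat([hidden, m], dim=-1)))
        # Gate near 0: ignore the lookup; near 1: let retrieved knowledge through.
        return hidden + g * m


layer = GatedMemoryInjection(dim=64)
h = torch.randn(2, 8, 64)
mem = torch.randn(2, 8, 64)
print(layer(h, mem).shape)  # torch.Size([2, 8, 64])
```

A gate like this would let shallow layers pass memorized facts straight through while leaving the attention stack free for the compositional reasoning the articles say it should be reserved for, which is the intuition behind the reported 6-layer-to-1-2-layer compression.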