Conditional Memory
New DeepSeek Paper Co-Signed by Liang Wenfeng Targets Large Models' "Memory" Weakness
Bei Ke Cai Jing· 2026-01-13 04:41
Core Insights
- The paper published by DeepSeek addresses the memory limitations of current large language models and introduces the concept of "conditional memory" [2]
- DeepSeek proposes a module named Engram, which splits language modeling into two branches: "static pattern retrieval" for quick access to deterministic knowledge and "dynamic combinatorial reasoning" for complex logical operations [2]
- The paper argues that conditional memory is an essential modeling primitive for the next generation of sparse models; there is speculation that DeepSeek's next model may be released before the Spring Festival [3]

Group 1
- The paper, titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," was co-authored by Peking University and DeepSeek [1]
- The introduction of "conditional memory" aims to enhance the memory capabilities of large language models [2]
- The Engram module is designed to improve efficiency in language modeling by separating tasks into static and dynamic components [2]

Group 2
- The paper emphasizes the importance of conditional memory for future sparse model development [3]
- There is speculation that DeepSeek's next-generation model will be released around the Spring Festival, potentially replicating the success of previous launches [3]
DeepSeek Releases New Paper Co-Signed by Liang Wenfeng
证券时报· 2026-01-13 03:27
Core Viewpoint
- DeepSeek released a new paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," which introduces conditional memory to enhance model performance across a range of tasks at equal parameter counts and compute [1]

Group 1
- The paper was co-authored by Peking University and DeepSeek, with Liang Wenfeng listed as a co-author [1]
- Conditional memory is proposed to significantly improve model performance on knowledge retrieval, reasoning, coding, and mathematical tasks [1]
- DeepSeek has open-sourced a related memory module called Engram [1]
DeepSeek Releases New Paper Co-Signed by Liang Wenfeng
Zheng Quan Shi Bao· 2026-01-13 03:02
Core Insights
- DeepSeek released a new paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models" on the evening of the 12th [1]
- The paper was co-authored by Peking University and DeepSeek, with Liang Wenfeng listed as a co-author [1]
- The concept of conditional memory is introduced, significantly enhancing model performance on knowledge retrieval, reasoning, coding, and mathematical tasks at equal parameter counts and compute [1]
- DeepSeek has also open-sourced a related memory module named Engram [1]

Company and Industry Summary
- The collaboration between DeepSeek and Peking University highlights the growing trend of academia-industry partnerships in advancing AI [1]
- The introduction of scalable lookup structures in large language models is a notable innovation that could improve the efficiency and effectiveness of AI applications [1]
- Open-sourcing the Engram memory module may encourage further research and development in conditional memory systems, fostering a more collaborative environment for AI advancement [1]
New Paper Co-Signed by Liang Wenfeng: First Glimpse of the DeepSeek V4 Architecture? Tackling a Fatal Flaw of the Transformer
36Kr· 2026-01-13 01:24
Core Insights
- DeepSeek's new paper introduces a novel approach to the memory limitations of Transformer models, proposing a complementary "conditional memory" sparse axis via the Engram module, which enables efficient knowledge retrieval with near-O(1) complexity [1][6][11]

Group 1: Memory and Model Architecture
- The paper notes that while MoE (Mixture of Experts) has become a mainstream architecture for large models, it still fundamentally relies on Transformers, which lack a native knowledge retrieval mechanism, leading to inefficient computation [9][11]
- Engram is designed to offload static, repetitive patterns in language modeling to a scalable lookup module, letting the Transformer backbone focus on tasks that require composition and reasoning [11][15]
- The authors divide language modeling into two kinds of work, composition-and-reasoning versus pattern retrieval, and argue that the latter deserves a dedicated mechanism [12][13]

Group 2: Engram Architecture and Functionality
- Engram is conceptualized as a modernized version of the classic hashed N-gram, functioning as a scalable lookup module integrated within the Transformer architecture [18][20]
- The architecture processes input sequences in two stages, retrieval followed by fusion, improving efficiency on static patterns [20][21]
- A context-aware gating mechanism lets the model dynamically adjust its use of retrieved embeddings, improving expressiveness and suppressing noise from hash collisions [25][27]

Group 3: Performance and Scaling
- The paper reports a U-shaped scaling law: an optimal resource allocation between MoE and Engram improves model performance, suggesting that balancing dynamic computation against static memory is crucial [3][33]
- Experimental results show that Engram, scaled to 27 billion parameters, outperforms the MoE baseline under equivalent parameter and FLOPs budgets across a range of benchmarks [5][38]
- Engram improves not only knowledge retrieval but also reasoning, mathematics, and coding capabilities, a significant gain across multiple tasks [39][48]

Group 4: Future Implications
- The findings point to a shift toward a dual-axis architecture of computation plus memory, with potential integration into future models such as V4 [46][50]
- The paper argues that integrating Engram could substantially improve model efficiency and capability, paving the way for more advanced natural-language applications [51][52]
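The retrieval-then-fusion step with context-aware gating summarized above can be sketched in plain Python. The function name, the vector shapes, and the single-vector sigmoid gate are illustrative assumptions for this digest, not DeepSeek's released implementation:

```python
import math

def context_gated_fusion(hidden, retrieved, gate_weights):
    # Score the current hidden state against gate weights (assumed here to
    # be a single learned vector for simplicity), then squash to (0, 1).
    score = sum(h * w for h, w in zip(hidden, gate_weights))
    gate = 1.0 / (1.0 + math.exp(-score))
    # Fuse: add the retrieved memory embedding to the hidden state, scaled
    # by the gate so that noisy hash-collision lookups can be damped.
    return [h + gate * r for h, r in zip(hidden, retrieved)]
```

When the gate score is strongly negative, the retrieved vector is almost entirely suppressed, which is the mechanism the summary credits with reducing hash-collision noise.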
Just In: Liang Wenfeng Co-Signs an Open-Source "Memory" Module, More Details on DeepSeek V4
程序员的那些事· 2026-01-13 00:56
Core Insights
- DeepSeek has introduced a new research paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," in collaboration with Peking University, focusing on enhancing large language models (LLMs) through conditional memory and a new module called Engram [1][3][4]

Group 1: Research Background and Problem Statement
- Current large language models rely mainly on Mixture of Experts (MoE) for sparsity, but existing Transformer architectures lack a native knowledge retrieval mechanism and must simulate retrieval behavior inefficiently [3][9]
- DeepSeek proposes conditional memory as a complement to MoE, introducing the Engram module to address these limitations [4][9]

Group 2: Engram Module and Its Functionality
- The Engram module modernizes classic N-gram embeddings, enabling knowledge retrieval with O(1) time complexity [9]
- Engram separates static knowledge storage from dynamic computation, offloading reconstruction work from the model's shallow layers and leaving more capacity for complex reasoning [11][13]

Group 3: Performance Improvements
- Scaled to 27 billion parameters, Engram shows significant improvements over a pure MoE baseline under equivalent parameter and FLOPs budgets [11]
- Engram notably strengthens knowledge retrieval, with gains such as MMLU (+3.4) and CMMLU (+4.0), and improves general reasoning tasks such as BBH (+5.0) and ARC-Challenge (+3.7) [11][38]

Group 4: System Efficiency and Scalability
- Engram's deterministic addressing supports prefetching from host memory at runtime with minimal overhead, enabling efficient memory management [12][19]
- The architecture decouples parameter storage from compute, allowing capacity to scale linearly with the number of accelerators [21][22]

Group 5: Experimental Results
- Four models were trained: Dense-4B, MoE-27B, Engram-27B, and Engram-40B, all on the same training data with the same process [35][36]
- The sparse architectures (MoE-27B, Engram-27B/40B) significantly outperformed the dense model (Dense-4B) across benchmarks, demonstrating superior scaling properties [38]

Group 6: Long-Context Training
- The Engram architecture shows clear advantages on long-context tasks by preserving attention capacity for global context processing [41]
- Controlled experiments indicate that Engram outperforms MoE models on complex retrieval tasks, supporting its architectural advantage [46]
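The O(1) hashed-lookup and deterministic-addressing behavior described above can be illustrated with a toy sketch. The table size, hash choice, and embedding values here are assumptions for illustration; the Engram code DeepSeek released is the authoritative reference:

```python
import hashlib

TABLE_SIZE = 2 ** 12   # fixed number of hash buckets (toy scale)
EMBED_DIM = 8          # toy embedding width

# Toy embedding table, deterministically initialised.
table = [[((b * 31 + d) % 97) / 97.0 for d in range(EMBED_DIM)]
         for b in range(TABLE_SIZE)]

def bucket(ngram):
    # Deterministic addressing: the same n-gram always hashes to the same
    # bucket, which is what makes ahead-of-time prefetching possible.
    digest = hashlib.blake2b("\u0001".join(ngram).encode(),
                             digest_size=8).digest()
    return int.from_bytes(digest, "little") % TABLE_SIZE

def lookup(ngram):
    # One hash plus one indexed read: O(1) regardless of table size.
    return table[bucket(ngram)]
```

Because addresses depend only on the input tokens, the buckets needed for upcoming tokens can be computed early and prefetched from host memory, which is the property the summary above attributes to Engram.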
Just In: Liang Wenfeng Co-Signs an Open-Source "Memory" Module, More Details on DeepSeek V4
36Kr· 2026-01-13 00:42
Core Insights
- DeepSeek has released a new paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," in collaboration with Peking University, introducing a new module called Engram to improve the efficiency of large language models [1][3]

Group 1: Research Overview
- Sparsity in current large language models relies mainly on Mixture of Experts (MoE) for conditional computation, but existing Transformer architectures lack a native knowledge retrieval mechanism [3][8]
- DeepSeek proposes conditional memory as a dimension complementary to MoE, with the Engram module providing knowledge retrieval in O(1) time [8][9]

Group 2: Engram Module Implementation
- The Engram module has been implemented and released on GitHub, enabling community engagement and further development [4][5]
- Engram separates static memory storage from dynamic computation within the Transformer architecture, improving overall model performance [10][12]

Group 3: Performance Metrics
- Engram shows significant improvements on several benchmarks, including +3.4 points on MMLU and +4.0 points on CMMLU, along with notable gains on general reasoning tasks [9][28]
- The architecture also improves long-context retrieval: accuracy on Multi-Query NIAH rises from 84.2 to 97.0 [9]

Group 4: Experimental Results
- DeepSeek trained four models: Dense-4B (4.1 billion parameters), MoE-27B (26.7 billion), Engram-27B (26.7 billion), and Engram-40B (39.5 billion), all under the same training conditions [25][27]
- The sparse architectures (MoE-27B, Engram-27B/40B) outperformed the dense model (Dense-4B) across all benchmarks, demonstrating superior scalability [28][30]

Group 5: Memory and Computation Decoupling
- Engram's deterministic retrieval allows parameter storage to be decoupled from compute, enabling efficient scaling without raising computational cost [15][17]
- The architecture supports a multi-level cache hierarchy, optimizing memory access and reducing latency [18]

Group 6: U-Shaped Scaling Law
- DeepSeek identified a U-shaped scaling law for the optimal allocation between MoE and Engram: a balanced split of sparse parameters yields the best performance [19][24]
- The optimal share was found to be roughly 20%-25% of the sparse-parameter budget for Engram, confirming the structural complementarity of the two modules [23][24]
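The U-shaped allocation finding above amounts to a budget split. A minimal sketch, assuming only the 20%-25% optimum reported in the summary; the helper name and its 0.225 midpoint default are illustrative, not part of the released code:

```python
def split_sparse_budget(total_sparse_params, engram_share=0.225):
    # Split a sparse-parameter budget between MoE experts and Engram memory.
    # The 20%-25% Engram share is the U-shaped optimum the paper reports;
    # 0.225 is just the midpoint of that range.
    if not 0.0 <= engram_share <= 1.0:
        raise ValueError("engram_share must lie in [0, 1]")
    engram = total_sparse_params * engram_share
    return total_sparse_params - engram, engram
```

For example, splitting a 20B sparse budget at a 25% Engram share gives 15B of MoE experts and 5B of Engram memory.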
DeepSeek Open-Sources a Memory Module for Large Models! New Paper Co-Signed by Liang Wenfeng Previews the Next Generation of Sparse Models
量子位· 2026-01-13 00:39
Core Insights
- The article discusses the introduction of "conditional memory" in Transformer models, supplying the knowledge retrieval mechanism the original architecture lacked [1][2][9]

Group 1: Introduction of Conditional Memory
- Conditional memory is presented as an essential modeling primitive for the next generation of sparse models [2]
- The research team, led by Liang Wenfeng in collaboration with Peking University, proposes a new paradigm and implementation called the Engram module [3][5]

Group 2: Performance Improvements
- The Engram module lets a 27B-parameter model outperform a pure MoE model of the same size, compressing work that previously required 6 attention layers down to 1-2 layers and freeing capacity for more complex reasoning [5][13]
- The optimal split of sparse parameters between MoE and Engram memory follows a U-shaped curve: allocating about 20% to 25% of sparse parameters to Engram memory minimizes validation loss [34][36]

Group 3: Technical Implementation
- Engram's design uses a large vocabulary of static entities and phrases, enabling O(1) information retrieval [7][14]
- The team addresses classic N-gram problems, such as semantic redundancy and storage explosion, by compressing tokens and using multiple hash functions to map N-grams onto a fixed-size embedding table [22][25]

Group 4: Experimental Results
- The Engram-27B model shows significant improvements across benchmarks, with notable gains on BBH, ARC-Challenge, and DROP [47]
- The architecture's memory management allows a 100-billion-parameter table to be offloaded to CPU memory without significant latency impact during inference [63][66]

Group 5: Future Developments
- DeepSeek's next generation of sparse models is expected before the Spring Festival, signaling continued advances in AI model architecture [67]
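The multiple-hash trick mentioned above, mapping each N-gram through several independent hash functions into one fixed-size table so that a collision under one hash is diluted by the others, can be sketched as follows. The seeds, table size, and averaging fusion are illustrative assumptions, not the paper's exact scheme:

```python
import hashlib

TABLE_SIZE = 4096
EMBED_DIM = 4
NUM_HASHES = 3

# Toy fixed-size embedding table shared by all hash functions.
table = [[((b + 7 * d) % 13) / 13.0 for d in range(EMBED_DIM)]
         for b in range(TABLE_SIZE)]

def bucket(ngram, seed):
    # Different seeds give independent hash functions over the same table.
    key = f"{seed}\u0001" + "\u0001".join(ngram)
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % TABLE_SIZE

def multi_hash_lookup(ngram):
    # Average the embeddings from NUM_HASHES independent buckets: two
    # n-grams would have to collide under every hash to be confused.
    vecs = [table[bucket(ngram, s)] for s in range(NUM_HASHES)]
    return [sum(col) / NUM_HASHES for col in zip(*vecs)]
```

Because the table has a fixed size no matter how many N-grams occur in the corpus, this is one way to avoid the storage explosion of classic N-gram models that the summary mentions.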