Just in: Liang Wenfeng puts his name on an open-sourced "memory" module, adding new detail to the DeepSeek V4 picture
程序员的那些事·2026-01-13 00:56

Core Insights
- DeepSeek, in collaboration with Peking University, has released a research paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models," which focuses on enhancing large language models (LLMs) through conditional memory and a new module called Engram [1][3][4].

Group 1: Research Background and Problem Statement
- Current large language models rely primarily on Mixture of Experts (MoE) for sparsity, but existing Transformer architectures lack a native knowledge-retrieval mechanism and must simulate retrieval behavior inefficiently [3][9].
- DeepSeek proposes conditional memory as a complementary approach to MoE, introducing the Engram module to address the limitations of current models [4][9].

Group 2: Engram Module and Its Functionality
- The Engram module modernizes classic n-gram embeddings, enabling knowledge retrieval with O(1) time complexity (a minimal lookup sketch follows this summary) [9].
- Engram separates static knowledge storage from dynamic computation, offloading the reconstruction burden from the model's shallow layers and freeing capacity for complex reasoning [11][13].

Group 3: Performance Improvements
- Engram has been scaled to 27 billion parameters and shows significant gains over a pure-MoE baseline under equivalent parameter and FLOPs budgets [11].
- Knowledge-retrieval metrics improve notably, with MMLU +3.4 and CMMLU +4.0, alongside gains on general reasoning tasks such as BBH +5.0 and ARC-Challenge +3.7 [11][38].

Group 4: System Efficiency and Scalability
- Engram's deterministic addressing supports prefetching from host memory at runtime with minimal performance overhead, allowing efficient memory management (a prefetching sketch appears at the end of this piece) [12][19].
- The architecture decouples parameter storage from computational resources, enabling capacity to scale linearly with the number of accelerators [21][22].

Group 5: Experimental Results
- Four models were trained on the same data with the same training procedure: Dense-4B, MoE-27B, Engram-27B, and Engram-40B [35][36].
- The sparse architectures (MoE-27B, Engram-27B/40B) significantly outperformed the dense Dense-4B model across benchmarks, demonstrating superior scaling properties [38].

Group 6: Long Context Training
- The Engram architecture shows significant advantages on long-context tasks by preserving valuable attention capacity for global context processing [41].
- Controlled experiments indicate that Engram outperforms MoE models on complex retrieval tasks, confirming its architectural advantage [46].
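To make the O(1) retrieval idea in Group 2 concrete, here is a minimal sketch of a hashed n-gram memory table. The table size, hash scheme, n-gram order, and all names (engram_table, ngram_slot, engram_lookup) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Minimal sketch of a hashed n-gram memory table with O(1) lookup.
# Table size, hash, and n-gram order are illustrative assumptions,
# not values from the DeepSeek paper.

NUM_SLOTS = 1 << 16   # a real table would hold far more entries
EMBED_DIM = 256

rng = np.random.default_rng(0)
engram_table = rng.standard_normal((NUM_SLOTS, EMBED_DIM)).astype(np.float32)

def ngram_slot(token_ids, n=2, prime=1_000_003):
    """Deterministically map the trailing n token ids to a table slot.

    The address depends only on the token ids, so it is known as soon as
    the tokens are -- the property that later enables prefetching.
    """
    h = 0
    for t in token_ids[-n:]:
        h = (h * prime + int(t)) % NUM_SLOTS
    return h

def engram_lookup(token_ids, n=2):
    """O(1) retrieval: one hash plus one row read, independent of table size."""
    return engram_table[ngram_slot(token_ids, n)]

# Example: fetch the memory vector conditioned on the current bigram.
context = [101, 2054, 2003, 1996]          # made-up token ids
memory_vec = engram_lookup(context, n=2)   # shape: (EMBED_DIM,)
print(memory_vec.shape)
```

In a full model, the retrieved vector would presumably be merged into the hidden states of a shallow layer; that is the rough intuition behind storing static knowledge in a table rather than reconstructing it in computation.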

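Group 4's point about deterministic addressing can also be sketched: because the table slots follow directly from the token ids, the needed rows can be gathered from a host-resident table while other computation proceeds, and the compute path blocks only when the rows are actually consumed. The sizes, the helper names (addresses_for, fetch_rows, run_layer), and the thread pool standing in for an asynchronous host-to-device copy are all illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Sketch of overlapping host-memory fetches with computation.
# A real system would copy rows into accelerator memory (e.g. pinned
# buffers plus async DMA); here both "host table" and "compute" are
# plain NumPy, and all sizes are assumptions.

HOST_TABLE_ROWS = 100_000          # a real host-resident table would be much larger
EMBED_DIM = 256
host_table = np.zeros((HOST_TABLE_ROWS, EMBED_DIM), dtype=np.float32)

def addresses_for(token_ids, n=2, prime=1_000_003):
    """Deterministic addressing: slots follow directly from the token ids."""
    slots = []
    for i in range(n - 1, len(token_ids)):
        h = 0
        for t in token_ids[i - n + 1:i + 1]:
            h = (h * prime + int(t)) % HOST_TABLE_ROWS
        slots.append(h)
    return slots

def fetch_rows(slots):
    """Gather the needed rows from the host-resident table (the 'prefetch')."""
    return host_table[slots]

def run_layer(hidden, memory_rows):
    """Stand-in for a layer that consumes the fetched memory rows."""
    return hidden + memory_rows.mean(axis=0)

token_ids = [101, 2054, 2003, 1996, 3899]
hidden = np.ones(EMBED_DIM, dtype=np.float32)

with ThreadPoolExecutor(max_workers=1) as pool:
    # Kick off the fetch as soon as the addresses are known...
    future = pool.submit(fetch_rows, addresses_for(token_ids))
    # ...do unrelated computation while the copy is in flight...
    hidden = hidden * 2.0
    # ...and block only when the memory rows are actually needed.
    hidden = run_layer(hidden, future.result())

print(hidden.shape)
```

Because the addresses never depend on intermediate activations, this fetch can in principle be issued a full step ahead of when the rows are needed, which is what keeps the reported runtime overhead minimal.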