New paper signed by Liang Wenfeng: a first look at the DeepSeek V4 architecture? Taking aim at a fatal flaw of the Transformer
36Kr · 2026-01-13 01:24

Core Insights
- DeepSeek's new paper addresses the memory limitations of Transformer models by adding a complementary "conditional memory" sparse axis, implemented as the Engram module, which enables knowledge retrieval at near O(1) cost per token [1][6][11].

Group 1: Memory and Model Architecture
- While MoE (Mixture of Experts) has become the mainstream architecture for large models, it is still built on the Transformer, which lacks a native knowledge-retrieval mechanism and therefore spends computation inefficiently [9][11].
- Engram offloads the static, repetitive patterns in language modeling to a scalable lookup module, freeing the Transformer backbone to focus on tasks that require composition and reasoning [11][15].
- The authors divide language-modeling work into two kinds: tasks that require composition and reasoning, and tasks that amount to pattern retrieval; the latter, they argue, deserves a dedicated mechanism [12][13].

Group 2: Engram Architecture and Functionality
- Engram is conceived as a modernized hash N-gram: a scalable lookup module integrated within the Transformer architecture (see the hashed-lookup sketch after this summary) [18][20].
- The module handles input sequences in a two-stage process, retrieval followed by fusion, which lets the model process static patterns efficiently (see the two-stage usage sketch below) [20][21].
- A context-aware gating mechanism lets the model weight the retrieved embeddings against the current context, improving overall expressiveness and suppressing noise from hash collisions (see the gating sketch below) [25][27].

Group 3: Performance and Scaling
- The paper reports a U-shaped scaling law: allocating a fixed resource budget between MoE and Engram has an interior optimum, suggesting that balancing dynamic computation against static memory is crucial for performance [3][33].
- When scaled to 27 billion parameters, Engram outperforms the MoE baseline under matched parameter and FLOPs budgets, demonstrating its effectiveness across a range of benchmarks [5][38].
- Beyond knowledge retrieval, Engram also improves reasoning, mathematics, and coding capabilities, a significant gain across multiple task metrics [39][48].

Group 4: Future Implications
- The findings point to a paradigm shift toward a dual-axis architecture of computation plus memory, with potential integration into future generations of large language models such as V4 [46][50].
- The paper argues that integrating Engram could substantially improve model efficiency and capability, paving the way for more advanced applications in natural language processing [51][52].
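
The article does not reproduce the paper's code, so the following is only a minimal sketch of what a hashed N-gram lookup table of the kind the summary describes could look like. It assumes PyTorch, and the class name `HashedNGramLookup` and parameters `table_size`, `dim`, `ngram` are illustrative choices, not the paper's actual implementation. The point it illustrates is the near O(1) retrieval: each position hashes its trailing n-gram into a bucket and fetches one embedding.

```python
import torch
import torch.nn as nn

class HashedNGramLookup(nn.Module):
    """Hashed N-gram embedding table: one bucket lookup per token position (sketch)."""
    def __init__(self, table_size: int = 1_000_000, dim: int = 512, ngram: int = 2):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)  # static "memory" parameters
        self.table_size = table_size
        self.ngram = ngram

    def _hash(self, grams: torch.Tensor) -> torch.Tensor:
        # grams: (batch, seq, ngram) integer ids -> bucket index via a rolling hash.
        # Collisions are possible; a gating stage (sketched below) can suppress that noise.
        h = torch.zeros_like(grams[..., 0])
        for k in range(grams.shape[-1]):
            h = (h * 1000003 + grams[..., k]) % self.table_size
        return h

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq). Build the trailing n-gram at each position
        # (left-padded with 0), hash it, and fetch one embedding: O(1) work per token.
        pad = token_ids.new_zeros(token_ids.size(0), self.ngram - 1)
        padded = torch.cat([pad, token_ids], dim=1)
        grams = padded.unfold(dimension=1, size=self.ngram, step=1)  # (batch, seq, ngram)
        return self.table(self._hash(grams))                         # (batch, seq, dim)
```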
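
For the context-aware gating mechanism, a plausible minimal form is a gate computed from the current hidden state and the retrieved embedding, injected residually. Again, this is a hedged sketch under assumed names (`GatedMemoryFusion`, `proj`, `gate`); the paper's actual gating design may differ.

```python
import torch
import torch.nn as nn

class GatedMemoryFusion(nn.Module):
    """Context-aware gate deciding how much retrieved memory enters the hidden state (sketch)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)      # map the retrieved embedding into model space
        self.gate = nn.Linear(2 * dim, dim)  # gate computed from [hidden, retrieved]

    def forward(self, hidden: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # hidden, retrieved: (batch, seq, dim)
        mem = self.proj(retrieved)
        g = torch.sigmoid(self.gate(torch.cat([hidden, mem], dim=-1)))
        return hidden + g * mem              # residual injection of gated memory
```

A gate near zero lets the backbone ignore a retrieval that clashes with the context, which is one way hash-collision noise could be reduced, as the summary describes.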
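
Putting the two hypothetical pieces together gives the two-stage flow the summary attributes to Engram: retrieval, then fusion into the Transformer hidden state. This usage snippet reuses the sketch classes above and is illustrative only.

```python
import torch

# Stage 1 (retrieval): hash the trailing n-gram and look up a static embedding.
# Stage 2 (fusion): gate the retrieved embedding against the current hidden state.
lookup = HashedNGramLookup(table_size=1_000_000, dim=512, ngram=2)
fusion = GatedMemoryFusion(dim=512)

token_ids = torch.randint(0, 32_000, (2, 16))   # toy batch of token ids
hidden = torch.randn(2, 16, 512)                # hidden states from a Transformer layer

retrieved = lookup(token_ids)                   # (2, 16, 512), near O(1) per position
output = fusion(hidden, retrieved)              # (2, 16, 512), memory mixed into the backbone
```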
