Core Viewpoint
- The article discusses a paper by DeepSeek and Peking University introducing a new module called Engram, which separates memory from computation in AI models, yielding a significant increase in reasoning capability [3][12].

Group 1: Introduction of the Engram Module
- DeepSeek's Engram module amounts to a "supply-side reform" of AI model architecture: static knowledge is stored separately from computational tasks, which enhances the model's reasoning ability [3][14].
- Engram is inspired by the classic N-gram concept from natural language processing, modernized to allow retrieval of static knowledge in O(1) time per lookup [15][16].

Group 2: Technical Innovations
- Engram stores static knowledge in a large, scalable embedding table that can be read directly, without complex computation; in a traditional Transformer, by contrast, such knowledge is entangled in the model weights [18].
- Three technical barriers were addressed:
  - A. Vocabulary compression reduced the effective vocabulary size by 23% by normalizing semantically similar terms [19].
  - B. Multi-head hashing mitigates the collisions that arise when many N-grams map to a limited number of memory slots, improving robustness [20].
  - C. Context-aware gating acts as a referee, filtering out retrieved static knowledge that is irrelevant to the current context [21][22].

Group 3: Resource Allocation and Model Performance
- A large-scale ablation study revealed a U-shaped scaling law for resource allocation: loss is minimized when roughly 75%-80% of the parameter budget goes to Engram and 20%-25% to MoE [30][31].
- Engram improved not only knowledge tasks but, unexpectedly, also logic, coding, and mathematics, with significant score gains across various benchmarks [39][40].
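To make the retrieval mechanism concrete, here is a minimal, self-contained sketch of an Engram-style lookup combining the three ideas above: hashed N-gram tables (O(1) per lookup), multi-head hashing to soften collisions, and a context-aware gate. All names, table sizes, and initialization values are illustrative assumptions, not the paper's actual implementation.

```python
import hashlib

# Toy scale; the real module uses a vastly larger embedding table.
NUM_SLOTS = 1024   # slots per hashed embedding table
NUM_HEADS = 4      # multi-head hashing: independent hash functions/tables
EMBED_DIM = 8      # per-head embedding width

# One embedding table per hash head, deterministically initialized for the demo.
tables = [
    [[((h * 31 + s) % 7) * 0.1 for _ in range(EMBED_DIM)] for s in range(NUM_SLOTS)]
    for h in range(NUM_HEADS)
]

def hash_ngram(ngram, head):
    """Map an N-gram to a slot index using a head-specific salted hash."""
    data = f"{head}:{' '.join(ngram)}".encode()
    return int(hashlib.md5(data).hexdigest(), 16) % NUM_SLOTS

def engram_lookup(tokens, n=2):
    """O(1)-per-ngram retrieval: hash each N-gram into every head's table
    and concatenate the per-head embeddings. Two N-grams may collide in one
    head's table, but rarely in all heads at once."""
    vectors = []
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        heads = [tables[h][hash_ngram(ngram, h)] for h in range(NUM_HEADS)]
        vectors.append([x for head in heads for x in head])
    return vectors

def gated(memory_vec, gate):
    """Context-aware gating (sketch): scale the retrieved memory by a gate
    in [0, 1]; in the real module the gate would be computed from the
    current hidden state, which is omitted here."""
    return [gate * x for x in memory_vec]

vecs = engram_lookup(["the", "cat", "sat"], n=2)
print(len(vecs), len(vecs[0]))  # 2 bigrams, NUM_HEADS * EMBED_DIM dims each
```

The key property is that lookup cost is independent of how much knowledge is stored: adding slots grows capacity without adding computation.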
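The U-shaped allocation finding can be pictured with a toy budget split. The loss curve below is invented purely to visualize a U-shape around the reported 75%-80% band; it is not the paper's data.

```python
def split_budget(total_params, engram_frac):
    """Divide a fixed parameter budget between Engram memory and MoE experts."""
    engram = total_params * engram_frac
    return engram, total_params - engram

def toy_loss(engram_frac, optimum=0.775):
    """Hypothetical U-shaped loss: quadratic around an assumed optimum
    inside the 75%-80% range reported in the article."""
    return (engram_frac - optimum) ** 2 + 2.0

# Too little memory (left arm) or too little compute (right arm) both hurt.
fracs = [0.5, 0.6, 0.7, 0.775, 0.85, 0.95]
best = min(fracs, key=toy_loss)
print(best)  # 0.775, inside the reported 75%-80% band
```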
Group 4: Engineering Breakthroughs
- Because Engram separates memory from computation, large models can offload memory to cheaper, scalable CPU (host) resources, reducing reliance on expensive GPU memory [46][49].
- The separation also makes memory access predictable, so embedding data can be prefetched, sustaining high throughput even at very large parameter counts; this is a significant advantage for future AI model development [51][52].

Group 5: Future Implications
- The upcoming DeepSeek V4 model is expected to integrate Engram, balancing computation and memory to raise both knowledge capacity and reasoning capability while reducing inference cost [61][64].
- The paper signals an industry shift toward architectural innovation, away from merely scaling computational power and parameter counts, redefining the competitive standards of AI development [65].
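The prefetching idea in Group 4 can be sketched as a simple pipeline: while the accelerator computes step t, the rows that step t+1 will need are fetched from host memory in the background. The simulation below is a hedged illustration of that overlap, not DeepSeek's system; the table, latencies, and function names are all stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the CPU-resident embedding memory (cheap, scalable host RAM).
HOST_TABLE = {i: [float(i)] * 4 for i in range(100)}

def fetch_rows(indices):
    """Simulated host-to-device copy of the requested embedding rows."""
    time.sleep(0.01)  # pretend transfer latency
    return [HOST_TABLE[i] for i in indices]

def compute(activations, rows):
    """Stand-in for the model's dense computation on the accelerator."""
    time.sleep(0.01)
    return [a + r[0] for a, r in zip(activations, rows)]

def run(steps):
    """Each element of `steps` lists the embedding rows that step needs.
    The fetch for step t+1 is issued before computing step t, so transfer
    and compute overlap instead of serializing."""
    acts = [0.0, 0.0, 0.0]
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_rows, steps[0])
        for t in range(len(steps)):
            rows = pending.result()  # rows for this step, already in flight
            if t + 1 < len(steps):
                pending = pool.submit(fetch_rows, steps[t + 1])  # prefetch next
            acts = compute(acts, rows)  # overlaps with the prefetch above
    return acts

print(run([[1, 2, 3], [4, 5, 6]]))  # [5.0, 7.0, 9.0]
```

Prefetching is possible precisely because Engram's lookups depend only on the input N-grams, which are known before the forward pass reaches the memory layer.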
The Eve of DeepSeek V4? New Paper Signed by Liang Wenfeng Released (DeepSeek V4诞生前夜?梁文锋署名新论文发布)
华尔街见闻·2026-01-13 11:01