Core Insights
- The article discusses the evolution of parameter organization in large language models, emphasizing the need for more efficient memory representation methods [2]
- It introduces STEM, a new approach that replaces the up-projection in the Feed-Forward Network (FFN) with a token-indexed embedding table, allowing for static memory access without runtime routing [4][9]
- The article highlights that significant improvements in model capability can come from structural changes rather than merely increasing scale or computational power [29][30]

Summary by Sections
- Memory Organization: The traditional method of storing knowledge in dense matrices has limitations in addressability and efficiency, prompting a shift toward more structured parameter organization [2][3]
- STEM Approach: STEM directly modifies the FFN structure by using a static embedding table indexed by tokens, which simplifies memory access and enhances model performance [4][9]
- Key Insights of STEM:
  - Editability: The explicit token-parameter relationship allows direct modification of knowledge vectors without retraining, enabling easier knowledge editing [16][18]
  - Training Stability: STEM's static sparse structure avoids common issues found in dynamic routing systems, leading to improved training stability [20]
  - Memory Efficiency: The geometric structure of embeddings in STEM reduces interference between parameters, allowing more addressable memory slots at lower computational cost [22][23]
  - Computational Efficiency: Removing the up-projection saves significant compute, and large embedding tables can be offloaded to CPU for efficient access [24]
- Experimental Results: STEM was tested against dense baselines at model sizes of 350M and 1B parameters, showing an average performance improvement of 3-4%, with some knowledge tasks improving by up to 9-10% [36]
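The STEM idea described above can be sketched at the level of tensor shapes. The following is a minimal illustration under stated assumptions, not the paper's implementation: the table name `E`, the ReLU nonlinearity, and all dimensions are invented for the example, and the real model presumably still conditions the FFN output on the hidden state (e.g., via gating), which is omitted here. It contrasts a dense FFN up-projection (a matmul) with a token-indexed lookup, and shows why the explicit token-to-slot mapping makes a knowledge vector directly editable.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab = 64, 256, 1000  # toy sizes, chosen for illustration

# Dense baseline FFN: hidden vector comes from a matmul with W_up.
W_up = rng.standard_normal((d_model, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d_model)) * 0.02

def dense_ffn(x):
    h = np.maximum(x @ W_up, 0.0)  # up-projection + ReLU
    return h @ W_down              # down-projection

# STEM-style sketch: the up-projection is replaced by a static
# embedding table indexed by the input token id, so the d_ff-dim
# hidden vector is a lookup (no matmul, no runtime routing).
E = rng.standard_normal((vocab, d_ff)) * 0.02  # token-indexed memory slots

def stem_ffn(x, token_ids):
    # NOTE: real STEM presumably still mixes in x (e.g., a gate);
    # here only the static lookup is shown.
    h = np.maximum(E[token_ids], 0.0)
    return h @ W_down

x = rng.standard_normal((8, d_model))
ids = rng.integers(0, vocab, size=8)
print(dense_ffn(x).shape, stem_ffn(x, ids).shape)

# Editability: because slot E[t] belongs explicitly to token t, a
# "knowledge vector" can be modified in place, without retraining.
before = stem_ffn(x, ids).copy()
E[ids[0]] += 1.0  # hypothetical knowledge edit on one token's slot
after = stem_ffn(x, ids)
print(np.allclose(before[0], after[0]))  # the edited token's output changed
```

The lookup also makes the compute saving visible: the dense path costs a `d_model x d_ff` matmul per token, while the STEM path is an O(d_ff) memory read, which is why the table can be large and even CPU-offloaded.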
ICLR 2026 | Predating DeepSeek Engram, STEM has already restructured Transformer "memory"
机器之心 (Synced) · 2026-03-09 02:50