STEM
ICLR 2026 | Predating DeepSeek Engram, STEM Had Already Restructured Transformer "Memory"
机器之心 · 2026-03-09 02:50
**Core Insights**

- The article traces the evolution of parameter organization in large language models, arguing for more efficient, addressable memory representations [2]
- It introduces STEM, a new approach that replaces the up-projection in the feed-forward network (FFN) with a token-indexed embedding table, enabling static memory access without runtime routing [4][9]
- It attributes the significant capability gains to structural changes rather than to increases in scale or compute [29][30]

**Summary by Sections**

- **Memory Organization**: Storing knowledge in dense matrices limits addressability and efficiency, motivating a shift toward more structured parameter organization [2][3]
- **STEM Approach**: STEM directly modifies the FFN by using a static embedding table indexed by token, which simplifies memory access and improves model performance [4][9]
- **Key Insights of STEM**:
  - **Editability**: The explicit token-to-parameter mapping allows knowledge vectors to be modified directly, without retraining, making knowledge editing straightforward [16][18]
  - **Training Stability**: STEM's static sparse structure avoids the failure modes common to dynamic routing systems, improving training stability [20]
  - **Memory Efficiency**: The geometric structure of the embeddings reduces interference between parameters, yielding more addressable memory slots at lower computational cost [22][23]
  - **Computational Efficiency**: Removing the up-projection saves substantial compute, and large embedding tables can be offloaded to CPU for efficient access [24]
- **Experimental Results**: Tested against dense baselines at 350M and 1B parameters, STEM improved average performance by 3-4%, with gains of up to 9-10% on some knowledge tasks [36]
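To make the core structural change concrete, here is a minimal NumPy sketch contrasting a standard dense FFN with a STEM-style layer as described in the summary. This is our reading of the idea, not the paper's exact formulation: all names (`ffn_dense`, `ffn_stem`, `E`) and the choice of ReLU are illustrative assumptions; the key point is that the runtime up-projection matmul is replaced by a static, token-indexed table lookup with no learned router.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_ff = 1000, 16, 64

# Dense baseline FFN: y = act(x @ W_up) @ W_down.
W_up = rng.standard_normal((d_model, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d_model)) * 0.02

def ffn_dense(x):
    h = np.maximum(x @ W_up, 0.0)      # up-projection computed at runtime
    return h @ W_down

# STEM-style sketch (assumed form): the up-projection output is replaced
# by a row of a static embedding table indexed directly by the token id.
# The memory read is a lookup, not a matmul, and involves no routing.
E = rng.standard_normal((vocab, d_ff)) * 0.02  # one d_ff-dim slot per token

def ffn_stem(x, token_id):
    h = np.maximum(E[token_id], 0.0)   # static, token-indexed memory read
    return h @ W_down

x = rng.standard_normal(d_model)
y_dense, y_stem = ffn_dense(x), ffn_stem(x, token_id=42)
```

Because `E` is indexed only by the token id, the lookup cost is independent of `d_model`, which is also why such a table can live off-accelerator (e.g. on CPU) and be fetched per batch.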
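The editability claim also admits a short sketch: if each token id owns an explicit parameter slot, "editing knowledge" reduces to overwriting one row of the table, with no retraining and no effect on other tokens. Again a hedged toy model, not the paper's procedure; `read_memory` and the slot indices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d_ff, d_model = 1000, 64, 16

E = rng.standard_normal((vocab, d_ff)) * 0.02   # token-indexed memory table
W_down = rng.standard_normal((d_ff, d_model)) * 0.02

def read_memory(token_id):
    # Readout for one token: its slot, activated and down-projected.
    return np.maximum(E[token_id], 0.0) @ W_down

before_7 = read_memory(7).copy()
before_8 = read_memory(8).copy()

# Edit: overwrite only token 7's knowledge vector.
E[7] = rng.standard_normal(d_ff)

changed = not np.allclose(read_memory(7), before_7)   # token 7 updated
isolated = np.allclose(read_memory(8), before_8)      # token 8 untouched
```

The same locality is what makes the edit cheap: in a dense `W_up`, a single fact is smeared across the whole matrix, so no single write can target it.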