Core Viewpoint
- The article discusses the path toward AGI (Artificial General Intelligence) and emphasizes the importance of continual learning, in which an AI acquires new knowledge and skills by interacting with its environment [1].

Group 1: TTT-E2E Development
- A collaborative team from Astera, NVIDIA, Stanford University, UC Berkeley, and UC San Diego has proposed TTT-E2E (End-to-End Test-Time Training), which turns long-context modeling from an architectural design choice into a learning problem, a significant step toward AGI [2].
- TTT-E2E aims to overcome the limitation of traditional models, whose weights stay static during inference, by allowing the model to keep learning during the testing phase [9][10].

Group 2: Challenges in Long Context Modeling
- The article highlights the dilemma of long-context modeling: the full attention mechanism of Transformers performs well on long texts, but its inference cost grows sharply as length increases [5] (a rough cost comparison is sketched after this summary).
- Alternatives such as RNNs and state space models (SSMs) have constant per-token computation cost but often lose accuracy on very long texts [5][6].

Group 3: TTT-E2E Mechanism
- TTT-E2E defines the model's test-time behavior as an online optimization process: before predicting the next token, the model performs self-supervised learning on the tokens it has already read [11] (see the inner-loop sketch below).
- The approach incorporates meta-learning to optimize the model's initialization parameters, so the model learns how to learn effectively [13] (see the meta-learning sketch below).
- A hybrid architecture combines a sliding-window attention mechanism serving as short-term memory with a dynamically updated MLP layer serving as long-term memory, loosely mimicking biological memory systems [13][14] (see the hybrid-block sketch below).

Group 4: Experimental Results
- Experiments show that TTT-E2E scales comparably to full-attention Transformers, with its loss remaining consistent as context length increases from 8K to 128K [21].
- In inference efficiency TTT-E2E has a clear advantage: at a 128K context it processes tokens 2.7 times faster than a full-attention Transformer [22].

Group 5: Future Implications
- TTT-E2E marks a shift from static models to dynamic individuals: processing a long document becomes a small act of self-evolution [27].
- This "compute-for-storage" approach envisions a future in which models continuously adjust themselves while processing vast amounts of information, potentially compressing the history of human civilization into their parameters without being bound by hardware memory limits [29].
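To make the long-context dilemma in Group 2 concrete, here is a back-of-the-envelope sketch in Python comparing the per-token cost and cache memory of full attention against a constant-size-state layer (RNN/SSM style). All model dimensions below are hypothetical placeholders, not configurations from the article.

```python
# Back-of-the-envelope comparison: full attention vs. constant-state layers.
# All model sizes here are hypothetical placeholders for illustration only.

def full_attention_cost(context_len: int, n_layers: int, n_kv_heads: int,
                        head_dim: int, bytes_per_value: int = 2):
    """Per-token attention FLOPs grow with context length (quadratic over the
    whole sequence), and the KV cache grows with it as well."""
    # Each new token attends to every cached position in every layer.
    per_token_flops = 2 * n_layers * n_kv_heads * head_dim * context_len
    # The KV cache stores keys and values for every position, layer, and head.
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return per_token_flops, kv_cache_bytes

def constant_state_cost(n_layers: int, state_dim: int, bytes_per_value: int = 2):
    """RNN / SSM-style layers keep a fixed-size state, so per-token cost and
    memory do not depend on how much context has already been read."""
    per_token_flops = 2 * n_layers * state_dim * state_dim
    state_bytes = n_layers * state_dim * bytes_per_value
    return per_token_flops, state_bytes

if __name__ == "__main__":
    for ctx in (8_192, 131_072):  # 8K vs. 128K context
        flops, cache = full_attention_cost(ctx, n_layers=32, n_kv_heads=8, head_dim=128)
        print(f"full attention @ {ctx:>7} ctx: {flops / 1e6:8.1f} MFLOPs/token, "
              f"KV cache {cache / 2**30:.2f} GiB")
    flops, state = constant_state_cost(n_layers=32, state_dim=4096)
    print(f"constant-state layer (any ctx): {flops / 1e6:8.1f} MFLOPs/token, "
          f"state {state / 2**20:.2f} MiB")
```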
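For the first Group 3 bullet, the following is a minimal, hypothetical PyTorch sketch of test-time training as online optimization: a small "fast" MLP is refined with a few gradient steps on a self-supervised next-token loss over the tokens already read, and the updated weights are then used to predict the next token. The module names, optimizer, and data shapes are assumptions for illustration, not the paper's actual TTT-E2E implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastMLP(nn.Module):
    """Hypothetical 'long-term memory' module whose weights are updated at test time."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, x):
        return x + self.net(x)  # residual refinement of the hidden states

def test_time_update(fast_mlp, feats, next_ids, lm_head, lr=1e-3, steps=1):
    """Online optimization on the tokens already read: fit the fast MLP with a
    self-supervised next-token loss while everything else stays frozen.
    feats:    (seq, d_model) hidden states of the tokens read so far.
    next_ids: (seq,) id of the token that followed each position."""
    opt = torch.optim.SGD(fast_mlp.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = lm_head(fast_mlp(feats))          # (seq, vocab)
        loss = F.cross_entropy(logits, next_ids)   # self-supervised LM loss
        loss.backward()
        opt.step()
    return fast_mlp

if __name__ == "__main__":
    torch.manual_seed(0)
    d_model, vocab, seq = 64, 1000, 32
    lm_head = nn.Linear(d_model, vocab)            # not updated in this sketch
    fast = FastMLP(d_model, d_hidden=128)
    # Dummy stand-ins for already-read token features and their successor ids.
    feats = torch.randn(seq, d_model)
    next_ids = torch.randint(0, vocab, (seq,))
    fast = test_time_update(fast, feats, next_ids, lm_head, steps=3)
    # The updated fast weights are then used when predicting the next token.
    print(lm_head(fast(feats[-1:])).shape)         # torch.Size([1, 1000])
```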
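For the meta-learning bullet, a common way to express "learning how to learn" is a MAML-style outer loop that backpropagates through a differentiable inner update; the sketch below uses torch.func.functional_call and torch.autograd.grad to show the pattern. It is a generic learning-to-learn skeleton under assumed shapes and losses, not the paper's end-to-end training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

def inner_update(module, params, x, targets, lm_head, inner_lr):
    """One differentiable inner-loop step on the self-supervised loss.
    Returns updated parameters without modifying the module in place."""
    logits = lm_head(functional_call(module, params, (x,)))
    loss = F.cross_entropy(logits, targets)
    grads = torch.autograd.grad(loss, tuple(params.values()), create_graph=True)
    return {name: p - inner_lr * g
            for (name, p), g in zip(params.items(), grads)}

def meta_step(module, lm_head, meta_opt, support, query, inner_lr=1e-2):
    """Outer loop: optimize the *initialization* so that, after adapting on the
    support tokens, the module predicts the query tokens well."""
    x_s, y_s = support          # tokens already read (features, next-token ids)
    x_q, y_q = query            # later tokens used to score the adapted weights
    init_params = dict(module.named_parameters())
    adapted = inner_update(module, init_params, x_s, y_s, lm_head, inner_lr)
    query_logits = lm_head(functional_call(module, adapted, (x_q,)))
    meta_loss = F.cross_entropy(query_logits, y_q)
    meta_opt.zero_grad()
    meta_loss.backward()        # gradient flows back through the inner update
    meta_opt.step()
    return meta_loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    d_model, vocab, seq = 64, 1000, 32
    fast = nn.Sequential(nn.Linear(d_model, 128), nn.GELU(), nn.Linear(128, d_model))
    lm_head = nn.Linear(d_model, vocab)
    meta_opt = torch.optim.Adam(list(fast.parameters()) + list(lm_head.parameters()), lr=1e-3)
    support = (torch.randn(seq, d_model), torch.randint(0, vocab, (seq,)))
    query = (torch.randn(seq, d_model), torch.randint(0, vocab, (seq,)))
    print(meta_step(fast, lm_head, meta_opt, support, query))
```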
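For the hybrid-architecture bullet, the sketch below composes a causal sliding-window self-attention (short-term memory over recent tokens) with a residual MLP whose weights are the ones that would be updated online (long-term memory). The window size, layer shapes, and residual wiring are illustrative guesses rather than the published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlidingWindowSelfAttention(nn.Module):
    """Causal self-attention restricted to the last `window` positions (short-term memory)."""
    def __init__(self, d_model: int, n_heads: int, window: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.head_dim, self.window = n_heads, d_model // n_heads, window
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim).
        q, k, v = (t.reshape(b, s, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        i = torch.arange(s, device=x.device)
        # Position i may attend to positions j with i - window < j <= i.
        mask = (i[None, :] <= i[:, None]) & (i[None, :] > i[:, None] - self.window)
        y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.out(y.transpose(1, 2).reshape(b, s, -1))

class HybridBlock(nn.Module):
    """Short-term memory (windowed attention) + long-term memory (fast MLP).
    Only self.fast_mlp would be updated online at test time; the rest stays frozen."""
    def __init__(self, d_model=64, n_heads=4, window=16, d_hidden=128):
        super().__init__()
        self.attn = SlidingWindowSelfAttention(d_model, n_heads, window)
        self.fast_mlp = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                      nn.Linear(d_hidden, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))       # recent context, bounded cost per token
        x = x + self.fast_mlp(self.norm2(x))   # long-range context compressed into weights
        return x

if __name__ == "__main__":
    block = HybridBlock()
    print(block(torch.randn(2, 48, 64)).shape)  # torch.Size([2, 48, 64])
```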
Goodbye to the KV Cache shackles: by compressing long context into the weights, is continual learning for large models finally within reach?
机器之心·2026-01-02 01:55