Long Context Modeling
Breaking free of the KV Cache: compress long context into weights. Is continual learning for large models finally within reach?
机器之心· 2026-01-02 01:55
Core Viewpoint
- The article discusses the path toward AGI (Artificial General Intelligence) and emphasizes the importance of continual learning, in which an AI acquires new knowledge and skills through interaction with its environment [1]

Group 1: TTT-E2E Development
- A collaborative team from Astera, NVIDIA, Stanford University, UC Berkeley, and UC San Diego has proposed TTT-E2E (End-to-End Test-Time Training), a significant step toward AGI that turns long-context modeling from an architectural design problem into a learning problem [2]
- TTT-E2E aims to overcome the limitation of traditional models that remain static during inference, allowing the model to keep learning during the test phase [9][10]

Group 2: Challenges in Long Context Modeling
- The article highlights the dilemma in long-context modeling: the full attention mechanism of Transformers performs well on long texts but incurs inference costs that grow rapidly with sequence length [5]
- Alternatives such as RNNs and state space models (SSMs) have constant per-token computation costs but often suffer performance declines on very long texts [5][6]

Group 3: TTT-E2E Mechanism
- TTT-E2E defines the model's behavior at test time as an online optimization process: before predicting the next token, the model performs self-supervised learning on the tokens it has already read [11]
- The approach incorporates meta-learning to optimize the model's initialization parameters, teaching the model how to learn effectively [13]
- A hybrid architecture combines a sliding-window attention mechanism for short-term memory with a dynamically updated MLP layer for long-term memory, mimicking biological memory systems [13][14]

Group 4: Experimental Results
- Experiments show that TTT-E2E matches the performance scaling of full-attention Transformers, with loss staying on a comparable curve as context length grows from 8K to 128K [21]
- In inference efficiency, TTT-E2E shows a clear advantage: at a 128K context, it processes tokens 2.7 times faster than a full-attention Transformer [22]

Group 5: Future Implications
- TTT-E2E marks a shift from static models to dynamic individuals, in which processing a long document becomes a kind of micro self-evolution [27]
- This "compute-for-storage" approach envisions models that continuously adjust themselves while processing vast amounts of information, potentially encapsulating the history of human civilization within their parameters without hitting hardware limits [29]
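The core idea of the mechanism described above — compressing what has been read into weights via online self-supervised updates, rather than caching every token — can be illustrated with a minimal sketch. Everything here (the reconstruction loss, the learning rate, the linear fast-weight memory) is an illustrative assumption, not the paper's actual formulation or architecture.

```python
import numpy as np

# Toy sketch of test-time training: a small "fast-weight" memory W is updated
# online with a self-supervised reconstruction objective as tokens stream in,
# so long-range context is compressed into weights instead of a growing KV cache.

rng = np.random.default_rng(0)
d = 16                      # embedding dimension (illustrative)
W = np.zeros((d, d))        # fast weights: the model's long-term memory
lr = 0.01                   # inner-loop (test-time) learning rate

def ttt_step(W, x):
    """One online update: train W to reconstruct the current token embedding."""
    err = W @ x - x                     # self-supervised reconstruction error
    return W - lr * np.outer(err, x)    # SGD step on 0.5 * ||W @ x - x||^2

tokens = rng.standard_normal((256, d))  # a stream of token embeddings
for x in tokens:
    W = ttt_step(W, x)                  # memory cost is O(d^2), not O(sequence)

# After the stream, W acts as a fixed-size summary of everything it has read:
x = tokens[-1]
recon_err = np.linalg.norm(W @ x - x) / np.linalg.norm(x)
print(f"relative reconstruction error on last token: {recon_err:.3f}")
```

Note how memory stays constant regardless of how many tokens stream by; this is the "compute-for-storage" trade the article describes, paid for with the extra gradient step per token.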
Compressing text with vision: Tsinghua and Zhipu launch the Glyph framework, extending the context window through visual-text compression
36Kr· 2025-10-21 23:10
Core Insights
- Long-context modeling has emerged as a cutting-edge research trend in the large language model (LLM) industry and is crucial for improving LLM productivity [1]
- The Glyph framework, developed by a research team from Tsinghua University and Z.ai, takes a novel approach: it renders long texts as images and processes them efficiently with visual language models (VLMs) [1][3]

Long Context LLMs
- Long-context LLMs can achieve comprehensive semantic understanding and enhance multi-step reasoning and long-term memory, akin to human reading [1]
- Traditional methods are limited in practice by the computational and memory costs of extending context windows to millions of tokens [1]

Glyph Framework
- Glyph achieves 3-4x token compression while maintaining accuracy comparable to leading models, significantly improving memory efficiency and training/inference speed [3][11]
- For example, the classic novel "Jane Eyre" (approximately 240k text tokens) can be rendered into compact images (about 80k visual tokens), enabling a 128k-context VLM to answer complex questions about the full book [3]

Research Methodology
- The Glyph framework consists of three main phases: continual pre-training, LLM-driven rendering search, and post-training optimization [8][9][10]
- Continual pre-training renders large-scale long-text data into diverse visual styles to simulate real-world long-text scenarios and strengthen cross-modal semantic alignment [8]
- The LLM-driven rendering search uses a genetic search algorithm to optimize rendering configurations, balancing compression against comprehension [9]
- Post-training includes supervised fine-tuning and reinforcement learning to further improve the model's text recognition and fine-grained understanding [10]

Performance Evaluation
- Glyph is competitive on multiple long-context benchmarks, achieving an average input compression rate of 3-4x while maintaining accuracy similar to mainstream models [11][16]
- Under extreme compression, Glyph can potentially handle million-token tasks with a 128k context length [17]

Future Directions
- The framework has limitations, such as sensitivity to rendering parameters and the need for improved OCR fidelity [21][22]
- Future research may focus on adaptive rendering models, stronger visual encoders, and broader evaluation coverage [23]
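The compression arithmetic behind the "Jane Eyre" example can be sketched with a back-of-envelope estimate: text tokens scale with character count, while visual tokens scale with rendered page area divided by the vision encoder's patch size. All constants below (characters per text token, page resolution, patch size, characters per rendered page) are illustrative assumptions, not Glyph's actual rendering configuration.

```python
# Back-of-envelope sketch of Glyph-style visual-text compression: how many
# visual tokens does a rendered page cost versus the text tokens it replaces?

CHARS_PER_TEXT_TOKEN = 4      # rough average for English BPE tokenizers (assumed)
PAGE_PX = (448, 448)          # assumed rendered page resolution
PATCH = 14                    # assumed ViT patch size
CHARS_PER_PAGE = 12_000       # assumed dense small-font layout per page

def compression_ratio(n_chars: int) -> float:
    """Ratio of text tokens to visual tokens for a document of n_chars."""
    text_tokens = n_chars / CHARS_PER_TEXT_TOKEN
    pages = max(1, -(-n_chars // CHARS_PER_PAGE))                # ceil division
    visual_tokens_per_page = (PAGE_PX[0] // PATCH) * (PAGE_PX[1] // PATCH)
    visual_tokens = pages * visual_tokens_per_page
    return text_tokens / visual_tokens

# "Jane Eyre"-scale input: ~240k text tokens -> ~960k characters.
ratio = compression_ratio(960_000)
print(f"estimated compression: {ratio:.1f}x")  # → estimated compression: 2.9x
```

With these assumed constants the estimate lands in the same 3-4x ballpark the article reports; the real ratio depends on font size, layout density, and the encoder's patch geometry, which is exactly what Glyph's rendering search optimizes.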
Is DeepSeek V4 "taking off" on the strength of an intern's award-winning paper? Liang Wenfeng targets long context: 10x processing speed and "perfect" accuracy
AI前线· 2025-07-31 05:02
Core Viewpoint
- The article highlights the achievements of Chinese authors in computational linguistics, focusing on DeepSeek's award-winning paper, which introduces a novel sparse attention mechanism for long-context modeling and demonstrates efficiency and performance gains over traditional methods [1][17]

Group 1: Award and Recognition
- The ACL announced that over 51% of the 2025 award-winning papers had Chinese authors, versus 14% for the USA [1]
- A DeepSeek paper led by author Liang Wenfeng won a Best Paper award, which has generated considerable discussion [1]

Group 2: Technical Innovations
- The paper introduces a Natively Trainable Sparse Attention (NSA) mechanism, which combines algorithmic innovation with hardware optimization for efficient long-context modeling [4][6]
- NSA employs a dynamic hierarchical sparse strategy that balances global context awareness with local precision through token compression and token selection [11]

Group 3: Performance Evaluation
- NSA outperformed traditional full-attention models on 7 of 9 benchmark metrics, particularly on long-context tasks [8][10]
- In a "needle in a haystack" test with a 64k context, NSA achieved perfect retrieval accuracy along with significant speedups in decoding and training [9][15]

Group 4: Future Implications
- The upcoming DeepSeek model is expected to incorporate NSA technology, generating anticipation for its release [17]
- There is speculation that DeepSeek R2's release has been delayed because the founder is dissatisfied with its current performance [17]
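The compress-then-select idea behind the sparse strategy described above can be illustrated with a toy single-head attention step: pool the key sequence into block summaries, use the query to pick the top-scoring blocks, and attend only over those blocks plus a local sliding window. The block size, top-k, and window values are illustrative assumptions; the real NSA combines three learned branches with hardware-aligned kernels, which this sketch does not attempt.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def nsa_like_attention(q, K, V, block=8, topk=2, window=8):
    """Toy NSA-style step: compression -> selection -> sparse attention."""
    n = K.shape[0]
    # 1) Compression: mean-pool key blocks into coarse summaries.
    n_blocks = n // block
    K_blocks = K[: n_blocks * block].reshape(n_blocks, block, -1).mean(axis=1)
    # 2) Selection: score blocks with the query, keep the top-k blocks.
    block_scores = K_blocks @ q
    chosen = np.argsort(block_scores)[-topk:]
    idx = set()
    for b in chosen:
        idx.update(range(b * block, (b + 1) * block))
    # 3) Sliding window: always include the most recent tokens for local precision.
    idx.update(range(max(0, n - window), n))
    idx = sorted(idx)
    # Attend only over the selected subset instead of all n tokens.
    w = softmax(K[idx] @ q / np.sqrt(q.shape[0]))
    return w @ V[idx], len(idx)

rng = np.random.default_rng(0)
n, d = 256, 32
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
q = rng.standard_normal(d)
out, attended = nsa_like_attention(q, K, V)
print(f"attended to {attended} of {n} tokens")
```

Per-query cost drops from O(n) to roughly O(n/block + topk * block + window) score computations, which is the source of the decoding speedups the article reports; the "natively trainable" part of NSA means these selection decisions are learned end-to-end rather than hand-tuned as here.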