Long-Context Modeling
Farewell to the KV Cache Shackles: Compressing Long Context into Weights. Is There Hope for Continually Learning Large Models?
机器之心 · 2026-01-02 01:55
Humanity has set out on the road to creating AGI (artificial general intelligence), and one key aspect of that journey is continual learning: the ability of an AI to keep acquiring new knowledge and skills through interaction with its environment. Recall the first machine learning lecture of your life: you probably cannot remember the first word the professor said, but the intuition and logic that lecture left behind are quietly helping you understand this complex paper right now. The essence of that ability is compression.

Recently, a joint team from the Astera Institute, NVIDIA, Stanford University, UC Berkeley, and UC San Diego proposed TTT-E2E (End-to-End Test-Time Training), an important step along this necessary road to AGI. It breaks through the limitation that traditional models remain static at inference time, turning long-context modeling from an "architecture design" into a "learning problem".

The research community has explored several routes toward this goal, such as recurrent neural networks (RNNs) that update their state in real time, or enormous caches meant to hold vast histories. Yet a true AGI arguably should not merely "store" information passively; like a human, it should "evolve" as it reads. TTT-E2E continues learning at test time through next-token prediction on the given context, compressing the information it reads into the weight parameters.

Editor | Panda

Paper title: End-to-End Test-Time Training ...
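The core recipe, updating the model's weights with a next-token-prediction loss on the incoming context rather than storing that context in a KV cache, can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed details (a Hugging Face causal LM, plain SGD, fixed-size chunks), not the paper's actual training procedure.

```python
# Minimal sketch of test-time training (TTT) via next-token prediction.
# Assumptions (not from the paper): a Hugging Face causal LM, a single
# inner-loop SGD optimizer, and fixed-size context chunks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical stand-in model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

def ttt_on_context(context: str, chunk_size: int = 512) -> None:
    """Compress the context into the weights by gradient steps of
    next-token prediction, instead of keeping it in a KV cache."""
    ids = tokenizer(context, return_tensors="pt").input_ids[0]
    for start in range(0, len(ids) - 1, chunk_size):
        chunk = ids[start : start + chunk_size + 1].unsqueeze(0)
        # labels == input_ids makes the model compute the standard
        # shifted next-token cross-entropy loss internally.
        loss = model(input_ids=chunk, labels=chunk).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# After ttt_on_context(long_document), the model can be queried with a
# short prompt; the document's information now lives in the weights.
```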
Compressing Text with Vision: Tsinghua and Zhipu Unveil the Glyph Framework, Extending the Context Window via Visual-Text Compression
36Kr · 2025-10-21 23:10
Core Insights
- Long-context modeling has emerged as a cutting-edge research trend in the large language model (LLM) industry, and is crucial for enhancing the productivity of LLMs [1]
- The Glyph framework, developed by a research team from Tsinghua University and Z.ai, proposes a novel approach: rendering long texts as images so they can be processed efficiently by visual language models (VLMs) [1][3]

Long-Context LLMs
- Long-context LLMs can achieve comprehensive semantic understanding and enhance multi-step reasoning and long-term memory, akin to human reading [1]
- Traditional methods face practical limits, as extending context windows to millions of tokens sharply increases computational and memory costs [1]

Glyph Framework
- Glyph achieves 3-4x token compression while maintaining accuracy comparable to leading models, significantly improving memory efficiency and training/inference speed [3][11]
- For example, the classic novel "Jane Eyre" (approximately 240k text tokens) is rendered into compact images (about 80k visual tokens), enabling a 128k-context VLM to answer complex questions about it [3]

Research Methodology
- The Glyph framework consists of three main phases: continual pre-training, LLM-driven rendering search, and post-training optimization [8][9][10]
- Continual pre-training renders large-scale long-text data into varied visual styles to simulate real-world long-text scenarios and strengthen cross-modal semantic alignment [8]
- The LLM-driven rendering search uses a genetic algorithm to optimize rendering configurations, balancing compression against understanding (see the sketch after this list) [9]
- Post-training includes supervised fine-tuning and reinforcement learning to further improve the model's text recognition and fine-detail understanding [10]

Performance Evaluation
- Glyph is competitive on multiple long-context benchmarks, achieving an average input compression of 3-4x at accuracy similar to mainstream models [11][16]
- Under extreme compression, Glyph can potentially handle million-token tasks with a 128k context length [17]

Future Directions
- The framework has limitations, such as sensitivity to rendering parameters and the need for better OCR fidelity [21][22]
- Future work may focus on adaptive rendering models, stronger visual encoders, and broader task coverage in evaluation [23]
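The rendering search is the most algorithmic piece of the pipeline. The sketch below shows the general shape of a genetic search over rendering configurations; the parameter space and fitness function are invented for illustration, and the real system scores candidates with an LLM on downstream accuracy, which is stubbed out here with a toy proxy.

```python
# Sketch of a genetic search over text-rendering configurations, in the
# spirit of Glyph's LLM-driven rendering search. The parameter space and
# fitness function are illustrative assumptions, not the paper's setup.
import random

PARAM_SPACE = {
    "font_size": [8, 10, 12, 14],
    "dpi": [72, 96, 120],
    "line_spacing": [1.0, 1.2, 1.5],
}

def random_config():
    return {k: random.choice(v) for k, v in PARAM_SPACE.items()}

def fitness(cfg):
    """Trade off compression against readability. In the real pipeline
    an LLM/VLM would score answer accuracy on rendered pages; here a
    toy proxy stands in: smaller fonts compress more but 'read' worse."""
    compression = 14.0 / cfg["font_size"] * (120.0 / cfg["dpi"])
    readability = cfg["font_size"] / 14.0 * cfg["line_spacing"] / 1.5
    return 0.5 * compression + 0.5 * readability

def mutate(cfg):
    child = dict(cfg)
    key = random.choice(list(PARAM_SPACE))
    child[key] = random.choice(PARAM_SPACE[key])
    return child

def genetic_search(pop_size=20, generations=10):
    population = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]           # selection
        children = [mutate(random.choice(survivors))      # mutation
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)

print(genetic_search())
```

For intuition on the compression the search is tuning: at the quoted figures, the "Jane Eyre" example works out to roughly 240k / 80k = 3x fewer tokens after rendering.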
Will DeepSeek V4 "Take Off" on an Intern's Award-Winning Paper? Liang Wenfeng Takes Aim at Context: 10x Processing Speed and "Perfect" Accuracy
AI前线 · 2025-07-31 05:02
Core Viewpoint
- The article highlights the significant achievements of Chinese authors in computational linguistics, focusing on DeepSeek's award-winning paper, which introduces a novel sparse attention mechanism for long-context modeling and demonstrates efficiency and performance gains over traditional methods [1][17]

Group 1: Award and Recognition
- The ACL announced that over 51% of its 2025 award-winning papers had Chinese authors, versus 14% for the USA [1]
- A DeepSeek paper, with founder Liang Wenfeng among its authors, won a Best Paper award, generating considerable discussion [1]

Group 2: Technical Innovations
- The paper introduces Natively Trainable Sparse Attention (NSA), which combines algorithmic innovation with hardware optimization for efficient long-context modeling [4][6]
- NSA employs a dynamic hierarchical sparse strategy that balances global context awareness with local precision through token compression and token selection (see the sketch after this list) [11]

Group 3: Performance Evaluation
- NSA outperformed traditional full-attention models on 7 of 9 benchmark metrics, particularly on long-context tasks [8][10]
- In a 64k-context "needle in a haystack" test, NSA achieved perfect retrieval accuracy along with significant speedups in decoding and training [9][15]

Group 4: Future Implications
- The upcoming DeepSeek model is expected to incorporate NSA, generating anticipation for its release [17]
- There is speculation that the delayed release of DeepSeek R2 reflects the founder's dissatisfaction with its current performance [17]
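NSA's hierarchical sparsity, coarse compressed block summaries plus fine-grained attention over only the most relevant blocks, can be illustrated with a toy single-query attention routine. The block size, scoring rule, and equal-weight merge below are simplified assumptions; the real mechanism is natively trainable with learned gates and hardware-aligned kernels, neither of which this sketch attempts.

```python
# Toy sketch of hierarchical sparse attention in the spirit of NSA:
# a "compression" branch attends over mean-pooled block summaries, and
# a "selection" branch attends only to the top-k highest-scoring blocks.
# Block size, top-k, and the fixed 0.5/0.5 merge are illustrative.
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, block_size=64, top_k=4):
    # q: (d,), k and v: (T, d). Single query head for clarity.
    T, d = k.shape
    n_blocks = T // block_size
    kb = k[: n_blocks * block_size].view(n_blocks, block_size, d)
    vb = v[: n_blocks * block_size].view(n_blocks, block_size, d)

    # Compression branch: attend over per-block mean summaries.
    k_sum, v_sum = kb.mean(dim=1), vb.mean(dim=1)   # (n_blocks, d)
    w_sum = F.softmax(k_sum @ q / d**0.5, dim=0)
    out_compress = w_sum @ v_sum

    # Selection branch: keep full tokens only in the top-k blocks,
    # scored by how well their summaries match the query.
    block_scores = k_sum @ q
    top = torch.topk(block_scores, min(top_k, n_blocks)).indices
    k_sel = kb[top].reshape(-1, d)
    v_sel = vb[top].reshape(-1, d)
    w_sel = F.softmax(k_sel @ q / d**0.5, dim=0)
    out_select = w_sel @ v_sel

    # NSA learns gates to mix its branches; a fixed average stands in.
    return 0.5 * (out_compress + out_select)

q = torch.randn(64)
k = torch.randn(1024, 64)
v = torch.randn(1024, 64)
print(sparse_attention(q, k, v).shape)  # torch.Size([64])
```

The point of the two branches is that the query never touches most of the 1024 keys token-by-token: it sees every block only as a summary, and full tokens only inside the few blocks the summaries flag as relevant, which is where the claimed decoding speedups come from.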