Visual Compression
DeepSeek Just Released OCR 2
程序员的那些事· 2026-01-27 15:40
Reposted from InfoQ. DeepSeek has just released a new model, DeepSeek-OCR 2, built on a novel DeepEncoder V2 approach that lets the AI dynamically reorder the parts of an image according to their meaning, bringing it closer to the visual encoding logic of humans. Concretely, the DeepSeek team says in the paper that it instantiated this architecture with Qwen2-0.5B, and in DeepSeek-OCR 2 the idea is pushed further.

According to the technical report, DeepEncoder V2 no longer treats visual encoding as a single static, fixed-policy scan. Instead it introduces a semantics-driven dynamic encoding mechanism: already during encoding, the model judges which regions are more likely to carry key information and adjusts the allocation and representation of visual tokens accordingly. In other words, visual encoding is no longer mere "preprocessing"; it has moved forward into the "understanding" stage.

As with almost every major DeepSeek release, the model, code, and technical report are open-sourced together. The project, paper, and model weights are live:

Project: https://github.com/deepseek-ai/DeepSeek-OCR-2
Paper: https://github.com/deepseek-ai/DeepSeek-OCR-2/blob/main ...
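The report, as summarized above, does not spell out the routing rule at code level. The sketch below is only a minimal illustration of what semantics-driven visual token allocation could look like: the saliency scorer, the token budget, and all tensor shapes are hypothetical stand-ins, not DeepSeek's implementation.

```python
import torch

# Hypothetical sketch of semantics-driven visual token allocation.
# The scorer and budget are illustrative assumptions: the idea is that
# patches judged more likely to carry key information get the tokens,
# instead of every patch being encoded uniformly.

def allocate_visual_tokens(patch_feats: torch.Tensor,
                           scorer: torch.nn.Module,
                           budget: int) -> torch.Tensor:
    """Keep only the `budget` patches scored as most informative."""
    scores = scorer(patch_feats).squeeze(-1)          # (num_patches,)
    keep = scores.topk(budget).indices.sort().values  # preserve reading order
    return patch_feats[keep]                          # (budget, dim)

# Toy usage: 1024 image patches of dim 256, compressed to 256 visual tokens.
feats = torch.randn(1024, 256)
scorer = torch.nn.Linear(256, 1)   # stand-in saliency head
tokens = allocate_visual_tokens(feats, scorer, budget=256)
print(tokens.shape)  # torch.Size([256, 256])
```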
Converging with DeepSeek-OCR: a NeurIPS paper proposes letting LLMs read long text the way humans do
机器之心· 2025-11-10 04:40
Core Insights
- The article discusses VIST (Vision-centric Token Compression in LLM), a framework developed by research teams from Nanjing University of Science and Technology, Central South University, and Nanjing Forestry University that enhances long-text reasoning in large language models (LLMs) through a visual approach [2][5].

Research Background
- LLMs show remarkable capabilities in understanding and generating short texts, but they struggle with long documents, complex question answering, and retrieval-augmented generation as context length and model parameter size grow [4].
- Token compression has therefore become essential: as input scale grows, even the most powerful LLMs cannot efficiently analyze such volumes of information [4].

VIST Framework
- VIST tackles long-text processing by letting models read more like humans, via a "slow-fast reading circuit" that mimics human reading strategies (a minimal sketch follows this summary) [7][8].
- The framework consists of two pathways:
  1. Fast path: renders distant, less significant context as images for quick semantic extraction by a lightweight visual encoder.
  2. Slow path: feeds key nearby text directly into the LLM for deep reasoning and language generation [8][15].

Visual Compression Mechanism
- VIST's visual compression lets models process long texts efficiently by focusing on significant information while ignoring redundant words [22][23].
- The Probability-informed Visual Enhancement (PVE) mechanism teaches models to "skim read" by masking high-frequency, low-information words and retaining low-frequency, high-information words (see the PVE sketch below) [22][23].

Performance Metrics
- Against traditional text encoding, VIST needs 56% fewer visual tokens than conventional text tokens and cuts memory usage by 50% [10][25].
- Across a range of tasks VIST outperformed the CEPE method, remaining reliable in long-text processing even under extreme conditions [25].

Visual Text Tokenization
- VIST uses lightweight visual encoders for efficient context compression, simplifying tokenization and eliminating complex preprocessing steps [28].
- Because the visual encoder is not constrained by a vocabulary, it handles multiple languages while significantly reducing computational and memory overhead [29].

Future Implications
- Visual-driven token compression is expected to become a standard component of long-context understanding in LLMs, paving the way for multimodal intelligent comprehension [32][33].
- This "look before reading" strategy helps large models keep their understanding capabilities while significantly lowering computational costs [33].
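To make the slow-fast circuit concrete, here is a minimal sketch under stated assumptions: `render_text`, `LightVisualEncoder`, the token counts, and the fusion by simple concatenation are all illustrative stand-ins, not the paper's actual modules.

```python
from PIL import Image, ImageDraw
import torch

def render_text(text: str, width=512, height=64) -> torch.Tensor:
    """Fast path, step 1: rasterize distant context into a grayscale image."""
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).text((4, 4), text, fill=0)
    buf = torch.frombuffer(bytearray(img.tobytes()), dtype=torch.uint8)
    return buf.float().view(1, height, width) / 255.0

class LightVisualEncoder(torch.nn.Module):
    """Fast path, step 2: a tiny stand-in for the lightweight encoder."""
    def __init__(self, n_tokens=16, dim=256):
        super().__init__()
        self.patchify = torch.nn.Conv2d(1, dim, kernel_size=32, stride=32)
        self.n_tokens = n_tokens
    def forward(self, img):
        feats = self.patchify(img.unsqueeze(0)).flatten(2).transpose(1, 2)
        return feats[:, : self.n_tokens]   # compress to a few visual tokens

far_context = "...thousands of distant tokens, summarized as pixels..."
near_text_embeds = torch.randn(1, 128, 256)   # slow path: nearby text embeddings
visual_tokens = LightVisualEncoder()(render_text(far_context))
llm_input = torch.cat([visual_tokens, near_text_embeds], dim=1)
print(llm_input.shape)  # (1, 16 + 128, 256): far context costs only 16 tokens
```

The point of the sketch is the asymmetry: the distant context collapses into a handful of visual tokens, while nearby text keeps its full token-level resolution for the LLM to reason over.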
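A similarly hedged sketch of the PVE idea: the paper derives informativeness from token probabilities, whereas this toy uses raw corpus frequency as a stand-in, purely to show the "skim reading" masking of frequent, low-information words.

```python
from collections import Counter

def pve_mask(words, corpus_counts: Counter, keep_ratio=0.5):
    """Mask the most frequent (least informative) words before rendering,
    so the visual encoder learns to focus on rare, content-bearing ones."""
    ranked = sorted(words, key=lambda w: corpus_counts[w])  # rarest first
    keep = set(ranked[: max(1, int(len(words) * keep_ratio))])
    return [w if w in keep else "<m>" for w in words]

corpus = Counter("the of a to the the model encoder token the of".split())
print(pve_mask("the model masks the frequent words".split(), corpus))
# ['<m>', '<m>', 'masks', '<m>', 'frequent', 'words']
```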
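For scale, illustrative arithmetic based on the reported ratios (not a figure from the paper): if 10,000 tokens of distant context were rendered and compressed at the reported 56% reduction, roughly 4,400 visual tokens would stand in for them, with activation and cache footprints shrinking on the order of the reported 50% memory saving.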