Visual Compression
Just Now, DeepSeek Released OCR 2
程序员的那些事 · 2026-01-27 15:40
Core Viewpoint
- DeepSeek has launched a new model, DeepSeek-OCR 2, which uses a new method called DeepEncoder V2 to dynamically rearrange image components based on their meaning, aligning more closely with how humans visually encode a scene [1][3].

Group 1: What Is Different in DeepSeek-OCR 2
- Unlike traditional OCR systems, which scan and encode an image uniformly, DeepSeek-OCR 2 introduces a semantic-driven dynamic encoding mechanism that assesses which visual regions matter most during the encoding phase itself [3].
- The previous version, DeepSeek-OCR 1, was noted for treating OCR as a visual compression problem: compressing visual content into a form that language models can digest more easily [3].
- DeepSeek-OCR 2 advances this idea by letting visual encoding enter the "understanding phase" rather than serving merely as a preprocessing step [4].

Group 2: Open Source and Accessibility
- As with previous major releases, DeepSeek-OCR 2 has been made open source, with the project, paper, and model weights all released simultaneously [4].
- The project can be accessed at the GitHub link provided in the article, with the technical report and model available on Hugging Face [4].
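The idea of semantic-driven dynamic encoding can be illustrated with a toy sketch: score each image patch by some importance measure, then keep and reorder patches by that score instead of scanning them uniformly. Everything below is hypothetical (the article does not describe DeepEncoder V2's internals); `dynamic_encode`, the variance-based importance function, and the `budget` parameter are illustrative stand-ins.

```python
import random
from statistics import pvariance

def dynamic_encode(patches, importance_fn, budget):
    """Score patches by a semantic-importance function, then keep the
    top `budget` patches in importance order. A hypothetical sketch of
    importance-driven encoding, not DeepEncoder V2 itself."""
    ranked = sorted(patches, key=importance_fn, reverse=True)
    return ranked[:budget]

# Toy usage: patches are lists of pixel values; "importance" is their
# variance, so detailed regions outrank flat background.
rng = random.Random(0)
patches = [[rng.gauss(0, s) for _ in range(16)] for s in (0.1, 2.0, 0.5, 1.5)]
kept = dynamic_encode(patches, pvariance, budget=2)
```

The contrast with uniform scanning is that the encoder's token budget is spent where the importance score is highest, rather than spread evenly across the image.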
Coinciding with DeepSeek-OCR: NeurIPS Paper Proposes Letting LLMs Read Long Texts Like Humans
机器之心 · 2025-11-10 04:40
Core Insights
- The article discusses VIST (Vision-centric Token Compression in LLM), a framework developed by research teams from Nanjing University of Science and Technology, Central South University, and Nanjing Forestry University, aimed at enhancing long-text reasoning in large language models (LLMs) through a visual approach [2][5].

Research Background
- LLMs show remarkable capability in understanding and generating short texts, but they struggle with long documents, complex question answering, and retrieval-augmented generation as context length and model parameter counts grow [4].
- Token compression has become essential as input scale grows: even the most powerful LLMs cannot efficiently analyze such volumes of information [4].

VIST Framework
- VIST addresses long-text processing by letting models read more like humans, using a "slow-fast reading circuit" that mimics human reading strategies [7][8].
- The framework consists of two pathways:
  1. Fast path: renders distant, less significant context as images for quick semantic extraction by a lightweight visual encoder.
  2. Slow path: feeds key nearby text directly into the LLM for deep reasoning and language generation [8][15].

Visual Compression Mechanism
- VIST's visual compression lets models process long texts efficiently by focusing on salient information while ignoring redundant words [22][23].
- The Probability-informed Visual Enhancement (PVE) mechanism teaches models to "skim read" by masking high-frequency, low-information words and retaining low-frequency, high-information words [22][23].
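The frequency intuition behind PVE can be sketched in a few lines: mask the most frequent token types (which carry little information) and keep the rarer, more informative ones. This is only an illustration of the frequency idea described above; the actual PVE mechanism operates on rendered text during training, and `skim_mask`, `keep_ratio`, and the `[MASK]` placeholder are all illustrative choices.

```python
from collections import Counter

def skim_mask(tokens, keep_ratio=0.5):
    """Toy frequency-based 'skim reading': mask the most frequent
    (low-information) token types, keep the rarer (high-information)
    ones. Illustrative sketch only, not the paper's PVE training."""
    freq = Counter(tokens)
    rare_first = sorted(freq, key=lambda t: freq[t])  # rarest types first
    n_keep = max(1, int(len(rare_first) * keep_ratio))
    keep = set(rare_first[:n_keep])
    return [t if t in keep else "[MASK]" for t in tokens]

words = "the model reads the long text and the model skims the text".split()
masked = skim_mask(words)
```

Running this masks every occurrence of common words like "the" while content-bearing rare words such as "reads" and "long" survive, mirroring the skim-reading behavior the article describes.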
Performance Metrics
- VIST shows clear advantages over traditional text encoding: it requires 56% fewer visual tokens than conventional text tokens and uses 50% less memory [10][25].
- Across a range of tasks, VIST outperformed the CEPE method, demonstrating reliable long-text processing even under extreme conditions [25].

Visual Text Tokenization
- VIST uses lightweight visual encoders for efficient context compression, simplifying tokenization and removing the need for complex preprocessing steps [28].
- Because the visual encoder handles multiple languages without being constrained by a vocabulary, it significantly reduces computational and memory overhead [29].

Future Implications
- Visual-driven token compression is expected to become a standard component of long-context understanding in LLMs, paving the way for multimodal intelligent comprehension [32][33].
- This "look before reading" strategy lets large models retain understanding capability while significantly lowering computational cost [33].
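The reported 56% reduction in token count is simple arithmetic over the two token budgets. The concrete counts below (1000 text tokens vs. 440 visual tokens) are illustrative numbers chosen to match the reported ratio, not figures from the paper.

```python
def token_reduction(text_tokens, visual_tokens):
    """Percentage reduction when visual tokens replace text tokens."""
    return round((1 - visual_tokens / text_tokens) * 100, 1)

# Illustrative: a passage needing 1000 text tokens would need ~440
# visual tokens under the reported 56% reduction.
reduction = token_reduction(1000, 440)  # → 56.0
```

A shorter visual token sequence also shrinks the attention computation and KV cache, which is consistent with the 50% memory saving the article reports.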