Echoing DeepSeek-OCR: A NeurIPS Paper Proposes Letting LLMs Read Long Texts Like Humans
机器之心· 2025-11-10 04:40
Core Insights
- The article discusses VIST (Vision-centric Token Compression in LLM), a framework developed by research teams from Nanjing University of Science and Technology, Central South University, and Nanjing Forestry University, aimed at enhancing long-text reasoning in large language models (LLMs) through a visual approach [2][5].

Research Background
- LLMs have shown remarkable capabilities in understanding and generating short texts, but they struggle with long documents, complex question answering, and retrieval-augmented generation as context length and model parameter counts grow [4].
- Token compression has become essential as input scales grow, since even the most powerful LLMs cannot efficiently analyze such vast amounts of information without it [4].

VIST Framework
- VIST addresses long-text processing by enabling models to read more like humans, using a "slow-fast reading circuit" that mimics human reading strategies [7][8].
- The framework consists of two pathways [8][15]:
  1. Fast path: renders distant, less significant context as images for quick semantic extraction by a lightweight visual encoder.
  2. Slow path: feeds key nearby text directly into the LLM for deep reasoning and language generation.

Visual Compression Mechanism
- VIST's visual compression lets models process long texts efficiently by focusing on salient information while ignoring redundant words [22][23].
- The Probability-informed Visual Enhancement (PVE) mechanism teaches models to "skim read": high-frequency, low-information words are masked, while low-frequency, high-information words are retained [22][23].
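The two-pathway split described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the 2,000-character window, the function name, and the character-based cut are all assumptions made for the example.

```python
# Minimal sketch of a slow-fast split in the spirit of VIST: distant
# context is routed to a "fast" visual path (to be rendered as an image
# for a lightweight visual encoder), while recent context stays on the
# "slow" text path for the LLM. The 2,000-character window and function
# name are illustrative assumptions, not details from the paper.

def split_context(document: str, recent_chars: int = 2000):
    """Return (distant, recent): distant text for the visual fast path,
    recent text for the direct LLM slow path."""
    if len(document) <= recent_chars:
        return "", document
    return document[:-recent_chars], document[-recent_chars:]
```

For a 10,000-character document with the default window, the first 8,000 characters would go to the fast visual path and the last 2,000 to the slow text path.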
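The PVE "skim reading" idea above can likewise be illustrated with a toy frequency filter. In the actual mechanism, masking is probability-informed and applied during training; here, word counts from the input itself stand in for corpus statistics, and the frequency threshold and mask token are arbitrary assumptions.

```python
# Toy illustration of PVE-style skim reading: words that occur often
# (low information) are masked, while rare, content-bearing words are
# kept. Real PVE uses corpus-level probabilities during training; the
# threshold and "[M]" mask token here are illustrative assumptions.
from collections import Counter

def skim_mask(text: str, max_freq: int = 1, mask: str = "[M]") -> str:
    """Mask every word whose frequency exceeds max_freq."""
    words = text.split()
    freq = Counter(words)
    return " ".join(w if freq[w] <= max_freq else mask for w in words)
```

For example, `skim_mask("the cat sat on the mat the dog")` masks the repeated "the" while keeping the rarer content words.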
Performance Metrics
- VIST shows significant advantages over traditional text encoding: it requires 56% fewer visual tokens than conventional text tokens and cuts memory usage by 50% [10][25].
- Across a range of tasks, VIST outperformed the CEPE method, demonstrating reliable long-text processing even under extreme conditions [25].

Visual Text Tokenization
- VIST uses lightweight visual encoders for efficient context compression, simplifying tokenization and eliminating the need for complex preprocessing steps [28].
- Because the visual encoder is not constrained by a vocabulary, it can handle multiple languages while significantly reducing computational and memory overhead [29].

Future Implications
- Visual-driven token compression is expected to become a standard component of long-context understanding in LLMs, paving the way for multimodal intelligent comprehension [32][33].
- This "look before reading" strategy helps large models retain their understanding capabilities while significantly lowering computational costs [33].
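The vocabulary-free tokenization described under "Visual Text Tokenization" can be illustrated with a toy patching scheme. Real VIST renders text to pixels and cuts the image into patches; here, fixed-size character chunks stand in for pixel patches purely to show why no per-language vocabulary is needed, and the patch size is an arbitrary assumption.

```python
# Toy illustration of vocabulary-free visual tokenization: instead of a
# tokenizer's vocabulary lookup, text would be rendered and cut into
# fixed-size patches, each patch becoming one visual token. Character
# chunks stand in for pixel patches here; the patch size of 16 is an
# arbitrary assumption, not a detail from the paper.
def patch_tokens(text: str, chars_per_patch: int = 16) -> list[str]:
    """Cut text into fixed-size chunks, one 'visual token' per chunk."""
    return [text[i:i + chars_per_patch]
            for i in range(0, len(text), chars_per_patch)]
```

The same function works unchanged for any script, e.g. `patch_tokens("Hello 世界, bonjour!")`, since no vocabulary ever enters the picture.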