Is DeepSeek-OCR the Future of "Long-Text Understanding"? VTCBench, a New Benchmark from the Chinese Academy of Sciences, Gives an Answer
机器之心 (Jiqizhixin) · 2026-01-10 04:06
Core Insights
- DeepSeek-OCR's Vision-Text Compression (VTC) technology achieves a compression ratio of up to 10x, significantly reducing the cost of processing long texts with large models [2][7]
- VTCBench, a benchmark developed by research teams from institutions including the Chinese Academy of Sciences, evaluates the cognitive limits of models in visual space through tasks such as information retrieval, associative reasoning, and long-term memory [2][10]

VTC Technology Overview
- The VTC paradigm renders long documents as high-density 2D images, which a visual encoder then converts into a limited number of visual tokens, whereas conventional models read thousands of plain-text tokens [6]
- The technology achieves a token compression ratio of between 2x and 10x, significantly lowering computational and memory costs during long-text processing [7]

VTCBench Benchmark
- VTCBench systematically evaluates models' cognitive limits in visual space through three main tasks:
1. VTC-Retrieval: Tests the model's ability to find specific facts in a vast visual context [10]
2. VTC-Reasoning: Challenges the model to find facts through associative reasoning when there is minimal text overlap [10]
3. VTC-Memory: Simulates long dialogues to assess the model's ability to resist the decay of temporal and structural information [10]

VTCBench-Wild
- VTCBench-Wild assesses the robustness of models in complex real-world scenarios and incorporates 99 different rendering configurations [11]

Cognitive Bottlenecks
- Current vision-language models (VLMs) may excel at OCR, but their understanding of the high-density information in VTC-compressed text remains questionable [9]
- Test results show a pronounced "U-shaped curve" in model performance: models capture information at the beginning and end of documents, but their understanding of facts in the middle deteriorates as document length increases [14][15]

Industry Insights
- Despite the efficiency gains from VTC, existing VLMs still perform significantly worse than pure-text LLMs on complex reasoning and memory tasks [17]
- The performance of models such as Gemini-3-Pro on VTCBench-Wild demonstrates that VTC is a highly feasible path for large-scale long-text processing, with visual understanding capabilities that nearly match pure-text baselines [17][18]
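
To make the 2x to 10x token compression figures from the VTC Technology Overview concrete, the following is a minimal Python sketch of the arithmetic. It assumes the compression ratio is defined as text tokens divided by the visual tokens that replace them. The function names and the 10,000-token document are hypothetical illustrations, not part of DeepSeek-OCR or VTCBench.

```python
import math


def compression_ratio(text_tokens: int, visual_tokens: int) -> float:
    """Ratio of plain-text tokens to the visual tokens that replace them.

    Assumed definition for illustration: ratio = text_tokens / visual_tokens.
    """
    if text_tokens < 0 or visual_tokens <= 0:
        raise ValueError("text_tokens must be >= 0 and visual_tokens must be > 0")
    return text_tokens / visual_tokens


def visual_token_budget(text_tokens: int, ratio: float) -> int:
    """Visual tokens needed to cover text_tokens at a given ratio, rounded up."""
    if ratio <= 0:
        raise ValueError("ratio must be > 0")
    return math.ceil(text_tokens / ratio)


if __name__ == "__main__":
    doc_tokens = 10_000  # hypothetical document length, not a figure from the article
    for ratio in (2, 10):  # the two ends of the range quoted in the article
        budget = visual_token_budget(doc_tokens, ratio)
        print(f"{ratio}x compression: {doc_tokens} text tokens -> {budget} visual tokens")
```

Under these assumptions, the same document would occupy between one half and one tenth of its original token count in the model's context. The sketch says nothing about how well the model understands the compressed content, which is the question VTCBench addresses.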