DeepSeek's Ultimate Ambition: Remaking the Basic Language of Large Language Models into Images
36Kr · 2025-10-21 12:52
Core Insights
- DeepSeek has open-sourced DeepSeek-OCR, an OCR model that achieves state-of-the-art results on benchmarks such as OmniDocBench [1]
- The motivation for entering the OCR field is to address the computational bottleneck of long-context processing in large language models (LLMs) [4][6]
- The paper proposes that text can be efficiently compressed through optical 2D mapping, allowing vision-language models (VLMs) to decompress the original information from images [4][6]

Group 1: Long Context Processing
- The pursuit of longer context in LLMs has become a competitive arms race, with token windows expanding from thousands to millions [7]
- The core limitation arises from the attention mechanism in the Transformer architecture, whose computational complexity and memory usage grow quadratically with sequence length [7]
- DeepSeek-AI's engineers pose a more fundamental question: rather than merely optimizing the attention computation, can the number of tokens itself be compressed? [7][10]

Group 2: Visual Tokens vs. Text Tokens
- Visual tokens are the basic units of information processed by vision models, while text tokens are the units consumed by LLMs [8]
- A 1024x1024 image can be divided into 4096 visual tokens, and a single such image can carry document text that would require far more tokens in textual form (see the arithmetic sketch after this digest) [9]
- The realization that the visual modality can serve as an efficient compression medium for text led to the creation of DeepSeek-OCR [9]

Group 3: DeepEncoder and Compression Techniques
- DeepSeek-OCR is essentially a proof of concept for an "optical compression-decompression" system [10]
- The DeepEncoder, its key innovation, is designed to handle high-resolution inputs while emitting a minimal number of visual tokens [11][12]
- The architecture consists of three stages: a local detail processor, a compression module, and a global attention layer (see the architecture sketch after this digest) [14][16]

Group 4: Performance Metrics
- Experiments show a 10.5x compression rate, with 64 visual tokens decoding 600-700 text tokens at an OCR accuracy of 96.5% [17][18]
- At a 20x compression rate, the model still maintains around 60% accuracy while decoding over 1200 text tokens [17][18]
- DeepSeek-OCR outperforms existing models such as GOT-OCR2.0 and MinerU2.0 in both accuracy and token efficiency [19][20]

Group 5: Future Vision and Memory Simulation
- The team aims to simulate the forgetting mechanism of human memory, which naturally prioritizes relevant information while compressing less important details [25][27]
- The multi-resolution design of DeepSeek-OCR provides a technical foundation for managing memory in a way that mimics human cognition: older context can be rendered at progressively lower resolutions, so it costs fewer tokens as it fades [29][30]
- The ultimate goal is a system that balances information retention against computational efficiency, potentially yielding a new paradigm for AI memory and input systems [32][35]
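To make the Group 2 and Group 4 figures concrete, here is a back-of-the-envelope sketch in Python. The 16-pixel patch size is an assumption (it is consistent with the stated 1024x1024 image yielding 4096 visual tokens, but the digest does not specify it); the token counts and accuracy figures are the article's own claims.

```python
# Back-of-the-envelope arithmetic for "contextual optical compression".
# Assumption: a ViT-style encoder with 16x16-pixel patches, consistent
# with the digest's 1024x1024 image -> 4096 visual tokens
# (1024 / 16 = 64 patches per side; 64 * 64 = 4096).

def visual_tokens(height: int, width: int, patch: int = 16) -> int:
    """Patch tokens produced for an image before any compression."""
    return (height // patch) * (width // patch)

def compression_ratio(text_tokens: int, vis_tokens: int) -> float:
    """Text tokens represented per visual token."""
    return text_tokens / vis_tokens

print(visual_tokens(1024, 1024))              # 4096 raw patch tokens
# Digest figures: 64 visual tokens decode 600-700 text tokens at
# ~96.5% accuracy, and 1200+ text tokens at ~60% accuracy.
print(f"{compression_ratio(650, 64):.1f}x")   # ~10.2x (high accuracy)
print(f"{compression_ratio(1250, 64):.1f}x")  # ~19.5x (degraded accuracy)
```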
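Group 3's three-stage pipeline can be sketched structurally as below. This is an illustration under stated assumptions, not the released DeepEncoder: the layer counts, embedding dimension, window size, and the 4x4 convolutional downsampling (a 16x token reduction) are all placeholders chosen to show the order of operations.

```python
import torch
import torch.nn as nn

# Structural sketch of the three-stage DeepEncoder described in Group 3:
# (1) windowed local attention over high-res patches, (2) a convolutional
# compressor that cuts the token count 16x, (3) global attention over the
# small remaining token set. All sizes here are illustrative assumptions.

class DeepEncoderSketch(nn.Module):
    def __init__(self, dim: int = 768, patch: int = 16, window: int = 8):
        super().__init__()
        self.window = window
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.local = nn.TransformerEncoder(layer(), num_layers=2)        # stage 1
        self.compress = nn.Conv2d(dim, dim, kernel_size=4, stride=4)     # stage 2
        self.global_attn = nn.TransformerEncoder(layer(), num_layers=2)  # stage 3

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.patchify(img)              # (B, D, 64, 64) for a 1024px input
        b, d, h, w = x.shape
        ws, nh, nw = self.window, h // self.window, w // self.window
        # Stage 1: attention only inside non-overlapping 8x8 patch windows,
        # so activation cost stays low even on high-resolution inputs.
        x = (x.reshape(b, d, nh, ws, nw, ws)
              .permute(0, 2, 4, 3, 5, 1)
              .reshape(b * nh * nw, ws * ws, d))
        x = self.local(x)
        x = (x.reshape(b, nh, nw, ws, ws, d)
              .permute(0, 5, 1, 3, 2, 4)
              .reshape(b, d, h, w))
        # Stage 2: 4x4 strided conv merges patch neighborhoods, 16x fewer tokens.
        x = self.compress(x)                # (B, D, 16, 16)
        # Stage 3: full attention is now cheap over the compressed tokens.
        return self.global_attn(x.flatten(2).transpose(1, 2))

out = DeepEncoderSketch()(torch.randn(1, 3, 1024, 1024))
print(out.shape)  # torch.Size([1, 256, 768]) -- 256 visual tokens
```

The design point the digest emphasizes survives these simplifications: expensive global attention only ever runs after the token count has been cut, so high-resolution pages stay affordable.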
Blockbuster: DeepSeek Open-Sources Again. Vision Is Compression: 100 Tokens Beat 7,000
36Kr · 2025-10-21 01:35
A picture is worth a thousand words! The DeepSeek-OCR model boldly explores the boundary of vision-text compression. By decoding more than 10x as much text information from a small number of visual tokens, this end-to-end VLM architecture not only crushes GOT-OCR2.0 on the OmniDocBench benchmark but also offers an efficient solution to the long-context problem of LLMs.

DeepSeek releases another new model!

On GitHub, DeepSeek has created the DeepSeek-OCR repository, with the goal of exploring the boundaries of vision-text compression.

As the saying goes, a picture is worth ten thousand words. The same holds for LLMs!

In theory, the DeepSeek-OCR model offers an initial validation of the feasibility of "contextual optical compression": from a small number of visual tokens, the model can effectively decode more than 10x that number of text tokens.

In other words, a single image containing document text can represent rich information with far fewer tokens than the equivalent text would require.

This suggests that optical compression via visual tokens can achieve high compression ratios.

As an intermediate modality connecting vision and language, the OCR task is an ideal testbed for the vision-text compression paradigm: it establishes a natural compression-decompression mapping between visual and textual representations, while providing quantifiable evaluation metrics.

On OCR tasks, DeepSeek-OCR has high practical value: on the OmniDocBench benchmark, using only 100 visual tokens it surpasses GOT-OCR2.0 ...
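The long-context payoff follows from standard Transformer arithmetic rather than anything DeepSeek-specific: self-attention materializes an n-by-n score matrix per head, so shrinking the token count n shrinks that cost quadratically. A minimal illustration (the 32-head count is an arbitrary assumption):

```python
# Self-attention materializes an n x n score matrix per head per layer,
# so cost grows quadratically in sequence length n. Compare a page held
# as ~6000 text tokens with the ~100 visual tokens cited above.

def score_entries(n_tokens: int, n_heads: int = 32) -> int:
    """Entries across one layer's attention score matrices."""
    return n_heads * n_tokens * n_tokens

text, vision = score_entries(6000), score_entries(100)
print(f"text:   {text:,}")           # 1,152,000,000
print(f"vision: {vision:,}")         # 320,000
print(f"ratio:  {text // vision}x")  # 3600x -- a 60x token cut, squared
```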
DeepSeek Team Releases New Visual Compression Model DeepSeek-OCR
Zhitong Finance (智通财经网) · 2025-10-20 11:37
Core Insights
- The DeepSeek-AI team has released a new research result, DeepSeek-OCR, which innovatively compresses long text context into visual tokens, significantly reducing the number of tokens needed for processing [1]
- The system consists of two main components: DeepEncoder, designed to take high-resolution input at low activation cost, and DeepSeek3B-MoE-A570M as the decoder [1]
- Experiments show that when the number of text tokens is no more than ten times the number of visual tokens (a compression ratio below 10x), the model achieves an OCR accuracy of 97%; even at a 20x compression ratio, accuracy remains around 60% [1]

Performance Metrics
- On the OmniDocBench test, DeepSeek-OCR surpassed GOT-OCR2.0 while using only 100 visual tokens per page, versus 256 tokens per page for GOT-OCR2.0 [2]
- DeepSeek-OCR also outperformed MinerU2.0, which uses an average of over 6000 tokens per page, while using fewer than 800 visual tokens [2]
- On a single A100-40G GPU, the system can generate over 200,000 pages of training data per day for large language models/vision language models (see the throughput sketch below) [2]
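The throughput claim is easy to sanity-check; a small sketch using only the numbers quoted in this article (all of them reported claims, not independent measurements):

```python
# Sanity-check of the article's throughput and token-budget figures for a
# single A100-40G GPU. All numbers below come from the article itself.

PAGES_PER_DAY = 200_000
SECONDS_PER_DAY = 86_400
print(f"{PAGES_PER_DAY / SECONDS_PER_DAY:.2f} pages/sec sustained")  # ~2.31

budgets = {                        # visual or text tokens per page
    "GOT-OCR2.0": 256,
    "MinerU2.0 (avg)": 6000,       # "over 6000" per the article
    "DeepSeek-OCR vs GOT": 100,
    "DeepSeek-OCR vs MinerU": 800, # "fewer than 800"
}
for name, tokens in budgets.items():
    print(f"{name:>24}: {tokens:>4} tokens/page")
```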