The long-context problem in large language models
Major release: DeepSeek open-sources again. Vision as compression: 100 tokens beat 7,000
36Kr · 2025-10-21 01:35
Core Insights
- The DeepSeek-OCR model explores the boundaries of vision-text compression, showing it can decode more than ten times as much text information as the number of visual tokens it consumes, pointing toward a way to address long-context issues in large language models (LLMs) [1][16].

Group 1: Model Performance
- On the OmniDocBench benchmark, DeepSeek-OCR surpasses GOT-OCR2.0 while using only 100 visual tokens per page versus GOT-OCR2.0's 256 [2][44].
- The model shows practical value in OCR tasks, sustaining compression ratios of 10.5x to 19.7x with high decoding accuracy; precision remains around 97% at compression ratios up to 10x [37][41].

Group 2: Technical Architecture
- DeepSeek-OCR is an end-to-end vision-language model (VLM) consisting of an encoder (DeepEncoder) and a decoder (DeepSeek-3B-MoE) [21][34].
- The encoder, DeepEncoder, is a novel architecture with approximately 380 million parameters that combines SAM-base and CLIP-large for feature extraction and tokenization [23][24].

Group 3: Compression Capabilities
- The model achieves compression ratios of 7x to 20x on historical contexts of differing ages, offering a feasible direction for addressing long-context issues in LLMs [16].
- Using 20 compute nodes, each equipped with 8 A100-40G GPUs, DeepSeek-OCR can generate training data for LLMs/VLMs at a rate of 33 million pages per day [39].

Group 4: Multilingual and Application Scope
- DeepSeek-OCR can process nearly 100 languages, broadening its applicability in global contexts [43].
- The model can also interpret charts, chemical equations, simple geometric shapes, and natural images, demonstrating its versatility across document types [43][44].
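The compression ratios quoted above are simply the number of text tokens decoded divided by the number of vision tokens consumed. A minimal sketch of that bookkeeping (the function name and the token counts are illustrative, not taken from DeepSeek's code):

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Ratio of decoded text tokens to input vision tokens.

    A ratio of 10.0 means ten text tokens are recovered per vision
    token, i.e. the "10x compression" regime the article describes.
    """
    return text_tokens / vision_tokens

# Illustrative numbers: ~1,000 text tokens decoded from 100 vision tokens.
print(compression_ratio(1000, 100))  # 10.0
```

At this 10x regime the article reports roughly 97% decoding precision, with accuracy degrading as the ratio climbs toward 20x.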
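The ~380M-parameter figure for DeepEncoder is consistent with rough public sizes for its two components; the per-component counts below are approximate assumptions for a sanity check, not DeepSeek's published breakdown:

```python
# Assumed rough component sizes (illustrative, not official figures):
SAM_BASE_PARAMS = 80_000_000     # SAM-base backbone, ~80M parameters
CLIP_LARGE_PARAMS = 300_000_000  # CLIP-large tower, ~300M parameters

total = SAM_BASE_PARAMS + CLIP_LARGE_PARAMS
print(f"~{total / 1e6:.0f}M parameters")  # ~380M parameters
```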
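The 33-million-pages-per-day data-generation figure can be backed out to a per-GPU rate from the hardware the article lists (the arithmetic below is ours, not a number the article states):

```python
nodes = 20
gpus_per_node = 8
pages_per_day = 33_000_000

total_gpus = nodes * gpus_per_node            # 160 A100-40G GPUs in total
pages_per_gpu = pages_per_day / total_gpus    # implied per-GPU daily throughput

print(total_gpus, round(pages_per_gpu))  # 160 206250
```

That works out to roughly 206,000 pages per GPU per day, which illustrates why the model is positioned as a large-scale training-data generator.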