Long-Context Compression
Breaking New Ground: DeepSeek Releases Text-Recognition Model DeepSeek-OCR
Xin Jing Bao· 2025-10-21 03:11
Core Insights
- DeepSeek has released a new model, DeepSeek-OCR, on the open-source community platform Hugging Face; it performs Optical Character Recognition (OCR), extracting text from images [1][4]

Group 1: Model Description
- The accompanying paper describes DeepSeek-OCR as a preliminary study on the feasibility of compressing long contexts through optical two-dimensional mapping [3]
- The model achieves a decoding (OCR) accuracy of 97% when the number of text tokens is within 10 times the number of visual tokens (compression ratio < 10); a worked example of this ratio follows this entry [3]
- Even at a compression ratio of 20, OCR accuracy remains around 60%, indicating significant research potential in long-context compression and memory-forgetting mechanisms for large language models [3]
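The compression ratio cited above is simply the count of text tokens represented per visual token. A minimal sketch of that arithmetic; the function name and the sample token counts are illustrative assumptions, not anything from DeepSeek's release:

```python
# Illustrative arithmetic for the reported compression ratio, read here
# (an assumed interpretation) as text tokens per visual token.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

# Reported operating points: ratio < 10 -> ~97% accuracy; ratio ~20 -> ~60%.
for text_tokens, vision_tokens in [(900, 100), (2000, 100)]:
    r = compression_ratio(text_tokens, vision_tokens)
    accuracy = "~97%" if r < 10 else ("~60%" if r <= 20 else "unreported")
    print(f"{text_tokens} text tokens as {vision_tokens} vision tokens: "
          f"{r:.0f}x compression, reported accuracy {accuracy}")
```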
Impressive: DeepSeek Just Open-Sourced a New Model That Uses Vision to Compress Everything
Ji Qi Zhi Xin· 2025-10-20 09:15
Core Insights
- DeepSeek has released a new OCR model, DeepSeek-OCR, which demonstrates nearly 10x near-lossless context compression by rendering text as images [1][3]
- The model has 3 billion parameters and recorded over 100 downloads shortly after its release [1]
- The research team behind DeepSeek-OCR includes Haoran Wei, Yaofeng Sun, and Yukun Li; Wei previously developed the GOT-OCR2.0 system [1]

Model Architecture
- DeepSeek-OCR consists of two main components: the DeepEncoder and a DeepSeek3B-MoE-A570M decoder (a hedged loading sketch follows this entry) [3][10]
- DeepEncoder is designed to keep activation memory low under high-resolution inputs while achieving high compression ratios, emitting a moderate number of visual tokens [3][14]
- The model achieves 97% OCR accuracy when the number of text tokens is within 10 times the number of visual tokens, and maintains about 60% accuracy at a 20x compression ratio [3][28]

Performance and Practical Applications
- On the OmniDocBench benchmark, DeepSeek-OCR outperformed GOT-OCR2.0 while using only 100 visual tokens to GOT-OCR2.0's 256 [5]
- A single A100-40G GPU can generate over 200,000 pages of LLM/VLM training data per day [5]
- The model outperforms existing systems such as MinerU2.0 while using significantly fewer visual tokens [30][32]

Training and Data
- Training proceeds in two main phases and draws on a variety of OCR datasets plus general visual data [21][24]
- The model was trained on 20 nodes with 8 A100-40G GPUs each, for a global batch size of 640 [25]
- Training throughput reached 90 billion tokens per day on pure-text data and 70 billion tokens per day on multimodal data (a per-GPU estimate follows this entry) [25]

Compression and Recognition Capabilities
- Using the visual modality as an efficient compression medium yields significantly higher compression rates than traditional text representations [9][10]
- The model supports recognition of nearly 100 languages, handling diverse document types [42]
- It can parse complex layouts and extract structured data from charts, a capability that matters for financial and scientific documents [35][40]
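Since the article notes the model is published on Hugging Face, a minimal loading sketch can follow the common transformers remote-code pattern. The repo id "deepseek-ai/DeepSeek-OCR", the CUDA placement, and the commented-out inference call are assumptions for illustration; the article does not document the model's API:

```python
# Hedged sketch: loading DeepSeek-OCR from Hugging Face.
# Assumptions (not confirmed by the article): the repo id below and a
# remote-code model class with its own inference entry point.
from transformers import AutoModel, AutoTokenizer

repo_id = "deepseek-ai/DeepSeek-OCR"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model = model.eval().cuda()  # A100-class GPU, per the article's setup

# The exact call for mapping a document image to text is model-specific;
# this line is purely illustrative of the OCR task the article describes.
# result = model.infer(tokenizer, prompt="<image>\nFree OCR.", image_file="page.png")
```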
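As a sanity check on the reported training throughput, here is a back-of-the-envelope per-GPU estimate under the stated 20-node, 8-GPU configuration; the even division across GPUs is an assumption about how the aggregate figure distributes:

```python
# Back-of-the-envelope check of the reported training throughput,
# assuming the aggregate figures divide evenly across all GPUs.
nodes, gpus_per_node = 20, 8
total_gpus = nodes * gpus_per_node  # 160 A100-40G GPUs

for name, tokens_per_day in [("pure text", 90e9), ("multimodal", 70e9)]:
    per_gpu = tokens_per_day / total_gpus
    print(f"{name}: {tokens_per_day / 1e9:.0f}B tokens/day total "
          f"-> ~{per_gpu / 1e6:.0f}M tokens/GPU/day")
# pure text: ~562M tokens/GPU/day; multimodal: ~438M tokens/GPU/day
```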