Breaking New Ground: DeepSeek Releases Text Recognition Model DeepSeek-OCR
Xin Jing Bao· 2025-10-21 03:11
Beijing News Shell Finance (reporter Luo Yidan) — On October 20, Beijing time, DeepSeek released its new model, DeepSeek-OCR, on the open-source community Hugging Face. An OCR (Optical Character Recognition) model is a technology for extracting text from images.

DeepSeek also uploaded a paper accompanying the model, in which DeepSeek-OCR is described as "an initial exploration of the feasibility of compressing long contexts via optical 2D mapping." Experiments show that when the number of text tokens is within 10 times the number of visual tokens (i.e., a compression ratio under 10x), the model achieves 97% decoding (OCR) precision; even at a 20x compression ratio, OCR accuracy still holds at around 60%. This suggests considerable potential for research areas such as long-context compression and memory-forgetting mechanisms in large language models. (Source: Xin Jing Bao) ...
Breaking New Ground: DeepSeek Releases Text Recognition Model DeepSeek-OCR
Bei Ke Cai Jing· 2025-10-20 12:37
Core Insights
- DeepSeek has released a new model, DeepSeek-OCR, on the open-source community platform Hugging Face; it is designed for Optical Character Recognition (OCR), i.e., extracting text from images [1][3]

Group 1: Model Description
- DeepSeek-OCR is described in an accompanying paper as a preliminary study on the feasibility of compressing long contexts through optical two-dimensional mapping [3]
- The model achieves a decoding (OCR) accuracy of 97% when the number of text tokens is within 10 times the number of visual tokens (compression ratio < 10x), as illustrated in the sketch below [3]
- Even at a compression ratio of 20x, OCR accuracy remains around 60%, indicating significant potential for research areas such as long-context compression and memory-forgetting mechanisms in large language models [3]
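To make the compression figures concrete, here is a minimal sketch of the arithmetic behind them. The helper function is illustrative only (our own naming, not from the paper); the accuracy figures quoted in the comments come from the articles above.

```python
def compression_ratio(n_text_tokens: int, n_visual_tokens: int) -> float:
    """Number of text tokens represented per visual token.

    Per the article: decoding accuracy is ~97% while this ratio stays
    below 10, and still ~60% when the ratio reaches 20.
    """
    return n_text_tokens / n_visual_tokens

# Example: a page whose text tokenizes to 1,000 tokens, rendered into an
# image the encoder maps to 100 visual tokens -> 10x compression.
print(compression_ratio(1000, 100))  # 10.0
```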
Remarkably Strong: DeepSeek Just Open-Sourced a New Model That Uses Vision to Compress Everything
Ji Qi Zhi Xin· 2025-10-20 09:15
Core Insights
- DeepSeek has released a new OCR model, DeepSeek-OCR, which demonstrates the potential for roughly 10x near-lossless contextual compression through text-to-image methods [1][3]
- The model has a parameter count of 3 billion and had already seen over 100 downloads shortly after its release [1]
- The research team behind DeepSeek-OCR includes Haoran Wei, Yaofeng Sun, and Yukun Li; Wei previously developed the GOT-OCR2.0 system [1]

Model Architecture
- DeepSeek-OCR consists of two main components: the DeepEncoder and the DeepSeek3B-MoE-A570M decoder [3][10]
- DeepEncoder is designed to keep activation memory low under high-resolution inputs while achieving high compression ratios, producing a moderate number of visual tokens [3][14]
- The model achieves an OCR accuracy of 97% when the number of text tokens is within 10 times the number of visual tokens, and maintains about 60% accuracy at a compression ratio of 20x [3][28] (a hedged loading-and-inference sketch follows this digest)

Performance and Practical Applications
- On the OmniDocBench benchmark, DeepSeek-OCR outperformed GOT-OCR2.0 while using only 100 visual tokens versus GOT-OCR2.0's 256 [5]
- The model can generate over 200,000 pages of LLM/VLM training data per day on a single A100-40G GPU [5]
- DeepSeek-OCR shows strong practical capability, beating existing models such as MinerU2.0 while using significantly fewer visual tokens [30][32]

Training and Data
- Training proceeds in two main phases, drawing on a variety of OCR datasets and general visual data [21][24]
- The model was trained on 20 nodes, each equipped with 8 A100-40G GPUs, at a global batch size of 640 [25]
- Training throughput reached 90 billion tokens per day on pure-text data and 70 billion tokens per day on multimodal data [25] (the per-GPU arithmetic is worked through in the final sketch below)

Compression and Recognition Capabilities
- Using the visual modality as an efficient compression medium enables significantly higher compression rates than conventional text representations [9][10]
- The model supports recognition of nearly 100 languages, underscoring its versatility across diverse document types [42]
- It can parse complex layouts and extract structured data from charts, which is crucial for financial and scientific documents [35][40]
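Since the model is distributed on Hugging Face, the natural first experiment is loading it with transformers. Below is a minimal, hedged sketch: the repo id deepseek-ai/DeepSeek-OCR matches the release, but the custom infer entry point, its keyword arguments, and the prompt format are assumptions about how a trust_remote_code model of this kind typically exposes inference; consult the model card for the actual interface.

```python
# Hedged sketch: load DeepSeek-OCR and OCR a single document image.
# The `infer` method and its keyword arguments are assumptions, not a
# verified API; see the Hugging Face model card for the real interface.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model = model.eval().cuda().to(torch.bfloat16)

# Hypothetical call: convert a scanned page to text.
result = model.infer(
    tokenizer,
    prompt="<image>\nFree OCR.",   # prompt format assumed
    image_file="page.png",         # any document image
)
print(result)
```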
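The training-throughput numbers above also admit a quick sanity check. The arithmetic below is our own, derived only from the figures in this digest (20 nodes x 8 GPUs, 90B/70B tokens per day):

```python
# Per-GPU throughput implied by the reported cluster-wide numbers.
gpus = 20 * 8                     # 160 A100-40G GPUs in total
seconds_per_day = 24 * 60 * 60    # 86,400 s

for label, tokens_per_day in [("pure text", 90e9), ("multimodal", 70e9)]:
    per_gpu = tokens_per_day / gpus / seconds_per_day
    print(f"{label}: ~{per_gpu:,.0f} tokens/s per GPU")

# pure text:  ~6,510 tokens/s per GPU
# multimodal: ~5,064 tokens/s per GPU
```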