DeepOCR
DeepSeek-OCR replicated in two weeks: a two-person team reproduces its low-token, high-compression core, and swapping the decoder makes it more practical
36Kr · 2025-11-07 07:11
Core Insights
- A two-person team has successfully replicated the previously acclaimed DeepSeek-OCR in just two weeks, naming their version DeepOCR, which retains the original's low token count and high compression advantages while matching its performance on key tasks [1][3].

Technology and Design
- DeepSeek-OCR's design philosophy focuses on "visual compression," using a small number of visual tokens to represent content that would typically require a large number of text tokens, thus reducing computational costs and addressing the challenges of processing long texts with large models [3][4].
- The core strategy of the two-person team was to accurately replicate the original's logical architecture, particularly the DeepEncoder encoder, which follows a three-stage structure: local processing, compression, and global understanding [6][9].
- The first stage processes high-resolution images into patches while keeping activation memory bounded; a compression stage then reduces the number of tokens while increasing feature dimensions; finally, a global attention stage captures document-level semantics without running into memory issues [6][7].

Performance Metrics
- The DeepOCR version uses approximately 250 visual tokens, which, while slightly less efficient than the DeepSeek-OCR Base version, is significantly more efficient than baseline models like Qwen2.5-VL-7B, which require 3,949 tokens for similar performance [15].
- In foundational tasks, DeepOCR excels in English text recognition and table parsing, with table parsing even outperforming the original version due to precise replication of the original's 2D spatial encoding [15][17].

Training Methodology
- The team employed a two-stage training process, freezing the DeepEncoder throughout, which significantly reduced memory requirements. The first stage trained a multi-modal projector, while the second stage involved pre-training the entire model [13][18].
- The training setup was designed to be compatible with the resources of small to medium teams, utilizing two H200 GPUs [13].

Future Developments
- The team plans to enhance the model by incorporating additional data types such as formulas, multilingual support, and old scans, as well as experimenting with techniques like dynamic temperature scaling and RLVR to further narrow performance gaps [18].

Team Background
- The team consists of Ming Liu, who has a background in applied physics and is currently pursuing a PhD in computer science, and Liu Shilong, who holds degrees in engineering and computer science and is a postdoctoral researcher at Princeton University [19][20].
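The three-stage encoder described above can be followed as a token-count walkthrough. The concrete numbers below (a 1024-pixel input, 16-pixel patches, a 16x convolutional compressor) are illustrative assumptions, not figures confirmed by the article; the article states only the three stages and the roughly 250-token output budget.

```python
# Back-of-envelope token accounting for a DeepEncoder-style three-stage pipeline.
# All concrete values (1024px input, 16px patches, 16x compression) are assumed
# for illustration, not taken from the article.

def encoder_token_budget(image_px: int = 1024, patch_px: int = 16,
                         compression: int = 16) -> dict:
    """Trace how many tokens each stage of the pipeline sees."""
    # Stage 1: local processing -- the image is cut into patches and handled
    # with windowed attention, so memory scales with window size, not image size.
    patch_tokens = (image_px // patch_px) ** 2

    # Stage 2: compression -- a convolutional downsampler trades token count
    # for feature width before any global attention runs.
    visual_tokens = patch_tokens // compression

    # Stage 3: global understanding -- full attention over the compressed
    # sequence; its O(n^2) cost is why compression happens first.
    return {"patch_tokens": patch_tokens, "visual_tokens": visual_tokens}

budget = encoder_token_budget()
print(budget)  # {'patch_tokens': 4096, 'visual_tokens': 256}
```

With these assumed values the output lands near the ~250 visual tokens reported for DeepOCR, which illustrates why the compressor sits between the local and global stages.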
DeepSeek-OCR replicated in two weeks! A two-person team reproduces its low-token, high-compression core, and swapping the decoder makes it more practical
量子位 (QbitAI) · 2025-11-07 05:32
Core Insights
- The article discusses the development of DeepOCR, a replica of the previously acclaimed DeepSeek-OCR, achieved by a small team in just two weeks, maintaining the original's advantages of low token usage and high compression [1][5].

Group 1: Technology and Design
- DeepSeek-OCR's design philosophy focuses on "visual compression," using a limited number of visual tokens to represent content that would typically require many text tokens, thus reducing computational costs associated with large models [4][6].
- The model achieves a compression ratio of 7-20 times, maintaining an accuracy of 97% even at 10-fold compression [7].
- The architecture of DeepSeek-OCR follows a three-stage structure: local processing, compression, and global understanding, which keeps memory usage manageable [10].

Group 2: Training and Performance
- DeepOCR is designed to be computationally lightweight, allowing it to be trained on just two H200 GPUs, making it accessible for small teams [21].
- The training process consists of two phases, with the first phase focusing on training a multi-modal projector while keeping the DeepEncoder frozen, significantly reducing memory requirements [20].
- In practical tests, DeepOCR uses approximately 250 visual tokens, which, while slightly less efficient than the original DeepSeek-OCR, is still significantly better than baseline models that require thousands of tokens for similar performance [22].

Group 3: Results and Future Plans
- DeepOCR shows strong performance in basic tasks such as English text recognition and table parsing, with table parsing even outperforming the original model due to precise restoration of the original 2D spatial encoding [24].
- The team plans to enhance the model by incorporating additional data types, including formulas and multi-language support, and exploring advanced techniques to further improve performance [28].
- The article highlights the team's academic backgrounds, showcasing their expertise in multi-modal fields and previous experience in notable tech companies [29][31].
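The token counts quoted across both summaries imply the effective compression ratio directly. A minimal check, using only the two figures the articles report (the ~250-token DeepOCR budget and the 3,949-token Qwen2.5-VL-7B baseline); the function itself is just the ratio definition:

```python
# Effective compression ratio implied by the reported token counts.
# 250 (DeepOCR) and 3949 (Qwen2.5-VL-7B baseline) come from the articles;
# everything else is plain arithmetic.

def compression_ratio(baseline_tokens: int, visual_tokens: int) -> float:
    """How many baseline text tokens each visual token stands in for."""
    return baseline_tokens / visual_tokens

ratio = compression_ratio(3949, 250)
print(f"{ratio:.1f}x")  # 15.8x
```

The resulting ~15.8x falls inside the 7-20x compression band the 量子位 summary cites for DeepSeek-OCR.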