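DeepSeek-OCR Replicated in Two Weeks! A Two-Person Team Reproduces the Low-Token, High-Compression Core, and a Swapped-In Decoder Makes It More Practical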
QbitAI (量子位) · 2025-11-07 05:32

Core Insights
- The article covers DeepOCR, a replica of the previously acclaimed DeepSeek-OCR that a small team built in just two weeks while keeping the original's advantages of low token usage and high compression [1][5].

Group 1: Technology and Design
- DeepSeek-OCR's design philosophy centers on "visual compression": a small number of visual tokens represents content that would otherwise require many text tokens, which reduces the computational cost of running large models [4][6].
- The model achieves a compression ratio of 7 to 20 times and maintains 97% accuracy even at 10-fold compression [7].
- The DeepSeek-OCR architecture has three stages (local processing, compression, and global understanding), which helps keep memory usage under control [10].

Group 2: Training and Performance
- DeepOCR is designed to have low compute requirements: it can be trained on just two H200 GPUs, which puts it within reach of small teams [21].
- Training consists of two phases. The first phase trains a multi-modal projector while keeping the DeepEncoder frozen, which significantly reduces memory requirements [20].
- In practical tests, DeepOCR uses approximately 250 visual tokens. This is slightly less efficient than the original DeepSeek-OCR, but still significantly better than baseline models that need thousands of tokens for similar performance [22].

Group 3: Results and Future Plans
- DeepOCR performs well on basic tasks such as English text recognition and table parsing. Its table parsing even outperforms the original model, which the article attributes to a precise restoration of the original 2D spatial encoding [24].
- The team plans to improve the model by adding more data types, including formulas and multi-language support, and by exploring more advanced techniques to further improve performance [28].
- The article also highlights the team's academic backgrounds, noting their expertise in multi-modal research and their previous experience at notable tech companies [29][31].
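
To make the compression figures above concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration of the ratio "text tokens per visual token": the helper names `compression_ratio` and `text_token_range` are hypothetical, and the 2,500-token page is an assumed example rather than a number from the article. The article reports the 7x to 20x range for DeepSeek-OCR and the roughly 250-token budget for DeepOCR, but it does not state a compression ratio for DeepOCR itself, so combining the two here is an assumption.

```python
from typing import Tuple


def compression_ratio(text_tokens: int, visual_tokens: int) -> float:
    """Return how many text tokens each visual token stands in for."""
    if visual_tokens <= 0:
        raise ValueError("visual_tokens must be positive")
    return text_tokens / visual_tokens


def text_token_range(
    visual_tokens: int, low: float = 7.0, high: float = 20.0
) -> Tuple[int, int]:
    """Text-token counts covered by a visual-token budget at ratios low..high.

    The 7x and 20x defaults are the range the article reports for
    DeepSeek-OCR; applying them to any other model is an assumption.
    """
    return round(visual_tokens * low), round(visual_tokens * high)


if __name__ == "__main__":
    # Assumed example: a page whose text would cost 2,500 text tokens,
    # encoded with a 250-visual-token budget.
    print(compression_ratio(2500, 250))
    # Text-token span that 250 visual tokens would cover at 7x to 20x.
    print(text_token_range(250))
```

Under these assumptions, a page that would cost 2,500 text tokens is represented at 10x compression, the operating point at which the article cites 97% accuracy for DeepSeek-OCR.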
