DeepSeek open-sources a brand-new OCR model! CLIP is dropped in favor of a lightweight Qwen model, with performance rivaling Gemini-3 Pro
量子位· 2026-01-27 08:32
Core Insights
- DeepSeek has released a new OCR model, DeepSeek-OCR 2, which focuses on accurately converting PDF documents to Markdown format [1]
- The model's key breakthrough is the dynamic rearrangement of visual tokens based on image semantics, moving away from traditional raster-scan ordering [2][3]
- DeepSeek-OCR 2 achieves performance comparable to Gemini-3 Pro while using a lightweight model [4]

Model Architecture
- DeepSeek-OCR 2 retains the classic architecture of its predecessor: an encoder and a decoder working in tandem [10]
- The encoder, now called DeepEncoder V2, replaces the previous CLIP component with a lightweight language model (Qwen2-0.5B), introducing causal reasoning capabilities [2][13]
- This upgrade allows visual tokens to be intelligently rearranged before they enter the main decoder, simulating human reading order [3][15]

Performance Metrics
- On the OmniDocBench v1.5 benchmark, DeepSeek-OCR 2 scored 91.09%, a 3.73% improvement over the baseline [5][35]
- The model's document-parsing edit distance improved from 0.085 to 0.057, demonstrating the effectiveness of the visual-information rearrangement [36]
- Under a similar token budget (1120), DeepSeek-OCR 2 outperformed Gemini-3 Pro on document-parsing edit distance [37]

Training and Evaluation
- Training for DeepSeek-OCR 2 follows a three-stage pipeline focused on semantic rearrangement and autoregressive inference [31]
- The model was evaluated on a dataset of 1355 pages spanning various document types, giving a comprehensive assessment of its capabilities [33][34]
- The design keeps the input token count stable between 256 and 1120, aligning with the visual budget of Gemini-1.5 Pro [27]

Conclusion
- DeepSeek-OCR 2 demonstrates significant advances in OCR technology, validating a language-model architecture as a visual encoder and paving the way toward unified omni-modal encoders [39]
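The contrast between raster scanning and semantics-driven token rearrangement can be illustrated with a toy sketch. Everything here is an illustrative assumption, not DeepSeek's actual method: the article says the real ordering is produced by a causal Qwen2-0.5B encoder, whereas this sketch simply sorts image patches by a made-up per-patch relevance score.

```python
import numpy as np

def raster_order(h, w):
    # Traditional raster scan: left-to-right, top-to-bottom,
    # ignoring what the page actually contains.
    return [(r, c) for r in range(h) for c in range(w)]

def semantic_reorder(patch_scores):
    # Hypothetical stand-in for the learned ordering: visit patches
    # in descending order of a per-patch relevance score.
    h, w = patch_scores.shape
    idx = [(r, c) for r in range(h) for c in range(w)]
    return sorted(idx, key=lambda rc: -patch_scores[rc])

# Toy 2x3 "page": higher score = more semantically salient region.
scores = np.array([[0.1, 0.9, 0.2],
                   [0.8, 0.3, 0.7]])
print(raster_order(2, 3))      # fixed geometric order
print(semantic_reorder(scores))  # content-dependent order
```

The point of the sketch is only that the token sequence fed to the decoder becomes a function of the image content rather than of patch geometry.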
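The edit-distance figures cited (0.085 improving to 0.057, lower is better) refer to a normalized distance between predicted and reference document text. A minimal sketch of such a metric, assuming plain character-level Levenshtein distance divided by the longer string's length (the benchmark's exact normalization may differ):

```python
def edit_distance(a, b):
    # Classic Levenshtein distance with a single rolling row.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def normalized_edit_distance(pred, ref):
    # Scale into [0, 1] so scores are comparable across page lengths.
    return edit_distance(pred, ref) / max(len(pred), len(ref), 1)

print(edit_distance("kitten", "sitting"))           # 3
print(normalized_edit_distance("# Title", "# Title"))  # 0.0
```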