DeepSeek-OCR 2 Launches: AI Learns "Human Visual Logic" and Reads Images via Causal Flow
Wallstreetcn (华尔街见闻) · 2026-01-27 09:56

Core Viewpoint
- DeepSeek has launched DeepSeek-OCR 2, which uses the DeepEncoder V2 method to let AI read images in a human-like logical sequence, potentially transforming document processing and complex visual understanding applications [1][12].

Group 1: Technical Innovations
- DeepEncoder V2 lets the model dynamically reorder image segments according to their meaning rather than following a rigid left-to-right scan, mimicking human visual perception [1][5].
- DeepSeek-OCR 2 scored 91.09% on the OmniDocBench v1.5 benchmark, a 3.73% improvement over its predecessor [1][10].
- The model maintains high accuracy while controlling computational cost: its visual token count is kept between 256 and 1120, in line with Google's Gemini-3 Pro [2][8].

Group 2: Performance Metrics
- In practical use, repetition rates fell from 6.25% to 4.17% on online user logs and from 3.69% to 2.88% on PDF data processing, indicating that the model is mature enough for real-world deployment [2][10].
- The reading-order edit distance (lower is better) dropped from 0.085 to 0.057, supporting the effectiveness of DeepEncoder V2's logical reordering [10].

Group 3: Architectural Changes
- DeepEncoder V2 replaces the original CLIP component with a compact LLM-style architecture (Qwen2-0.5B) and introduces learnable query vectors called "causal flow tokens" [6][8].
- The design retains bidirectional attention for visual tokens while applying causal attention to the causal flow tokens, allowing visual information to be reordered intelligently [7][8].
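
The mixed attention pattern described in Group 3 can be illustrated with a small mask-building sketch. This is not DeepSeek's implementation. It assumes the visual tokens come first in the sequence and the causal flow tokens follow, that every flow token can see all visual tokens, and that visual tokens do not attend to flow tokens. The function names `build_attention_mask` and `render` are hypothetical.

```python
def build_attention_mask(num_visual: int, num_flow: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout: positions 0..num_visual-1 are visual tokens,
    the remaining num_flow positions are causal flow tokens.
    """
    total = num_visual + num_flow
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_visual:
                # Bidirectional part: every token sees every visual token.
                mask[i][j] = True
            elif i >= num_visual:
                # Causal part: a flow token sees itself and earlier flow tokens.
                mask[i][j] = j <= i
            # Otherwise (visual token looking at a flow token) stays False;
            # this is an assumption of the sketch, not a confirmed detail.
    return mask


def render(mask: list[list[bool]]) -> list[str]:
    """Render each mask row as a string of 1s (allowed) and 0s (blocked)."""
    return ["".join("1" if allowed else "0" for allowed in row) for row in mask]


if __name__ == "__main__":
    # 3 visual tokens followed by 3 causal flow tokens.
    for row in render(build_attention_mask(3, 3)):
        print(row)
```

With three visual tokens and three flow tokens, the top-left block is fully open (bidirectional) and the bottom-right block is lower-triangular (causal). Under these assumptions, each successive flow token can build on what earlier flow tokens have already summarized.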

Group 4: Future Implications
- The release of DeepSeek-OCR 2 is not only an upgrade in OCR performance but also a significant architectural exploration, pointing to a promising path toward unified multimodal encoders that can extract features across images, audio, and text [12].
