DeepSeek Releases DeepSeek-OCR2, Teaching AI "Human Visual Logic"

Core Insights

- DeepSeek has launched the new DeepSeek-OCR2 model, which uses the innovative DeepEncoder V2 method to dynamically reorder image regions by semantic meaning, moving visual understanding beyond traditional left-to-right scanning (a conceptual sketch of this idea follows below) [1][2]
- The model significantly outperforms traditional vision-language models (VLMs) on complex layouts, scoring 91.09% on the OmniDocBench v1.5 benchmark, a 3.73% improvement over its predecessor [1]

Group 1

- DeepSeek-OCR2 maintains high accuracy while controlling computational cost, capping visual token counts between 256 and 1120, in line with Google's Gemini 3 Pro [2]
- In practical use, the model reduces repetition rates by 2.08% on online user logs and 0.81% on PDF pre-training data, indicating a high degree of practical maturity [2]

Group 2

- The release of DeepSeek-OCR2 represents not only an OCR performance upgrade but also a significant architectural exploration, validating the potential of using language-model architectures as visual encoders [2]
- The DeepEncoder V2 architecture inherits advances from the LLM community, such as the mixture-of-experts (MoE) architecture and efficient attention mechanisms (see the second sketch below) [2]
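
The "semantic reordering" claim is the architecturally interesting part. As a rough illustration only, here is a minimal PyTorch sketch of the general idea, not DeepSeek's actual method: every name below (SemanticReorderer, score_head, the 768-dimensional patches) is hypothetical, and only the 256-1120 token budget is taken from the article.

```python
# Hypothetical sketch, NOT DeepSeek's implementation: instead of emitting
# image patch tokens in fixed raster (left-to-right, top-to-bottom) order,
# score each patch with a small learned head, re-sort the sequence by that
# score, then clamp the sequence to a fixed visual-token budget.

import torch
import torch.nn as nn

class SemanticReorderer(nn.Module):
    def __init__(self, dim: int, min_tokens: int = 256, max_tokens: int = 1120):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)  # learned "reading priority" score
        self.min_tokens = min_tokens
        self.max_tokens = max_tokens

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim), in raster order
        scores = self.score_head(patch_tokens).squeeze(-1)  # (batch, num_patches)
        order = scores.argsort(dim=-1, descending=True)     # "semantic" order
        reordered = torch.gather(
            patch_tokens, 1,
            order.unsqueeze(-1).expand_as(patch_tokens),
        )
        # Enforce the visual-token budget by truncating low-priority patches.
        budget = max(self.min_tokens, min(self.max_tokens, reordered.size(1)))
        return reordered[:, :budget]

# Usage: 4096 raw patches are re-sorted by the score head, then cut to 1120.
tokens = torch.randn(2, 4096, 768)
print(SemanticReorderer(768)(tokens).shape)  # torch.Size([2, 1120, 768])
```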
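
For readers unfamiliar with the LLM-side techniques the article name-drops, the following is a minimal top-1 mixture-of-experts feed-forward layer in the same hypothetical style. It illustrates what "MoE" means in general, not how DeepEncoder V2 actually configures its experts; the real model's expert count, routing scheme, and dimensions are not given in the source.

```python
# Generic top-1 MoE feed-forward layer, for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, dim: int, hidden: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # picks an expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim); route each token to its single best expert.
        gate = F.softmax(self.router(x), dim=-1)  # (num_tokens, num_experts)
        weight, choice = gate.max(dim=-1)         # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                # Scale by the gate weight so routing stays differentiable.
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

print(MoEFeedForward(768, 3072)(torch.randn(10, 768)).shape)  # torch.Size([10, 768])
```

The appeal of MoE in this setting is that only one expert's weights run per token, so capacity grows without a proportional increase in per-token compute, which fits the article's emphasis on controlling computational cost.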