DeepSeek-OCR2: Reshaping Complex Document Understanding with "Causal Reading Order"

Investment Rating
- The report does not explicitly provide an investment rating for the industry or specific companies involved in the DeepSeek-OCR 2 development.

Core Insights
- DeepSeek-OCR 2 represents a significant advancement in document understanding technology, particularly in handling complex layouts by utilizing a new visual encoder, DeepEncoder V2, which enhances the model's ability to parse text, tables, and formulas more accurately and efficiently [12][14].
- The model has achieved a score of 91.09% on the OmniDocBench v1.5 benchmark, indicating it has entered the top tier of document understanding models, with a notable improvement in reading order accuracy [14].
- The model's efficiency allows it to process complex documents with only 256 to 1120 visual tokens, significantly reducing computational load and latency for downstream applications [15].

Summary by Sections

Model Upgrade and Features
- The DeepSeek-OCR 2 model introduces a lightweight language model, Qwen2-500M, and a "causal flow query" mechanism that reorganizes visual tokens based on content logic, improving semantic continuity and recognition accuracy [13][14].
- The model's architecture allows for a more human-like understanding of document structure, which is crucial for processing complex documents like multi-column layouts and nested tables [12][13].

Performance Metrics
- DeepSeek-OCR 2's edit-distance metric improved from 0.085 to 0.057, validating its structure-first reading approach [14].
- Compared to competitors, DeepSeek-OCR 2's performance is approaching industry leaders, with a document-parsing edit distance of 0.100, outperforming Gemini 3 Pro [14].

Real-World Applications
- The model's open-source nature and moderate parameter size (3 billion) facilitate its integration into existing enterprise workflows, with potential applications in PDF-to-Markdown conversion and structured data extraction [15].
- Feedback from production environments indicates a significant reduction in text duplication rates, suggesting improved reliability in practical applications [15].

Long-Term Vision
- The development of DeepSeek-OCR 2 is seen as an exploration of architectural innovation, aiming to enhance the capabilities of vision-language models and improve the generation of structured training data for large language models [16].
- The team has outlined clear iterative directions for future improvements, focusing on enhancing performance for text-dense documents [16].
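
For readers unfamiliar with the edit-distance figures cited under Performance Metrics, the sketch below shows one common way such a score can be computed: a character-level Levenshtein distance divided by the length of the longer string, so that 0 means an exact match and lower is better. This normalization is an assumption made for illustration; the report does not specify how OmniDocBench v1.5 normalizes the metric, and the function names here (`levenshtein`, `normalized_edit_distance`, `relative_reduction`) are hypothetical helpers, not part of any DeepSeek or benchmark API.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length.
    Assumed normalization; a benchmark may define it differently."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(predicted, reference) / longest


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop in an error metric, e.g. 0.085 -> 0.057."""
    return (before - after) / before


if __name__ == "__main__":
    # Figures quoted in the report's Performance Metrics section.
    drop = relative_reduction(0.085, 0.057)
    print(f"Relative reduction in edit distance: {drop:.1%}")
```

Under this reading, the move from 0.085 to 0.057 corresponds to roughly a one-third relative reduction in the error metric. A model that reads a two-column page in the wrong order would produce text whose characters are mostly correct but misplaced, which is exactly the kind of error an edit-distance score penalizes.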