DeepSeek-OCR 2: Reshaping Complex Document Understanding with a "Causal Reading Order"
Investment Rating
- The report does not explicitly provide an investment rating for the industry or for the specific companies involved in DeepSeek-OCR 2's development.

Core Insights
- DeepSeek-OCR 2 represents a significant advance in document-understanding technology, particularly for complex layouts, thanks to a new visual encoder, DeepEncoder V2, which lets the model parse text, tables, and formulas more accurately and efficiently [12][14].
- The model scored 91.09% on the OmniDocBench v1.5 benchmark, placing it in the top tier of document-understanding models, with a notable improvement in reading-order accuracy [14].
- Its efficiency allows it to process complex documents with only 256 to 1120 visual tokens, significantly reducing computational load and latency for downstream applications [15].

Summary by Sections

Model Upgrade and Features
- DeepSeek-OCR 2 introduces a lightweight language model, Qwen2-500M, and a "causal flow query" mechanism that reorders visual tokens according to content logic, improving semantic continuity and recognition accuracy [13][14].
- The architecture gives the model a more human-like grasp of document structure, which is crucial for processing complex documents such as multi-column layouts and nested tables [12][13].

Performance Metrics
- DeepSeek-OCR 2's edit-distance metric improved from 0.085 to 0.057, validating its structure-first reading approach [14].
- Its performance is approaching that of industry leaders: its document-parsing edit distance of 0.100 outperforms Gemini 3 Pro [14].

Real-World Applications
- The model's open-source release and moderate parameter count (3 billion) ease integration into existing enterprise workflows, with potential applications in PDF-to-Markdown conversion and structured data extraction [15].
- Feedback from production environments indicates a significant reduction in text-duplication rates, suggesting improved reliability in practical applications [15].

Long-Term Vision
- DeepSeek-OCR 2 is framed as an exploration of architectural innovation, aiming to strengthen vision-language models and improve the generation of structured training data for large language models [16].
- The team has outlined clear iterative directions for future improvements, focusing on better performance on text-dense documents [16].
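The edit-distance figures quoted above (0.085, 0.057, 0.100) are normalized string-distance scores, where lower is better. As a minimal sketch of how such a metric is typically computed, here is a Levenshtein distance normalized by the longer string's length; the exact normalization used by OmniDocBench is an assumption here:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    # Classic dynamic program over prefix lengths, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length
    (assumed normalization; benchmarks vary in the exact denominator)."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))
```

A score of 0.057, for example, would mean roughly 5.7 edits per 100 characters of the longer of the predicted and reference transcriptions.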
A New Document-Understanding Benchmark Sends GPT-4o's Accuracy Plummeting and Exposes the Weak Spots of Large Models
机器之心 · 2025-05-24 04:07
Core Viewpoint
- The article presents WildDoc, a benchmark dataset for real-world document understanding, and highlights the limitations of existing multimodal large language models (MLLMs) in handling complex document scenarios [1][3][19].

Group 1: Limitations of Existing Models
- Current MLLMs show significant performance drops on WildDoc compared with traditional benchmarks such as DocVQA; GPT-4o, for instance, suffers an average accuracy decline of 35.3% [12][13].
- Existing benchmarks fail to simulate the complexities of real-world environments, casting doubt on the models' performance in practical applications [5][11].

Group 2: WildDoc Dataset
- WildDoc comprises over 12,000 manually captured document images that simulate challenges such as lighting, distortion, and varying camera angles, all of which are critical for assessing model robustness [3][7].
- The dataset introduces a consistency-score metric to evaluate model stability across different capture conditions, revealing performance bottlenecks in current MLLMs [3][19].

Group 3: Experimental Findings
- Physical distortions (wrinkles, bends) are the most challenging factors for model performance: GPT-4o's accuracy drops by 34.1-34.7% under such conditions [13][16].
- Non-frontal angles and poor image quality also degrade performance significantly, and larger models do not necessarily overcome the challenges posed by real-world scenarios [13][16].

Group 4: Future Directions
- The research team suggests several strategies for improving MLLMs: data augmentation to simulate real-world conditions, robust feature learning to enhance adaptability, and the incorporation of more real-world document images into training datasets [19].
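The consistency score mentioned above can be sketched as follows. The article does not spell out WildDoc's exact formula, so this sketch assumes a question counts as "consistent" only when the model answers it correctly under every capture condition of the same document; the data shape and function names are illustrative:

```python
from statistics import mean

def consistency_score(results: dict[str, dict[str, bool]]) -> float:
    """Fraction of questions answered correctly under *every* condition.

    `results` maps question id -> {condition name: answered correctly?},
    e.g. conditions like "scan", "wrinkle", "angle". This all-conditions
    criterion is an assumed reading of WildDoc's metric.
    """
    return mean(all(per_cond.values()) for per_cond in results.values())

def per_condition_accuracy(results: dict[str, dict[str, bool]]) -> dict[str, float]:
    """Plain accuracy broken down by capture condition, for comparison."""
    conditions = {c for per_cond in results.values() for c in per_cond}
    return {
        c: mean(per_cond.get(c, False) for per_cond in results.values())
        for c in conditions
    }
```

A model can post high per-condition accuracy yet a low consistency score if its errors fall on different questions under different conditions, which is exactly the instability such a metric is designed to surface.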