ByteDance Open-Sources Dolphin, a High-Precision Document Parsing Model: Lightweight, Efficient, and Outperforming GPT-4.1 and Mistral-OCR!

Core Viewpoint
- ByteDance has open-sourced Dolphin, a new document parsing model that improves on existing models in both accuracy and efficiency on document analysis tasks [1][2].

Model Performance
- Dolphin is a lightweight model that parses documents nearly twice as fast as the most efficient baseline while surpassing GPT-4.1, Claude3.5-Sonnet, and Mistral-OCR in document parsing accuracy [2][13].
- Its architecture follows a two-stage method, layout parsing followed by content parsing, which addresses common issues in traditional OCR and multimodal models [6][9].

Technical Innovations
- The "analyze-then-parse" paradigm lets Dolphin avoid the error accumulation that comes from chaining multiple OCR models, and it makes autoregressive decoding more efficient [6][8].
- The first stage generates a sequence of document elements in natural reading order; the second stage uses these elements as anchors for parallel content recognition [9].

Benchmark Comparisons
- Dolphin achieved state-of-the-art performance on a range of page-level and element-level parsing tasks, outperforming both integration-based methods and larger VLMs [11][12].
- On plain documents, Dolphin recorded an edit distance of 0.0114 in English and 0.0131 in Chinese (lower is better), outperforming specialized VLMs such as GOT and general VLMs such as GPT-4.1 [14].
- On complex documents, Dolphin achieved an edit distance of 0.1283, better than all baseline models [15].

Efficiency Metrics
- Dolphin's parallel parsing design reaches 0.1729 frames per second (FPS), nearly double that of the most efficient baseline, Mathpix [16].
- With only 322 million parameters, the model remains competitive with much larger models [13].
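The edit-distance figures in the benchmark comparisons above are easier to interpret with the metric written out. The following is a minimal sketch, assuming the scores are a normalized Levenshtein distance (character edits divided by the length of the longer string, so 0 is a perfect match). The source does not state the exact normalization, and the function names here are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so the DP row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; assumed normalization, lower is better."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(prediction, reference) / longest
```

Under this assumed definition, a score of 0.0114 would correspond to roughly one wrong character per 88 characters of reference text.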
Element-Level Parsing
- Dolphin demonstrated competitive results in text paragraph parsing, formula recognition, and table parsing, achieving high scores across the corresponding benchmarks [18][19][20].
- The model captures both the structural relationships and the cell content of tables, showing that it can handle complex document layouts [19].

Practical Applications
- Real-world examples show Dolphin accurately recognizing and efficiently processing multi-column academic papers, complex formulas, and bilingual tables [21].
- The model's output includes visualizations of the layout analysis and the parsing results for individual elements, demonstrating its practical utility in document processing [23][26].
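The two-stage "analyze-then-parse" flow that produces these layout and per-element outputs can be sketched in a few lines. This is an illustrative outline only, not Dolphin's actual code or API: `Element`, `analyze_layout`, `parse_element`, `parse_page`, and the `PROMPTS` table are hypothetical stand-ins, the model calls are replaced by stubs, and a thread pool stands in for whatever parallel decoding the real model performs.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    order: int    # position in natural reading order
    kind: str     # e.g. "paragraph", "table", "formula"
    bbox: tuple   # (x0, y0, x1, y1) region on the page


# Hypothetical per-type instructions for the content-parsing stage.
PROMPTS = {
    "paragraph": "read text",
    "table": "parse table",
    "formula": "read formula",
}


def analyze_layout(page: dict) -> list:
    """Stage 1 stub: return the page's elements in reading order."""
    return [Element(i, kind, bbox) for i, (kind, bbox) in enumerate(page["regions"])]


def parse_element(page: dict, element: Element) -> str:
    """Stage 2 stub: recognize one element, using it as an anchor."""
    prompt = PROMPTS.get(element.kind, "read text")
    return f"[{prompt}] region {element.bbox}"


def parse_page(page: dict) -> list:
    """Run layout analysis once, then parse all elements in parallel."""
    elements = analyze_layout(page)
    with ThreadPoolExecutor() as pool:
        # map() yields results in input order, so reading order is preserved.
        contents = list(pool.map(lambda e: parse_element(page, e), elements))
    return list(zip(elements, contents))


page = {"regions": [("paragraph", (0, 0, 100, 40)),
                    ("formula", (0, 50, 100, 70)),
                    ("table", (0, 80, 100, 160))]}
result = parse_page(page)
```

The point of the sketch is the shape of the pipeline: one sequential pass fixes the reading order, after which each element can be recognized independently, so the second stage can run in parallel.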