DeepSeek-OCR 2: Reshaping Complex Document Understanding with a "Causal Reading Order"
Investment Rating
- The report does not explicitly provide an investment rating for the industry or for the specific companies involved in DeepSeek-OCR 2's development.

Core Insights
- DeepSeek-OCR 2 represents a significant advance in document-understanding technology, particularly for complex layouts, thanks to a new visual encoder, DeepEncoder V2, which lets the model parse text, tables, and formulas more accurately and efficiently [12][14].
- The model scored 91.09% on the OmniDocBench v1.5 benchmark, placing it in the top tier of document-understanding models, with a notable improvement in reading-order accuracy [14].
- Its efficiency allows it to process complex documents with only 256 to 1120 visual tokens, significantly reducing computational load and latency for downstream applications [15].

Summary by Sections

Model Upgrade and Features
- DeepSeek-OCR 2 introduces a lightweight language model, Qwen2-500M, and a "causal flow query" mechanism that reorders visual tokens according to content logic, improving semantic continuity and recognition accuracy [13][14].
- The architecture gives the model a more human-like grasp of document structure, which is crucial for processing complex documents such as multi-column layouts and nested tables [12][13].

Performance Metrics
- DeepSeek-OCR 2's edit-distance metric improved from 0.085 to 0.057, validating its structure-first reading approach [14].
- Its performance is approaching that of industry leaders: its document-parsing edit distance of 0.100 outperforms Gemini 3 Pro [14].

Real-World Applications
- The model's open-source release and moderate parameter count (3 billion) ease integration into existing enterprise workflows, with potential applications in PDF-to-Markdown conversion and structured data extraction [15].
- Feedback from production environments indicates a significant reduction in text-duplication rates, suggesting improved reliability in practical applications [15].

Long-Term Vision
- DeepSeek-OCR 2 is framed as an exploration of architectural innovation, aiming to strengthen vision-language models and improve the generation of structured training data for large language models [16].
- The team has outlined clear iterative directions for future improvements, focusing on better performance on text-dense documents [16].
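The edit-distance figures quoted above (0.085, 0.057, 0.100) are normalized string-distance scores, where lower is better. As a minimal sketch of how such a metric is typically computed, here is a Levenshtein distance normalized by the longer string's length; the exact normalization used by OmniDocBench is an assumption here:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    # Classic dynamic program over prefix lengths, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length
    (assumed normalization; benchmarks vary in the exact denominator)."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))
```

A score of 0.057, for example, would mean roughly 5.7 edits per 100 characters of the longer of the predicted and reference transcriptions.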
A New Document-Understanding Benchmark Sends GPT-4o's Accuracy Plummeting and Exposes the Weak Spots of Large Models
机器之心 · 2025-05-24 04:07
Core Viewpoint
- The article presents WildDoc, a benchmark dataset for real-world document understanding, and highlights the limitations of existing multimodal large language models (MLLMs) in handling complex document scenarios [1][3][19].

Group 1: Limitations of Existing Models
- Current MLLMs show significant performance drops on WildDoc compared with traditional benchmarks such as DocVQA; GPT-4o, for instance, suffers an average accuracy decline of 35.3% [12][13].
- Existing benchmarks fail to simulate the complexities of real-world environments, casting doubt on the models' performance in practical applications [5][11].

Group 2: WildDoc Dataset
- WildDoc comprises over 12,000 manually captured document images that simulate challenges such as lighting, distortion, and varying camera angles, all of which are critical for assessing model robustness [3][7].
- The dataset introduces a consistency-score metric to evaluate model stability across different capture conditions, revealing performance bottlenecks in current MLLMs [3][19].

Group 3: Experimental Findings
- Physical distortions (wrinkles, bends) are the most challenging factors for model performance: GPT-4o's accuracy drops by 34.1-34.7% under such conditions [13][16].
- Non-frontal angles and poor image quality also degrade performance significantly, and larger models do not necessarily overcome the challenges posed by real-world scenarios [13][16].

Group 4: Future Directions
- The research team suggests several strategies for improving MLLMs: data augmentation to simulate real-world conditions, robust feature learning to enhance adaptability, and the incorporation of more real-world document images into training datasets [19].
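The consistency score mentioned above can be sketched as follows. The article does not spell out WildDoc's exact formula, so this sketch assumes a question counts as "consistent" only when the model answers it correctly under every capture condition of the same document; the data shape and function names are illustrative:

```python
from statistics import mean

def consistency_score(results: dict[str, dict[str, bool]]) -> float:
    """Fraction of questions answered correctly under *every* condition.

    `results` maps question id -> {condition name: answered correctly?},
    e.g. conditions like "scan", "wrinkle", "angle". This all-conditions
    criterion is an assumed reading of WildDoc's metric.
    """
    return mean(all(per_cond.values()) for per_cond in results.values())

def per_condition_accuracy(results: dict[str, dict[str, bool]]) -> dict[str, float]:
    """Plain accuracy broken down by capture condition, for comparison."""
    conditions = {c for per_cond in results.values() for c in per_cond}
    return {
        c: mean(per_cond.get(c, False) for per_cond in results.values())
        for c in conditions
    }
```

A model can post high per-condition accuracy yet a low consistency score if its errors fall on different questions under different conditions, which is exactly the instability such a metric is designed to surface.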