High-Fidelity Image Processing
How Can Large Models Read Charts Accurately? Microsoft Research Asia Teaches Them to "See, Act, and Reason"
QbitAI (量子位) · 2025-11-03 03:12
Core Insights
- The article covers PixelCraft, a system developed by Microsoft Research Asia with Tsinghua University and the Hong Kong University of Science and Technology. It aims to improve the understanding of structured images through high-fidelity image processing and nonlinear multi-agent reasoning [2][31].

Group 1: Challenges in Structured Image Understanding
- Existing models struggle with structured images such as charts and scientific drawings, because these images require both pixel-level detail and symbolic abstraction, a combination that current methods do not handle well [3][4].
- Linear "chain-of-thought" reasoning does not support the backtracking and branching exploration that complex tasks require [2][5].

Group 2: PixelCraft's Approach
- PixelCraft addresses these challenges on two fronts: accurate perception ("seeing clearly") and flexible reasoning ("thinking flexibly") [5].
- The system consists of a dispatcher, a planner, a reasoner, a visual critic, a planning critic, and a set of visual tool agents, which work together on structured image understanding [7][31].

Group 3: High-Fidelity Image Processing
- A fine-tuned grounding model maps textual references to pixel-level coordinates, which supports a semi-automated process for generating image-editing tools [10][13].
- A three-stage workflow covers tool selection, collaborative discussion with backtracking, and self-review with re-planning. This lets the system use memory selectively and reduces the burden of long contexts [7][18].

Group 4: Performance Improvements
- PixelCraft improves performance on benchmarks such as CharXiv, ChartQAPro, and EvoChart, with consistent gains across different underlying models [23][32].
- High-fidelity localization and a closed-loop tool approach reduce error propagation, which improves the accuracy and robustness of reasoning over structured images [18][33].

Group 5: Experimental Results
- The comparative results indicate that PixelCraft outperforms methods such as VisualCoT on structured image tasks, and they underline the importance of selective memory and discussion-based backtracking [27][28].
- Chart-specific tools, such as subplot cropping and auxiliary line annotation, are identified as essential for effective reasoning over structured images [29][30].
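
To make the two chart tools mentioned above (subplot cropping and auxiliary line annotation) more concrete, here is a minimal Python sketch. It is not PixelCraft's implementation: the function names `crop_subplot` and `draw_horizontal_line`, the list-of-lists image representation, and the bounding box are all assumptions chosen for illustration. In a real pipeline the box would presumably come from the grounding model, and an image library would replace the plain lists.

```python
from typing import List, Tuple

# Assumed toy representation: row-major grayscale pixels, 0 (black) to 255 (white).
Image = List[List[int]]
# Assumed box convention: (left, top, right, bottom), with right/bottom exclusive.
Box = Tuple[int, int, int, int]


def crop_subplot(image: Image, box: Box) -> Image:
    """Return a copy of the region inside `box`, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


def draw_horizontal_line(image: Image, y: int, value: int = 0) -> Image:
    """Return a copy of `image` with row `y` set to `value` (an auxiliary guide line)."""
    if not 0 <= y < len(image):
        raise ValueError(f"row {y} is outside the image")
    annotated = [row[:] for row in image]  # copy so the input stays unchanged
    annotated[y] = [value] * len(annotated[y])
    return annotated


# Toy 4x6 "figure" made of white pixels.
figure = [[255] * 6 for _ in range(4)]
# Hypothetical box that a grounding model might return for the right-hand subplot.
subplot = crop_subplot(figure, (3, 0, 6, 4))
# Draw a black guide line across row 2 of the cropped subplot.
annotated = draw_horizontal_line(subplot, y=2)
```

Both helpers return new images instead of editing in place, so a later step can discard an annotation and return to the earlier image, which fits the backtracking described in the article.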