ICML 2025 Spotlight | Professor Dacheng Tao's team at Nanyang Technological University and collaborators propose a RAG-based high-resolution image perception framework that improves accuracy by 20%
机器之心· 2025-05-16 16:31
Core Viewpoint

The article discusses Retrieval-Augmented Perception (RAP), a training-free method that enhances multi-modal large language models (MLLMs) for high-resolution image perception [3][29].

Group 1: Challenges in High-Resolution Image Processing

- Traditional MLLMs struggle with high-resolution images: fixed input resolutions force aggressive downsampling and thus a loss of visual information [1][2].
- Current workarounds include cropping high-resolution images into smaller segments, using visual encoders that accept higher resolutions, and search-based methods that build tree structures over the image for retrieval [2][3].

Group 2: Introduction of RAP

- RAP applies retrieval-augmented generation (RAG) techniques to improve MLLMs' perception of high-resolution images [3][29].
- The work has been accepted at ICML 2025 as a Spotlight paper, indicating its significance in the field [3].

Group 3: Experimental Findings

- The research examines how retrieved image segments should be laid out, how the number of retrieved segments affects performance, and how to apply RAG effectively within MLLMs [6][11].
- Maintaining the relative positions of retrieved image segments is crucial, especially for tasks requiring spatial awareness [10][15].
- The optimal number of retrieved segments differs by task: fewer segments help single-instance perception tasks (FSP), while multi-instance perception tasks (FCP) need more segments [14][24].

Group 4: Methodology of RAP

- RAP uses a Spatial-Awareness Layout algorithm that reduces resolution while preserving the relative positions of the retained image segments [16][19] (a sketch of this idea follows after Group 5).
- The RE-Search component adapts the number of retained segments based on retrieval similarity scores and model confidence, further improving performance [20][22] (see the second sketch after Group 5).

Group 5: Performance Results

- Experiments show that RAP substantially improves high-resolution image perception, achieving up to a 21% accuracy gain on the HR-Bench datasets [25][26].
- Compared with existing search-based methods, RAP delivers both higher throughput and higher accuracy [27].
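The Spatial-Awareness Layout idea in Group 4 can be illustrated with a minimal sketch: crop the high-resolution image into a grid, score each patch against the query with any image-text similarity model, keep the top-k patches, and reassemble them on a smaller canvas that preserves their relative row/column order. The function names, the random stand-in scores, and the grid parameters below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def crop_into_grid(image, patch_size):
    """Split a high-resolution image of shape (H, W, C) into a grid of patches,
    keyed by their (row, col) position in the grid."""
    h, w = image.shape[:2]
    rows, cols = h // patch_size, w // patch_size
    patches = {}
    for r in range(rows):
        for c in range(cols):
            patches[(r, c)] = image[r * patch_size:(r + 1) * patch_size,
                                    c * patch_size:(c + 1) * patch_size]
    return patches, rows, cols

def spatial_aware_layout(patches, scores, k, patch_size):
    """Keep the k most query-relevant patches and reassemble them on a compact
    canvas that preserves their relative row/column ordering."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    # Re-index only the rows and columns that survive, so relative order is kept
    # while the overall resolution shrinks.
    kept_rows = sorted({r for r, _ in top})
    kept_cols = sorted({c for _, c in top})
    channels = patches[top[0]].shape[2]
    canvas = np.zeros((len(kept_rows) * patch_size,
                       len(kept_cols) * patch_size, channels),
                      dtype=patches[top[0]].dtype)
    for (r, c) in top:
        rr, cc = kept_rows.index(r), kept_cols.index(c)
        canvas[rr * patch_size:(rr + 1) * patch_size,
               cc * patch_size:(cc + 1) * patch_size] = patches[(r, c)]
    return canvas

# Example usage: in practice the scores would come from an image-text
# similarity model (e.g. CLIP); random values stand in for that step here.
rng = np.random.default_rng(0)
image = rng.integers(0, 255, size=(1024, 1024, 3), dtype=np.uint8)
patches, rows, cols = crop_into_grid(image, patch_size=256)
scores = {key: float(rng.random()) for key in patches}
small_view = spatial_aware_layout(patches, scores, k=4, patch_size=256)
```

The point of re-indexing rows and columns rather than packing patches arbitrarily is exactly the finding in Group 3: tasks that need spatial awareness degrade when the relative positions of retrieved segments are lost.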
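RE-Search is described as adapting the number of retained segments using retrieval similarity scores and model confidence. The sketch below shows one way such a confidence-guided choice of k could look; `answer_confidence`, the candidate k values, and the weighting `alpha` are hypothetical illustrations, not the paper's exact procedure.

```python
def re_search(ranked_patches, scores, answer_confidence,
              candidate_ks=(4, 9, 16), alpha=0.5):
    """Pick how many retrieved patches to keep by trading off retrieval
    similarity against the MLLM's confidence in its answer.

    ranked_patches: patch keys sorted by retrieval similarity, descending.
    scores: mapping from patch key to similarity score in [0, 1].
    answer_confidence: callable taking the kept patch keys and returning the
        MLLM's confidence for that layout (e.g. a normalized answer
        log-probability) in [0, 1].
    """
    best_k, best_value = candidate_ks[0], float("-inf")
    for k in candidate_ks:
        kept = ranked_patches[:k]
        mean_sim = sum(scores[p] for p in kept) / len(kept)  # retrieval signal
        conf = answer_confidence(kept)                       # model-side signal
        value = alpha * mean_sim + (1.0 - alpha) * conf      # combined criterion
        if value > best_value:
            best_k, best_value = k, value
    return best_k
```

Combining the two signals reflects the task dependence noted in Group 3: a small k tends to win when a single region answers the question (FSP-style tasks), while the confidence term pushes toward a larger k when the model needs broader context (FCP-style tasks).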