Breaking the High-Resolution Image Reasoning Bottleneck: Fudan University and Nanyang Technological University Propose MGPO, a Multi-Turn Reinforcement Learning Framework Based on Visual Grounding
机器之心 · 2025-07-21 04:04

Core Insights
- The article presents MGPO, a multi-turn reinforcement learning method that strengthens the visual reasoning of large multi-modal models (LMMs) on high-resolution images [1][8][21]
- MGPO lets an LMM predict the coordinates of question-relevant key regions and crop the corresponding sub-images, improving its ability to focus on relevant information without requiring expensive grounding annotations [2][21]

Summary by Sections

Introduction
- Current LMMs, such as Qwen2.5-VL, struggle with high-resolution images because each image is converted into a large number of visual tokens, many of which are irrelevant to the task [5][6]
- The human visual system uses a task-driven visual search strategy; MGPO aims to replicate this by enabling LMMs to focus on the key regions of an image [6][7]

Method Overview
- MGPO frames inference as a multi-step visual reasoning process: the model first predicts key-region coordinates, then crops the corresponding sub-image for further reasoning [10][21] (see the sketch after this summary)
- The method avoids the main limitation of traditional visual grounding models, which require extensive grounding annotations for training [7][21]

Key Innovations of MGPO
- A top-down, interpretable visual reasoning mechanism that enables LMMs to conduct question-driven visual searches [2]
- Accurate localization of relevant region coordinates in high-resolution images, even when the visual-token budget is limited [2]
- Training on standard Visual Question Answering (VQA) datasets with no additional grounding annotations, using answer correctness as the only feedback signal [2][21]

Experimental Results
- MGPO delivered significant gains over SFT and GRPO, improving benchmark scores by 5.4% and 5.2% respectively [18][19]
- Despite being trained on a smaller dataset, the model outperformed OpenAI's models [18][19]
- The proportion of valid grounding coordinates produced by MGPO rose markedly during training, indicating that the model develops robust visual grounding autonomously [20]

Conclusion
- MGPO effectively addresses visual-token redundancy and the loss of key information in high-resolution image processing [21]
- The method shows that reinforcement learning can foster robust grounding capabilities without costly annotations, improving the efficiency of LMMs [21]
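
The article describes MGPO's rollout only in prose, so the following is a minimal Python sketch of the two-turn loop from Method Overview: predict key-region coordinates, crop the sub-image, then answer from the crop. The `policy.generate` interface, the (left, upper, right, lower) box format, the prompts, and the exact-match reward are all illustrative assumptions rather than the authors' implementation; the one point taken directly from the article is that training feedback comes solely from final-answer correctness, with no box-level supervision.

```python
# Minimal sketch of an MGPO-style two-turn rollout. `policy.generate` is a
# hypothetical stand-in for the LMM's decoding interface; the real system
# operates on visual tokens inside a multi-turn RL training loop.
from PIL import Image


def clamp_box(box, size):
    """Clip a predicted box to the image bounds. Invalid coordinates are a
    plausible early-training failure; the article reports that the share of
    valid grounding coordinates rises markedly as RL training proceeds."""
    w, h = size
    left, upper, right, lower = box
    left, right = sorted((max(0, min(left, w)), max(0, min(right, w))))
    upper, lower = sorted((max(0, min(upper, h)), max(0, min(lower, h))))
    return left, upper, right, lower


def mgpo_rollout(policy, image: Image.Image, question: str):
    """Turn 1: predict the key region for the question.
    Turn 2: crop that region and reason over it to produce the answer."""
    # Turn 1: a question-driven visual search over the full high-res image.
    # Assumed output format: (left, upper, right, lower) in pixel space.
    box = policy.generate(image, prompt=f"Locate the region relevant to: {question}")
    box = clamp_box(box, image.size)
    # Cropping keeps the visual-token budget focused on relevant content.
    sub_image = image.crop(box)
    # Turn 2: answer using both the full image and the zoomed-in crop.
    answer = policy.generate([image, sub_image], prompt=question)
    return box, answer


def answer_reward(predicted: str, gold: str) -> float:
    """Per the article, MGPO trains on standard VQA pairs and rewards only
    final-answer correctness -- no grounding-box annotations are needed.
    Exact string match is an illustrative stand-in for the real checker."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0
```

Because the reward touches only the final answer, any grounding skill the model exhibits, such as the rising share of valid boxes noted in Experimental Results, emerges as a by-product of optimizing answer correctness.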