RewardMap: Solving Sparse Rewards in Fine-Grained Visual Reasoning via Multi-Stage Reinforcement Learning
机器之心 · 2025-10-21 03:43
Core Insights
- The article presents RewardMap, a multi-stage reinforcement learning framework designed to enhance the fine-grained visual reasoning capabilities of multi-modal large language models (MLLMs) in complex scenarios such as high-resolution subway maps [3][9][17].

Group 1: Problem Identification
- Recent advances in large language models (LLMs) and multi-modal large language models (MLLMs) have raised questions about their ability to interpret complex visual information, particularly in high-resolution, densely structured environments [3].
- The team's earlier work, ReasonMap, showed that even state-of-the-art MLLMs frequently make path-planning errors such as misreading lines, missing stations, and repeating routes [3][12].

Group 2: Proposed Solution
- RewardMap employs a multi-stage reinforcement learning framework that combines fine-grained rewards with a curriculum-based training scheme to improve MLLMs' visual understanding and spatial reasoning [3][10].
- RewardMap decomposes complex route-planning tasks into smaller, individually assessable sub-goals, yielding a more nuanced feedback signal than a binary correct/incorrect reward (see the sub-goal scoring and curriculum sketches at the end of this summary) [10][11].

Group 3: Implementation Details
- RewardMap builds on ReasonMap and includes a dataset covering 30 cities with 4,018 problem samples, categorized into five types to provide detailed supervision during the reinforcement learning phase [6][12].
- The reward function consists of three components: format compliance, final correctness, and detail, with difficulty weights applied to reflect the true complexity of each task (a hedged sketch of one possible composition follows this summary) [11][12].

Group 4: Performance Results
- RewardMap delivered consistent gains across benchmarks, with a maximum improvement of 13.51% on SpatialEval over conventional training methods [13][14].
- Qualitative comparisons showed that models trained with RewardMap exhibited fewer visual confusions and hallucinations and produced more accurate route information [14][15].

Group 5: Future Outlook
- Beyond the performance numbers, RewardMap offers a reusable reinforcement learning paradigm for high-resolution visual tasks by systematically decomposing complex problems into measurable sub-goals [17].
- The framework's effectiveness at improving the general capabilities of multi-modal large models has been validated, suggesting that real-world data such as maps will play a significant role in future development [18].
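
To make the sub-goal decomposition described in Group 2 concrete, here is a minimal sketch of how a predicted route could be scored against ground truth at the level of stations and lines rather than as a single pass/fail check. The function names, the choice of sub-goals, and the equal weighting are assumptions for illustration, not the authors' exact formulation.

```python
# Hypothetical sketch of sub-goal scoring for a planned transit route.
# The choice of sub-goals (stations visited, lines taken) and the equal
# weighting are illustrative assumptions, not the paper's exact scheme.

def _overlap(pred: list[str], gold: list[str]) -> float:
    """Fraction of ground-truth items recovered by the prediction."""
    if not gold:
        return 1.0
    return len(set(pred) & set(gold)) / len(set(gold))

def detail_score(pred_stations: list[str], gold_stations: list[str],
                 pred_lines: list[str], gold_lines: list[str]) -> float:
    """Score each sub-goal separately instead of one pass/fail check,
    so a partially correct route still earns partial credit."""
    station_r = _overlap(pred_stations, gold_stations)  # right stops visited
    line_r = _overlap(pred_lines, gold_lines)           # right lines taken
    return 0.5 * station_r + 0.5 * line_r
```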
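Building on that detail term, the sketch below shows one possible composition of the three reward components named in Group 3, with a difficulty weight applied on top. The article only states that these three components exist and that difficulty weights are used; the specific weight values and scaling rule here are assumptions.

```python
# Hypothetical composition of the three reward terms named in the article
# (format compliance, final correctness, detail) under a difficulty weight.
# The weight values and the scaling rule are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class RouteJudgment:
    format_ok: bool      # answer follows the required output format
    final_correct: bool  # planned route is fully correct end to end
    detail: float        # fine-grained sub-goal score in [0, 1],
                         # e.g. the detail_score sketched above
    difficulty: float    # task difficulty weight in [0, 1]

def reward(j: RouteJudgment,
           w_format: float = 0.1,
           w_final: float = 1.0,
           w_detail: float = 0.5) -> float:
    """Dense reward: partially correct routes still receive a signal
    through the detail term, instead of a binary correct/incorrect reward."""
    base = (w_format * float(j.format_ok)
            + w_final * float(j.final_correct)
            + w_detail * j.detail)
    return (1.0 + j.difficulty) * base  # harder tasks weigh more
```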
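Finally, the curriculum-based training mentioned in Group 2 could be organized roughly as below: tasks sorted by difficulty and split into stages, so early RL stages see easy maps with frequent reward and later stages see the dense, high-resolution ones. The stage count, the even split, and the training-loop names are illustrative assumptions, not the paper's recipe.

```python
# Hypothetical curriculum construction: sort tasks by difficulty and split
# them into stages for multi-stage RL. The stage count and the even split
# are illustrative assumptions.

from typing import Dict, List

def build_curriculum(samples: List[Dict], n_stages: int = 3) -> List[List[Dict]]:
    ordered = sorted(samples, key=lambda s: s["difficulty"])
    size = max(1, len(ordered) // n_stages)
    stages = [ordered[i * size:(i + 1) * size] for i in range(n_stages - 1)]
    stages.append(ordered[(n_stages - 1) * size:])  # remainder to last stage
    return stages

# Usage (train_one_stage is a hypothetical RL update loop):
# for batch in build_curriculum(dataset):
#     train_one_stage(policy, batch, reward_fn=reward)
```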