Interactive Reinforcement Learning
AGILE: A New Paradigm for Visual Learning! Self-Supervision + Interactive Reinforcement Learning Boost VLMs' Perception and Reasoning Across the Board
机器之心· 2025-10-20 07:48
Core Insights
- Existing Vision-Language Models (VLMs) show significant limitations in fine-grained visual understanding, and their reasoning capabilities have not been fully activated [2]
- AGILE introduces a novel self-supervised learning paradigm that enhances VLMs' visual perception and reasoning through an interactive, agent-based approach [2][22]

Methodology
- AGILE employs a jigsaw "puzzle" task as an efficient agent task that couples perception with reasoning in a controllable, verifiable interactive form [8]
- Training consists of two phases: a cold-start phase that uses Gemini 2.5 Pro to generate 1.6K high-quality expert puzzle interaction trajectories, and a reinforcement learning phase that trains on 15.6K images with the GRPO algorithm (a minimal sketch of the group-relative advantage follows this summary) [9][10]

Experimental Results
- On the simplest 2x2 puzzle task, AGILE improved accuracy from 9.5% to 82.8%, surpassing Gemini 2.5 Pro by 36.4 percentage points; on the more challenging 3x3 puzzle, accuracy rose from 0.4% to 20.8% [13]
- Performance is reported with two metrics: Acc, the proportion of puzzles in which every block is placed correctly, and Score, the proportion of individual blocks placed correctly (see the metric sketch after this summary) [13][14]

Generalization Capability
- After puzzle training, the model improved by an average of 3.1% across nine general visual tasks, indicating strong generalization capabilities [15]

Scaling Experiments
- The study examined how the scale of puzzle data affects performance: expanding the training data from 0 to 16K raised puzzle-task accuracy from 22.0% to 82.8% [20]
- In a 20K-sample training mix, replacing 10K of conventional QA data with puzzle data yielded better performance, highlighting the potential of puzzle tasks to alleviate data scarcity in multi-modal reinforcement learning [20]
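The summary above says both trainings use the GRPO algorithm with verifiable rewards. Below is a minimal, generic sketch of the group-relative advantage at the heart of GRPO; it is an illustration of the technique, not the paper's training code, and the reward values and group size in the example are made up.

```python
# Minimal sketch of GRPO's group-relative advantage (generic illustration, not
# the AGILE training code). GRPO samples a group of rollouts per prompt, scores
# each with a verifiable reward, and normalizes within the group instead of
# using a learned critic.
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Return A_i = (r_i - mean(r)) / (std(r) + eps) for one group of rollouts."""
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]


# Example: a group of 4 rollouts on one puzzle, rewarded by the fraction of
# blocks placed correctly.
print(group_relative_advantages([1.0, 0.25, 0.5, 0.25]))
```

These normalized values then plug into a PPO-style clipped policy objective, which is what makes verifiable tasks like the puzzle a natural fit.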
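The two evaluation metrics read almost identically in the source summary, so here is a small sketch that makes the assumed distinction concrete: Acc credits a puzzle only when every block is correct, while Score gives partial credit per block. The function name and input format are assumptions for illustration, not the paper's evaluation code.

```python
# Hypothetical sketch of the two puzzle metrics (names and I/O format assumed).
from typing import Sequence, Tuple


def puzzle_metrics(
    predictions: Sequence[Sequence[int]],  # predicted block order per puzzle
    solutions: Sequence[Sequence[int]],    # ground-truth block order per puzzle
) -> Tuple[float, float]:
    """Return (acc, score): exact-match rate and per-block correctness rate."""
    exact_matches = 0
    correct_blocks = 0
    total_blocks = 0
    for pred, gold in zip(predictions, solutions):
        matches = sum(p == g for p, g in zip(pred, gold))
        correct_blocks += matches
        total_blocks += len(gold)
        if matches == len(gold):
            exact_matches += 1
    return exact_matches / len(solutions), correct_blocks / total_blocks


# Example: two 2x2 puzzles; one solved exactly, one with 2 of 4 blocks correct.
acc, score = puzzle_metrics(
    predictions=[[0, 1, 2, 3], [0, 2, 1, 3]],
    solutions=[[0, 1, 2, 3], [0, 1, 2, 3]],
)
print(acc, score)  # 0.5 0.75
```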
Task-Level Rewards Boost App Agents' Reasoning: TaoTian Proposes Mobile-R1, with a 3B Model Surpassing 32B
量子位· 2025-07-20 02:49
Core Insights
- The article discusses the limitations of existing Mobile/APP agents that rely primarily on action-level rewards, which restrict their adaptability in dynamic environments [1][2]
- A new interactive reinforcement learning framework, Mobile-R1, is proposed; it incorporates task-level rewards to enhance agent adaptability and exploration [5][30]
- Mobile-R1 is trained in three stages: format fine-tuning, action-level training, and task-level training, which collectively improve the model's performance [6][31]

Summary by Sections

Existing Limitations
- Current Mobile/APP agents struggle with real-time adaptability because they rely on action-level rewards, making it difficult to handle changing mobile environments [1][2]
- An example illustrates how existing models fail on complex multi-step tasks [3]

Proposed Solution
- A collaboration between TaoTian Group's algorithm team and the Future Life Lab introduces a multi-round, task-oriented learning approach that combines online learning with trajectory correction [4]
- Mobile-R1 uses task-level rewards, which guide agents through complex tasks more effectively than per-action rewards (a hedged sketch of the two reward granularities follows this summary) [5]

Training Methodology
- The training process is divided into three stages:
  1. **Format Fine-tuning**: initial supervised fine-tuning on high-quality trajectory data [16]
  2. **Action-level Training**: group relative policy optimization (GRPO) with action-level rewards that evaluate the correctness of each action [17]
  3. **Task-level Training**: multi-step, task-level training that enhances the model's generalization and exploration [18][20]

Experimental Results
- Mobile-R1 demonstrated superior performance across various benchmarks, achieving a task success rate of 49.40%, significantly higher than the best baseline model [26]
- The results indicate that the three-stage training process effectively improves the model's robustness and adaptability, particularly in dynamic environments [29][30]
- The article concludes that Mobile-R1's integration of interactive reinforcement learning and task-level rewards significantly enhances the capabilities of visual-language-model-based mobile agents [30][32]
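To make the action-level vs. task-level distinction concrete, here is a hedged sketch of the two reward granularities. The function names, data types, reward weighting, and checker callbacks are assumptions for illustration only; they are not Mobile-R1's actual interfaces.

```python
# Illustrative contrast between per-step (action-level) and whole-trajectory
# (task-level) rewards; names, types, and weights are assumed, not Mobile-R1's.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    raw_output: str       # model output for this step (should contain a parsable action)
    action_correct: bool  # whether the single action matched the reference action


def action_level_reward(step: Step) -> float:
    """Stage-2 style reward: score each step in isolation."""
    return 1.0 if step.action_correct else 0.0


def task_level_reward(
    trajectory: List[Step],
    task_completed: Callable[[List[Step]], bool],
    format_ok: Callable[[str], bool],
) -> float:
    """Stage-3 style reward: score the whole multi-step trajectory.

    A format term keeps outputs parsable, and a completion term (e.g. a check on
    the final screen state) rewards finishing the task, allowing the agent to
    explore and recover from intermediate mistakes.
    """
    fmt = sum(format_ok(s.raw_output) for s in trajectory) / max(len(trajectory), 1)
    done = 1.0 if task_completed(trajectory) else 0.0
    return 0.5 * fmt + 0.5 * done


# Example usage with trivial checkers (assumptions for illustration only).
traj = [Step("<action>click(5)</action>", True), Step("<action>done()</action>", False)]
print(task_level_reward(traj,
                        task_completed=lambda t: t[-1].action_correct,
                        format_ok=lambda s: s.startswith("<action>")))  # 0.5
```

The point of the contrast: a per-action reward cannot credit a trajectory that deviates from the reference steps yet still completes the task, whereas the trajectory-level signal can.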