Multimodal Chain of Thought (M-CoT)

Autonomous Driving Agents Are Here! DriveAgent-R1: An Agent with Intelligent Thinking and Active Perception (Shanghai Qi Zhi Institute & Li Auto)

自动驾驶之心 · 2025-07-29 23:32

Core Viewpoint
- DriveAgent-R1 targets long-horizon, high-level decision-making in autonomous driving by combining a hybrid thinking framework with an active perception mechanism [2][31].

Group 1: Innovations and Challenges
- DriveAgent-R1 introduces two core innovations: a three-stage progressive reinforcement learning strategy, and MP-GRPO (Mode Grouped Reinforcement Policy Optimization), which strengthens the agent's capabilities in each of its two reasoning modes [3][12].
- The potential of Vision-Language Models (VLMs) in autonomous driving is currently limited by short-sighted decision-making and passive perception, particularly in complex environments [2][4].

Group 2: Hybrid Thinking and Active Perception
- The hybrid thinking framework lets the agent switch adaptively between efficient text-based reasoning and in-depth tool-assisted reasoning, depending on scene complexity [5][12].
- The active perception mechanism gives the agent a visual toolbox for actively exploring the environment, which improves the transparency and reliability of its decisions [5][12].

Group 3: Training Strategy and Performance
- Training proceeds in three progressive stages: dual-mode supervised fine-tuning, forced comparative mode reinforcement learning, and adaptive mode selection reinforcement learning [24][29].
- DriveAgent-R1 achieves state-of-the-art (SOTA) performance on challenging datasets, surpassing leading multimodal models such as Claude Sonnet 4 and Gemini 2.5 Flash [12][26].

Group 4: Experimental Results
- DriveAgent-R1 significantly outperforms baseline models; with visual tools, first-frame accuracy increases by 14.2% and sequence-average accuracy by 15.9% [26][27].
- Adding visual tools also improves the decision-making of state-of-the-art VLMs, which shows the value of actively acquiring visual information for driving intelligence [27].

Group 5: Active Perception and Visual Dependency
- Active perception depends on genuine visual grounding: DriveAgent-R1's performance drops drastically when visual inputs are removed, which confirms that its decisions are driven by visual data [30][31].
- The training strategy turns tools from a potential distraction into a performance amplifier, which underlines the importance of structured training for using visual tools well [27][29].
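
The adaptive switch described in Group 2 can be illustrated with a small Python sketch. It is not the DriveAgent-R1 implementation: the summary does not say how scene complexity is measured or which tools the toolbox contains, so the numeric `complexity` score, the `threshold`, and the tool names `zoom_region` and `estimate_depth` are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Decision:
    mode: str  # "text" (reasoning only) or "tool" (tool-assisted reasoning)
    tools_used: List[str] = field(default_factory=list)
    evidence: Dict[str, str] = field(default_factory=dict)


def choose_mode(complexity: float, threshold: float = 0.5) -> str:
    """Hypothetical rule: scenes at or above the threshold get tool-assisted reasoning."""
    return "tool" if complexity >= threshold else "text"


def decide(scene: dict,
           toolbox: Dict[str, Callable[[dict], str]],
           threshold: float = 0.5) -> Decision:
    mode = choose_mode(scene["complexity"], threshold)
    if mode == "text":
        # Cheap path: no tool calls, so no extra evidence is gathered.
        return Decision(mode="text")
    # Active perception: call every tool and keep what each one reported,
    # so the final decision can be traced back to concrete observations.
    evidence = {name: tool(scene) for name, tool in toolbox.items()}
    return Decision(mode="tool", tools_used=list(evidence), evidence=evidence)


# Hypothetical stand-in tools; a real toolbox would operate on camera images.
toolbox = {
    "zoom_region": lambda scene: f"zoomed into {scene.get('region', 'front')}",
    "estimate_depth": lambda scene: "depth map computed",
}

simple = decide({"complexity": 0.2}, toolbox)
complex_ = decide({"complexity": 0.9, "region": "left intersection"}, toolbox)
print(simple.mode, simple.tools_used)
print(complex_.mode, complex_.tools_used)
```

In the summarized system the mode choice is learned through the adaptive mode selection stage of training rather than set by a fixed threshold; the threshold here only makes the two paths easy to see.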