Say Goodbye to Passive Perception! DriveAgent-R1: An Advanced Hybrid-Thinking Agent with Active Visual Exploration
自动驾驶之心· 2025-08-01 07:05
Core Insights
- DriveAgent-R1 is an advanced autonomous driving agent designed to tackle long-horizon, high-level behavioral decision-making, leveraging a hybrid thinking framework and active perception to strengthen decision-making in complex environments [3][4][32].

Innovation and Methodology
- DriveAgent-R1 introduces two core innovations: a novel three-stage progressive reinforcement learning strategy and a mode-grouping algorithm (MP-GRPO) that sharpens the agent's dual-mode specialization, laying the groundwork for autonomous exploration [4][13].
- The agent's decision-making is driven by active perception: it proactively seeks information to reduce uncertainty, which is crucial for safe and reliable driving [5][6][32].

Performance Metrics
- DriveAgent-R1 achieved state-of-the-art (SOTA) performance on the challenging SUP-AD dataset, surpassing leading multimodal models such as Claude Sonnet 4 and Gemini 2.5 Flash [4][13][27].
- With visual tools enabled, the model showed significant gains in accuracy: first-frame accuracy increased by 14.2% and sequence-average accuracy by 15.9% [27][28].

Training Strategy
- The training strategy consists of three phases: dual-mode supervised fine-tuning (DM-SFT), forced comparative mode reinforcement learning (FCM-RL), and adaptive mode selection reinforcement learning (AMS-RL), which together teach the agent to choose the optimal thinking mode for the context [24][30].
- This gradual training approach turned the potential distraction of visual tools into a performance amplifier, significantly improving the agent's decision-making capabilities [28][30].

Active Perception and Visual Tools
- Active perception is built into DriveAgent-R1 through a robust visual toolkit that lets the agent actively explore its environment, enhancing its perceptual robustness [5][19].
- The visual toolkit includes high-resolution view retrieval, region-of-interest inspection, depth estimation, and 3D object detection, which together improve the agent's ability to make informed decisions under uncertainty [19][20].

Experimental Results
- Experiments confirmed that reinforcement learning (RL) is critical to unlocking the agent's potential: RL-trained variants significantly outperform those trained solely through supervised fine-tuning [29][30].
- DriveAgent-R1's performance relies heavily on visual input; accuracy drops drastically when visual information is removed, underscoring the importance of its active perception mechanism [31].
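The hybrid-thinking dispatch summarized above — efficient text-only reasoning for simple scenes, tool-assisted reasoning when the scene is uncertain — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Observation` class, `select_mode` threshold, and the four toolkit entries (mirroring the tools named in the article) are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class ThinkingMode(Enum):
    TEXT = "text"   # efficient text-based reasoning
    TOOL = "tool"   # in-depth, tool-assisted reasoning

@dataclass
class Observation:
    frames: list         # camera frames in the current window
    uncertainty: float   # agent's estimated scene uncertainty in [0, 1]

# Hypothetical stand-ins for the four visual tools named in the article.
VISUAL_TOOLS = {
    "high_res_view": lambda obs: f"retrieved {len(obs.frames)} high-res views",
    "roi_inspect":   lambda obs: "zoomed into region of interest",
    "depth_estimate": lambda obs: "per-pixel depth map",
    "detect_3d":     lambda obs: "3D bounding boxes",
}

def select_mode(obs: Observation, threshold: float = 0.5) -> ThinkingMode:
    """Adaptive mode selection: simple scenes stay in text mode;
    uncertain scenes trigger active perception via visual tools."""
    return ThinkingMode.TOOL if obs.uncertainty > threshold else ThinkingMode.TEXT

def decide(obs: Observation) -> dict:
    mode = select_mode(obs)
    evidence = []
    if mode is ThinkingMode.TOOL:
        # Actively query tools to reduce uncertainty before committing to a plan.
        evidence = [tool(obs) for tool in VISUAL_TOOLS.values()]
    return {"mode": mode.value, "evidence": evidence}
```

The key design point this illustrates is that tool calls are conditional: the agent pays the cost of active perception only when its own uncertainty warrants it.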
The Autonomous Driving Agent Is Here! DriveAgent-R1: An Agent with Intelligent Thinking and Active Perception (Shanghai Qi Zhi Institute & Li Auto)
自动驾驶之心· 2025-07-29 23:32
Core Viewpoint
- DriveAgent-R1 represents a significant advance in autonomous driving technology, addressing long-horizon, high-level decision-making through a hybrid thinking framework and an active perception mechanism [2][31].

Group 1: Innovations and Challenges
- DriveAgent-R1 introduces two core innovations: a novel three-stage progressive reinforcement learning strategy and MP-GRPO (Mode-Grouped Reinforcement Policy Optimization), which strengthens the agent's dual-mode specialization [3][12].
- The potential of Vision-Language Models (VLMs) in autonomous driving is currently limited by short-sighted decision-making and passive perception, particularly in complex environments [2][4].

Group 2: Hybrid Thinking and Active Perception
- The hybrid thinking framework lets the agent adaptively switch between efficient text-based reasoning and in-depth tool-assisted reasoning according to scene complexity [5][12].
- The active perception mechanism equips the agent with a powerful visual toolbox for actively exploring the environment, improving the transparency and reliability of its decisions [5][12].

Group 3: Training Strategy and Performance
- A complete three-stage progressive training strategy is designed around dual-mode supervised fine-tuning, forced comparative mode reinforcement learning, and adaptive mode selection reinforcement learning [24][29].
- DriveAgent-R1 achieves state-of-the-art (SOTA) performance on challenging datasets, surpassing leading multimodal models such as Claude Sonnet 4 and Gemini 2.5 Flash [12][26].

Group 4: Experimental Results
- DriveAgent-R1 significantly outperforms baseline models, with first-frame accuracy up 14.2% and sequence-average accuracy up 15.9% when using visual tools [26][27].
- Giving state-of-the-art VLMs access to the visual tools also improves their decision-making, demonstrating the value of actively acquired visual information for driving intelligence [27].

Group 5: Active Perception and Visual Dependency
- Active perception fosters deep visual reliance: DriveAgent-R1's performance drops drastically when visual inputs are removed, confirming that its decisions are genuinely driven by visual data [30][31].
- The training strategy turns the potential distraction of tools into a performance amplifier, showing the importance of structured training for using visual tools effectively [27][29].
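Both articles name MP-GRPO only at a high level. A minimal sketch of the mode-grouped idea — standardizing each rollout's reward against rollouts of the *same* thinking mode, in GRPO's usual group-normalization style, so text-mode and tool-mode trajectories are not scored against each other directly — might look like the following. The function name and the exact grouping rule are assumptions for illustration, not the paper's algorithm.

```python
from statistics import mean, pstdev

def mode_grouped_advantages(rollouts):
    """Compute a GRPO-style advantage for each rollout, but normalize
    within its own thinking-mode group rather than across all rollouts.

    `rollouts` is a list of (mode, reward) pairs, e.g. ("text", 1.0).
    Returns one advantage per rollout, in input order.
    """
    # Bucket rewards by thinking mode.
    groups = {}
    for mode, reward in rollouts:
        groups.setdefault(mode, []).append(reward)

    # Per-group mean and (population) standard deviation.
    stats = {m: (mean(rs), pstdev(rs)) for m, rs in groups.items()}

    # Standardize each reward against its own mode's statistics.
    advantages = []
    for mode, reward in rollouts:
        mu, sigma = stats[mode]
        advantages.append((reward - mu) / (sigma + 1e-8))
    return advantages
```

With this grouping, a mediocre tool-assisted rollout is not unfairly rewarded just because tool use tends to score higher on hard scenes (or vice versa), which is one plausible reading of why mode grouping would preserve dual-mode specialization during RL.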