Farewell to Passive Perception! DriveAgent-R1: An Advanced Hybrid-Thinking Agent with Active Visual Exploration
自动驾驶之心· 2025-08-01 07:05
Today, 自动驾驶之心 shares the latest work from the Shanghai Qi Zhi Institute, Li Auto, Tongji University, and Tsinghua University: DriveAgent-R1! The era of autonomous driving agents has arrived, advancing VLM-based autonomous driving with hybrid thinking and active perception.

Paper authors | Weicheng Zheng et al.  Editor | 自动驾驶之心

Foreword & the author's personal understanding

DriveAgent-R1 is an advanced autonomous driving agent designed to tackle long-horizon, high-level behavioral decision-making. The potential of current VLMs in autonomous driving is limited by their short-sighted decision patterns and passive perception, leaving them insufficiently reliable in complex environments. To address these challenges, DriveAgent-R1 introduces two core innovations. Our core task is therefore to empower the agent to make long-horizon, high-level behavioral decisions while, when facing uncertainty, actively seeking key information from the environment, just as a human driver would. The figure above vividly illustrates DriveAgent-R1 ...
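To make these two ideas concrete, here is a minimal Python sketch of how a hybrid-thinking agent with an active-perception toolbox could be wired. Every name below (`VisualToolbox`, `estimate_uncertainty`, `plan_with_text_reasoning`, the uncertainty threshold) is an illustrative assumption, not DriveAgent-R1's actual interface.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of a hybrid-thinking driving agent with active perception.
# Names, methods, and thresholds are illustrative assumptions, not the paper's API.

@dataclass
class Observation:
    front_image: bytes          # raw camera frame
    scene_description: str      # text summary produced by the VLM


class VisualToolbox:
    """Active-perception tools the agent may call to gather extra visual evidence."""

    def __init__(self, tools: dict):
        self.tools = tools      # e.g. {"zoom_roi": fn, "retrieve_past_frames": fn}

    def call(self, name: str, **kwargs):
        return self.tools[name](**kwargs)


class HybridThinkingAgent:
    def __init__(self, vlm, toolbox: VisualToolbox, uncertainty_threshold: float = 0.5):
        self.vlm = vlm
        self.toolbox = toolbox
        self.uncertainty_threshold = uncertainty_threshold

    def decide(self, obs: Observation) -> List[str]:
        """Return a short horizon of high-level meta-actions (e.g. 'keep lane', 'slow down')."""
        # Mode selection: cheap text-only reasoning for simple scenes,
        # tool-assisted reasoning when the scene looks uncertain or complex.
        uncertainty = self.vlm.estimate_uncertainty(obs.scene_description)
        if uncertainty < self.uncertainty_threshold:
            return self.vlm.plan_with_text_reasoning(obs.scene_description)

        # Active perception: query a visual tool before committing to a plan.
        extra_evidence = self.toolbox.call("zoom_roi", image=obs.front_image, roi="far_intersection")
        return self.vlm.plan_with_tool_reasoning(obs.scene_description, extra_evidence)
```

The design choice mirrored here is that deciding when to pay for extra perception is itself part of the policy, rather than a fixed step in a pipeline.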
The Autonomous Driving Agent Is Here! DriveAgent-R1: A Hybrid-Thinking and Active-Perception Agent (Shanghai Qi Zhi Institute & Li Auto)
自动驾驶之心· 2025-07-29 23:32
Core Viewpoint
- DriveAgent-R1 represents a significant advance in autonomous driving, addressing long-horizon, high-level decision-making challenges through a hybrid thinking framework and an active perception mechanism [2][31].

Group 1: Innovations and Challenges
- DriveAgent-R1 introduces two core innovations: a novel three-stage progressive reinforcement learning strategy and MP-GRPO (Mode Grouped Reinforcement Policy Optimization), which strengthen the agent's specialized capabilities in both reasoning modes [3][12]. (A hedged sketch of mode-grouped advantage normalization appears after this summary.)
- The potential of current vision-language models (VLMs) in autonomous driving is limited by short-sighted decision-making and passive perception, particularly in complex environments [2][4].

Group 2: Hybrid Thinking and Active Perception
- The hybrid thinking framework lets the agent adaptively switch between efficient text-based reasoning and in-depth tool-assisted reasoning according to scene complexity [5][12].
- The active perception mechanism equips the agent with a powerful visual toolbox for actively exploring the environment, improving decision-making transparency and reliability [5][12].

Group 3: Training Strategy and Performance
- A complete three-stage progressive training strategy is designed, consisting of dual-mode supervised fine-tuning, forced comparative mode reinforcement learning, and adaptive mode selection reinforcement learning [24][29]. (A sketch of this staged schedule also appears after this summary.)
- DriveAgent-R1 achieves state-of-the-art (SOTA) performance on challenging datasets, surpassing leading multimodal models such as Claude Sonnet 4 and Gemini 2.5 Flash [12][26].

Group 4: Experimental Results
- Experimental results show that DriveAgent-R1 significantly outperforms baseline models: with visual tools, first-frame accuracy increases by 14.2% and sequence-average accuracy by 15.9% [26][27].
- The introduction of visual tools also enhances the decision-making of state-of-the-art VLMs, demonstrating the value of actively acquiring visual information for driving intelligence [27].

Group 5: Active Perception and Visual Dependency
- Active perception fosters a genuine reliance on vision: DriveAgent-R1's performance drops drastically when visual inputs are removed, confirming that its decisions are truly driven by visual data [30][31].
- The training strategy turns tools from potential distractions into performance amplifiers, underscoring the importance of structured training for using visual tools effectively [27][29].
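As a rough illustration of the "mode grouped" idea behind MP-GRPO (Group 1), the sketch below computes GRPO-style advantages by normalizing each rollout's reward against other rollouts of the same scene that used the same reasoning mode. This follows the general shape of group-relative policy optimization; the actual grouping rule, reward design, and loss in DriveAgent-R1 may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical sketch: GRPO-style advantages normalized within reasoning-mode groups.
# rollouts: [{"mode": "text" | "tool", "reward": float}, ...] sampled for one scene.

def mode_grouped_advantages(rollouts, eps=1e-6):
    by_mode = defaultdict(list)
    for i, r in enumerate(rollouts):
        by_mode[r["mode"]].append(i)

    advantages = [0.0] * len(rollouts)
    for mode, idxs in by_mode.items():
        rewards = [rollouts[i]["reward"] for i in idxs]
        mu, sigma = mean(rewards), pstdev(rewards)
        for i in idxs:
            # Each rollout is compared only against peers that used the same mode,
            # so text-only and tool-assisted reasoning are each pushed to improve
            # relative to their own group rather than competing directly.
            advantages[i] = (rollouts[i]["reward"] - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    sample = [
        {"mode": "text", "reward": 0.8},
        {"mode": "text", "reward": 0.5},
        {"mode": "tool", "reward": 0.9},
        {"mode": "tool", "reward": 0.7},
    ]
    print(mode_grouped_advantages(sample))
```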
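Finally, the three-stage progressive strategy from Group 3 could be driven by a schedule along these lines. The stage order follows the summary, while the trainer interface and the `forced_modes` flag are placeholders, not the paper's actual training code.

```python
# Hypothetical driver for the three-stage progressive training strategy described above.
# Stage names follow the summary; the trainer interface and mode flags are placeholders.

def run_three_stage_training(sft_trainer, rl_trainer, model, data):
    # Stage 1: dual-mode supervised fine-tuning on both text-only and
    # tool-assisted reasoning traces, so each mode starts from a competent baseline.
    model = sft_trainer.train(model, data, modes=["text", "tool"])

    # Stage 2: forced comparative mode RL - every scene is rolled out under
    # both modes (the mode is imposed, not chosen), letting rewards compare them.
    model = rl_trainer.train(model, data, forced_modes=["text", "tool"])

    # Stage 3: adaptive mode selection RL - the agent now picks the mode itself
    # and is optimized for making good mode choices end to end.
    model = rl_trainer.train(model, data, forced_modes=None)

    return model
```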