Core Insights
- DriveAgent-R1 is an advanced autonomous driving agent designed for long-horizon, high-level behavioral decision-making, combining a hybrid-thinking framework with active perception to strengthen decision-making in complex environments [3][4][32].

Innovation and Methodology
- DriveAgent-R1 introduces two core innovations: a novel three-stage progressive reinforcement learning strategy and a mode-grouping algorithm (MP-GRPO) that sharpens the agent's dual-mode specialization, laying the groundwork for autonomous exploration [4][13].
- The agent's decision-making is driven by active perception: it proactively seeks out information to reduce uncertainty, which is crucial for safe and reliable driving [5][6][32].

Performance Metrics
- DriveAgent-R1 achieved state-of-the-art (SOTA) performance on the challenging SUP-AD dataset, surpassing leading multimodal models such as Claude Sonnet 4 and Gemini 2.5 Flash [4][13][27].
- With visual tools enabled, the model showed significant accuracy gains: first-frame accuracy improved by 14.2% and sequence-average accuracy by 15.9% [27][28].

Training Strategy
- Training proceeds in three phases: dual-mode supervised fine-tuning (DM-SFT), forced comparative mode reinforcement learning (FCM-RL), and adaptive mode selection reinforcement learning (AMS-RL), which together teach the agent to choose the optimal thinking mode for the context at hand [24][30].
- This gradual curriculum turned visual tools from a potential distraction into a performance amplifier, markedly improving the agent's decision-making [28][30].

Active Perception and Visual Tools
- Active perception is built into DriveAgent-R1 through a robust visual toolkit that lets the agent actively explore its environment, enhancing its perceptual robustness [5][19].
- The visual toolkit includes high-resolution view retrieval, region-of-interest inspection, depth estimation, and 3D object detection, which together improve the agent's ability to make informed decisions under uncertainty [19][20].

Experimental Results
- The experiments confirm that reinforcement learning (RL) is critical for unlocking the agent's potential: RL-trained variants significantly outperform those trained through supervised fine-tuning alone [29][30].
- DriveAgent-R1's performance relies heavily on visual input; accuracy drops sharply when visual information is removed, underscoring the importance of its active perception mechanism [31].
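The summary does not spell out how MP-GRPO's mode grouping works. As a rough, hypothetical sketch (not the paper's implementation): GRPO computes relative advantages by normalizing each rollout's reward against its peer group, and a mode-grouped variant might restrict that normalization to rollouts sharing the same thinking mode, so fast text-only reasoning and slow tool-using reasoning never compete directly. The mode names and function below are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def mode_grouped_advantages(rollouts):
    """Hypothetical sketch of a mode-grouped GRPO-style advantage.

    `rollouts` is a list of (mode, reward) pairs, where `mode` is
    "text" (fast, text-only reasoning) or "tool" (slower reasoning
    with visual tool calls). These names are illustrative, not from
    the paper.
    """
    # Group rewards by thinking mode.
    groups = defaultdict(list)
    for mode, reward in rollouts:
        groups[mode].append(reward)

    advantages = []
    for mode, reward in rollouts:
        rewards = groups[mode]
        mu = mean(rewards)
        sigma = pstdev(rewards) if len(rewards) > 1 else 0.0
        # The advantage is the reward's z-score within its own mode
        # group, so the two modes are never normalized against each
        # other; 1e-8 guards against a zero-variance group.
        advantages.append((reward - mu) / (sigma + 1e-8))
    return advantages

# Example: a batch with two rollouts per mode.
advs = mode_grouped_advantages([
    ("text", 1.0), ("text", 0.0),
    ("tool", 0.8), ("tool", 0.2),
])
```

With this per-mode normalization, the better rollout in each group gets an advantage near +1 and the worse near -1, regardless of the absolute reward scale of its mode.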
Goodbye, passive perception! DriveAgent-R1: an advanced hybrid-thinking agent with active visual exploration
自动驾驶之心·2025-08-01 07:05