MTRDrive
MTRDrive: An Autonomous Driving VLA Framework with Dynamic Interactive Reasoning (Tsinghua & Xiaomi)
自动驾驶之心· 2025-09-28 23:33
Core Insights
- The article discusses the MTRDrive framework, which models autonomous driving as a dynamic interactive reasoning process, addressing the limitations of traditional static decision-making approaches [4][9][50]
- MTRDrive integrates a memory-tool synergy mechanism to enhance perception accuracy and reasoning reliability, significantly improving the model's robustness in long-tail and out-of-distribution (OOD) scenarios [4][13][50]

Group 1: Challenges in Autonomous Driving
- Current vision-language-action (VLA) models face significant challenges in long-term reasoning and high-level decision-making, particularly in complex scenarios with few or no training samples [3][5]
- Robust driving decisions depend on perception accuracy and reasoning reliability working closely together, much as human drivers draw on accumulated experience for dynamic prediction and adaptive adjustment [3][8]

Group 2: MTRDrive Framework
- MTRDrive is a new framework proposed by teams from Tsinghua University, Xiaomi Auto, McGill University, and the University of Wisconsin-Madison, which moves beyond the limitations of traditional static decision-making [4][9]
- The framework includes a memory-tool collaborative mechanism that enhances the model's perception accuracy and supports robust decision-making in long-term and high-level tasks [4][15]

Group 3: Experimental Validation
- Systematic experiments demonstrate that MTRDrive significantly improves generalization and robustness in long-tail and OOD scenarios, providing a new technical pathway for deploying autonomous agents in complex real-world environments [4][34]
- In high-level planning tasks, MTRDrive achieved a planning accuracy of 82.6% on the NAVSIM dataset, more than double that of the Qwen2.5-VL-72B model [40]

Group 4: Memory and Tool Interaction
- MTRDrive incorporates a structured driving experience repository that allows the model to retrieve relevant past experiences, enhancing its decision-making capabilities [15][19]
- The framework employs a visual toolset that enables the model to actively probe the visual environment for high-fidelity information, improving its perception capabilities [21][28]

Group 5: Training Methodology
- MTRDrive uses a two-phase training process: supervised fine-tuning (SFT) to teach basic skills, followed by reinforcement learning fine-tuning (RLFT) to optimize decision-making capabilities [24][29]
- The introduction of a memory retrieval mechanism significantly enhances the model's ability to generalize skills to new, unseen driving scenarios, as evidenced by improved performance metrics [44]
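
The experience-retrieval idea in Group 4 can be illustrated with a minimal sketch. The article does not describe how MTRDrive's repository is stored or queried, so everything here is an assumption made for illustration: the `Experience` schema, the toy three-dimensional embeddings, and the use of cosine similarity with top-k ranking are hypothetical and are not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Experience:
    """One stored driving experience (hypothetical schema, not from the paper)."""
    embedding: list  # toy scene feature vector
    note: str        # what was decided in that scene


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memory, query, k=2):
    """Return the k stored experiences most similar to the query embedding."""
    ranked = sorted(memory, key=lambda e: cosine(e.embedding, query), reverse=True)
    return ranked[:k]


# Toy repository; the embeddings and notes are invented for the example.
memory = [
    Experience([1.0, 0.0, 0.0], "slow down near construction cones"),
    Experience([0.0, 1.0, 0.0], "yield to pedestrian at crosswalk"),
    Experience([0.9, 0.1, 0.0], "change lane around blocked work zone"),
]

query = [1.0, 0.05, 0.0]  # embedding of the current scene (assumed given)
top = retrieve(memory, query, k=2)
for exp in top:
    print(exp.note)
```

In a setup like this, the retrieved notes would be passed to the model as extra context before it decides on an action; how MTRDrive actually conditions on retrieved experiences is not specified in this summary.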