FSDrive Unifies VLA and World Models, Pushing Autonomous Driving Toward Visual Reasoning
36Kr · 2025-09-30 10:36
Core Insights
- FSDrive introduces a "Spatio-Temporal Chain-of-Thought" (CoT) that lets the model reason directly with images, addressing the limitations of existing methods that rely heavily on symbolic representations [1][4][17]

Group 1: Methodology
- The method uses a unified future image frame as the intermediate reasoning step, combining the predicted future scene with perception results for visual reasoning [4][17]
- FSDrive activates image generation in existing Multi-Modal Large Language Models (MLLMs) by expanding the vocabulary with visual tokens, avoiding major architectural changes [5][17]
- The approach employs a progressive visual CoT: it first generates coarse-grained perception maps (lane lines and 3D boxes) and then refines them into detailed future frames, which explicitly incorporates physical constraints [5][8]

Group 2: Performance Metrics
- In trajectory planning, FSDrive achieves a lower average L2 error (0.53 vs 0.70) and a lower collision rate (0.19 vs 0.21) than Doe-1 [9]
- For future frame generation, FSDrive achieves an FID score of 10.1, outperforming many diffusion-based world models while remaining real-time capable [11]
- The model also performs strongly in scene understanding, with a final score of 0.57, surpassing competitors such as OmniDrive [14]

Group 3: Applications and Implications
- FSDrive acts both as a "world model" for future frame generation and as an "inverse dynamics model" for trajectory planning, which improves its interpretability and decision-making [8][16]
- The framework's ability to reduce potential collisions through visual reasoning indicates practical applicability in real-world autonomous driving scenarios [16][17]
- The method's efficiency yields significant savings in data and computational cost, making it a competitive option in the evolving landscape of autonomous driving technologies [17]
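
The vocabulary-expansion idea noted under Methodology can be illustrated with a toy sketch. This is not FSDrive's implementation: the `expand_vocabulary` helper, the `<img_i>` token names, the embedding size, and the random initialization are all assumptions chosen for illustration. It only shows the general pattern of appending visual tokens to a text vocabulary while leaving the existing embedding rows untouched, so no architectural change is needed.

```python
import random


def expand_vocabulary(vocab, embeddings, num_visual_tokens, seed=0):
    """Append hypothetical visual tokens to a text vocabulary.

    vocab: dict mapping token string -> integer id (ids are 0..len-1)
    embeddings: list of embedding vectors, one per id
    Returns a new (vocab, embeddings) pair; the inputs are not modified.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    new_vocab = dict(vocab)
    new_embeddings = [list(row) for row in embeddings]
    for i in range(num_visual_tokens):
        token = f"<img_{i}>"  # hypothetical naming scheme, not from the source
        new_vocab[token] = len(new_embeddings)
        # New rows get small random values (an assumed initialization);
        # rows for the original text tokens are copied unchanged.
        new_embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return new_vocab, new_embeddings


# Toy text vocabulary with 2-dimensional embeddings.
text_vocab = {"<bos>": 0, "turn": 1, "left": 2}
text_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

vocab, table = expand_vocabulary(text_vocab, text_embeddings, num_visual_tokens=4)
```

After the call, the same token-to-id lookup covers both text tokens and the new visual tokens, so a single output head over the enlarged vocabulary could in principle emit either kind. How FSDrive actually tokenizes images and trains the new entries is not described in this summary.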