Spatio-Temporal Visual CoT
FSDrive Unifies VLA and World Models, Advancing Autonomous Driving Toward Visual Reasoning
36Kr · 2025-09-30 10:36
Core Insights
- FSDrive introduces a "Spatio-Temporal Chain-of-Thought" (CoT) that lets models reason directly with images, addressing the limitations of existing methods that rely heavily on symbolic representations [1][4][17]

Group 1: Methodology
- The proposed method uses a unified future image frame as an intermediate reasoning step, integrating future scenarios and perception results for visual reasoning [4][17]
- FSDrive activates image-generation capability in existing multimodal large language models (MLLMs) by expanding the vocabulary with visual tokens, avoiding major architectural changes [5][17]
- A progressive visual CoT starts from coarse-grained perception maps (lane lines and 3D boxes) and gradually refines them into detailed future frames, explicitly incorporating physical constraints [5][8]

Group 2: Performance Metrics
- In trajectory planning, FSDrive achieves a lower average L2 error (0.53 vs 0.70) and collision rate (0.19 vs 0.21) than Doe-1 [9]
- For future-frame generation quality, FSDrive reaches an FID of 10.1, outperforming many diffusion-based world models while remaining real-time [11]
- In scene understanding, the model posts a final score of 0.57, surpassing competitors such as OmniDrive [14]

Group 3: Applications and Implications
- FSDrive's dual role as a "world model" for future-frame generation and an "inverse dynamics model" for trajectory planning improves interpretability and decision-making [8][16]
- Its ability to reduce potential collisions through visual reasoning reflects practical applicability in real-world autonomous driving scenarios [16][17]
- The method's efficiency yields significant data and compute savings, making it competitive in the evolving autonomous-driving landscape [17]
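The trajectory-planning numbers above (average L2 error of 0.53 vs 0.70) refer to the distance between predicted and ground-truth ego waypoints over the planning horizon. A minimal sketch of that metric, with a toy trajectory and a function name (`average_l2_error`) of our own choosing, not from FSDrive's code:

```python
import numpy as np

def average_l2_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth
    waypoints, averaged over the planning horizon (in metres)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy 3-waypoint trajectories in ego-frame (x, y) coordinates;
# each predicted point is 0.5 m short of the ground truth.
pred = [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]
gt   = [[0.0, 1.5], [0.0, 2.5], [0.0, 3.5]]
print(average_l2_error(pred, gt))  # 0.5
```

Open-loop planning benchmarks typically report this error at fixed horizons (e.g. 1 s, 2 s, 3 s) and then average them; the collision rate is computed separately by checking predicted waypoints against other agents' occupancy.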
NeurIPS 2025 Spotlight | FSDrive Unifies VLA and World Models, Advancing Autonomous Driving Toward Visual Reasoning
机器之心 · 2025-09-30 08:45
Core Insights
- The article introduces FSDrive, a novel approach that uses a "Spatio-Temporal Chain-of-Thought" (CoT) to enhance visual reasoning in autonomous driving, moving away from traditional symbolic logic toward a more intuitive process of visual simulation and imagination [7][28]

Group 1: Methodology and Innovations
- FSDrive proposes a unified "visual intermediary" that replaces text or tabular mediators, effectively eliminating cross-modal semantic gaps [8]
- The method activates image generation on existing MLLMs at minimal cost by expanding the vocabulary to include visual tokens, avoiding major architectural changes or extensive retraining [8][19]
- A progressive visual CoT first produces coarse-grained perception maps (lane lines and 3D boxes), then gradually generates detailed future frames, explicitly injecting physical realism [8][19]

Group 2: Performance and Metrics
- FSDrive delivers competitive trajectory planning and scene understanding, with an average L2 error of 0.53 and a collision rate of 0.19, outperforming existing methods such as UniAD [29][22]
- Future-frame generation quality reaches an FID of 10.1 at 128×192 resolution, surpassing many diffusion-based world models [22]
- In scene understanding, FSDrive posts a final score of 0.57, exceeding other recent methods and showcasing the effectiveness of its unified pre-training [25]

Group 3: Practical Applications and Future Directions
- FSDrive keeps a simple end-to-end pipeline with interpretable visual reasoning while exploiting large amounts of unannotated video to learn world-evolution patterns [9]
- The framework is adaptable to mainstream MLLMs, indicating potential for broad application across the autonomous driving industry [20]
- Future developments may include expanding the model to predict a unified panoramic view while addressing safety, privacy, and regulatory compliance issues as the technology matures [30].
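Both summaries describe the same trick for activating image generation: appending discrete visual tokens (e.g. a VQ codebook) to the language model's vocabulary, so image patches become ordinary "words" the model can emit. A toy sketch of that vocabulary expansion; the sizes here (1,000 text tokens, a 512-entry visual codebook, 64-dim embeddings) are illustrative assumptions, not FSDrive's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MLLM embedding table: 1,000 text tokens, 64-dim embeddings.
text_vocab_size, dim = 1000, 64
embeddings = rng.normal(size=(text_vocab_size, dim))

# Append one row per discrete visual token (a 512-code VQ codebook),
# extending the vocabulary without touching the model architecture.
num_visual_tokens = 512
visual_rows = rng.normal(size=(num_visual_tokens, dim))
embeddings = np.concatenate([embeddings, visual_rows], axis=0)

def visual_token_id(k):
    """Visual code k maps to token id text_vocab_size + k."""
    return text_vocab_size + k

print(embeddings.shape)    # (1512, 64)
print(visual_token_id(0))  # 1000
```

In practice the new rows would be trained (and a decoder would map emitted visual tokens back to pixels), but the key point the articles make is architectural: only the vocabulary and embedding table grow, so the base MLLM is reused as-is.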