FSDrive
FSDrive Unifies VLA and World Models, Advancing Autonomous Driving Toward Visual Reasoning
36Kr · 2025-09-30 10:36
Core Insights
- FSDrive introduces a "Spatio-Temporal Chain-of-Thought" (CoT) that lets models reason directly with images, addressing the limitations of existing methods that rely heavily on symbolic representations [1][4][17]

Group 1: Methodology
- The proposed method uses a unified future image frame as an intermediate reasoning step, integrating future scenarios and perception results for visual reasoning [4][17]
- FSDrive activates image-generation capability in existing Multi-modal Large Language Models (MLLMs) by expanding the vocabulary with visual tokens, avoiding major architectural changes (a minimal sketch follows this summary) [5][17]
- The approach employs a progressive visual CoT, starting from coarse-grained perception maps (lane lines and 3D boxes) and gradually refining them into detailed future frames, explicitly incorporating physical constraints [5][8]

Group 2: Performance Metrics
- FSDrive demonstrates superior performance in trajectory planning, achieving a lower average L2 error (0.53 vs 0.70) and collision rate (0.19 vs 0.21) compared to Doe-1 [9]
- For future-frame generation quality, FSDrive achieves an FID of 10.1, outperforming many diffusion-based world models while remaining real-time capable [11]
- The model also shows strong results in scene understanding, with a final score of 0.57, surpassing competitors such as OmniDrive [14]

Group 3: Applications and Implications
- FSDrive's dual role as a "world model" for future-frame generation and as an "inverse dynamics model" for trajectory planning enhances its interpretability and decision-making capability [8][16]
- The framework's ability to reduce potential collisions through visual reasoning reflects its practical applicability to real-world autonomous driving scenarios [16][17]
- The method's efficiency allows significant savings in data and computational cost, making it a competitive option in the evolving landscape of autonomous driving technologies [17]
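The vocabulary-expansion idea described above can be illustrated with a short sketch. This is not FSDrive's released code; it assumes a Hugging Face-style causal MLLM backbone and a pre-trained VQ image tokenizer with an 8192-entry codebook, and all names (the model path, the `<img_i>` placeholder tokens) are illustrative.

```python
# Minimal sketch: extend an MLLM's text vocabulary with discrete visual tokens
# so the same autoregressive head can also emit image tokens (assumed setup,
# not FSDrive's actual implementation).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "path/to/mllm-backbone"  # illustrative; any causal MLLM checkpoint
codebook_size = 8192                  # size of the assumed VQ image codebook

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One new token per VQ codebook entry, plus begin/end-of-image markers.
visual_tokens = [f"<img_{i}>" for i in range(codebook_size)]
num_added = tokenizer.add_tokens(visual_tokens + ["<boi>", "<eoi>"])

# Grow the input/output embedding matrices to cover the new tokens; the rest
# of the architecture stays untouched, which is the point of the approach.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} visual tokens; new vocab size = {len(tokenizer)}")
```

After this step, future frames encoded by the VQ tokenizer become ordinary token sequences, so image generation can be trained with the same next-token objective as text.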
NeurIPS 2025 Spotlight | FSDrive Unifies VLA and World Models, Advancing Autonomous Driving Toward Visual Reasoning
机器之心 · 2025-09-30 08:45
Core Insights
- The article introduces FSDrive, a novel approach that uses a "Spatio-Temporal Chain-of-Thought" (CoT) to enhance visual reasoning in autonomous driving, moving away from traditional symbolic logic toward a more intuitive process of visual simulation and imagination [7][28]

Group 1: Methodology and Innovations
- FSDrive proposes a unified "visual intermediary" that replaces text or tabular mediators, effectively eliminating cross-modal semantic gaps [8]
- The method activates image-generation capability on existing Multi-modal Large Language Models (MLLMs) at minimal cost by expanding the vocabulary with visual tokens, avoiding major architectural changes or extensive retraining [8][19]
- A progressive visual CoT is employed, starting from coarse-grained perception maps (lane lines and 3D boxes) and gradually generating detailed future frames, explicitly injecting physical realism [8][19]

Group 2: Performance and Metrics
- FSDrive demonstrates competitive performance in trajectory planning and scene understanding, achieving an average L2 error of 0.53 and a collision rate of 0.19, outperforming existing methods such as UniAD (a metric sketch follows this summary) [29][22]
- Future-frame generation quality is indicated by an FID of 10.1 at a resolution of 128×192, surpassing many diffusion-based world models [22]
- In scene-understanding tasks, FSDrive achieves a final score of 0.57, exceeding other recent methods and showcasing the effectiveness of its unified pre-training approach [25]

Group 3: Practical Applications and Future Directions
- FSDrive maintains a simple end-to-end pipeline and interpretable visual reasoning while leveraging large amounts of unannotated video data to learn world-evolution patterns [9]
- The framework is adaptable to mainstream MLLMs, indicating its potential for broad application in the autonomous driving industry [20]
- Future work may extend the model to predict a unified panoramic view while addressing safety, privacy, and regulatory-compliance issues as the technology matures [30]
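For context on the planning numbers above, open-loop benchmarks typically score the average L2 distance between the predicted ego trajectory and the recorded ground-truth trajectory at fixed horizons (commonly 1 s, 2 s, and 3 s). The snippet below is a minimal, self-contained sketch of that metric under those assumptions; it is not the paper's evaluation code, and the sampling rate, array shapes, and horizon indices are illustrative.

```python
import numpy as np

def average_l2_error(pred_traj: np.ndarray, gt_traj: np.ndarray,
                     horizons=(2, 4, 6)) -> float:
    """Average L2 distance between predicted and ground-truth ego waypoints.

    pred_traj, gt_traj: (T, 2) arrays of BEV (x, y) waypoints, assumed to be
    sampled at 2 Hz, so indices 2 / 4 / 6 correspond to 1 s / 2 s / 3 s.
    """
    errors = [np.linalg.norm(pred_traj[t - 1] - gt_traj[t - 1]) for t in horizons]
    return float(np.mean(errors))

# Toy example with a 3 s trajectory of 6 waypoints.
pred = np.array([[0.9, 0.0], [1.8, 0.1], [2.8, 0.1],
                 [3.7, 0.2], [4.6, 0.3], [5.6, 0.4]])
gt = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
               [4.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
print(f"avg L2 = {average_l2_error(pred, gt):.2f}")
```

The collision rate is computed analogously by checking, at each horizon, whether the planned ego footprint overlaps any annotated obstacle, then averaging across the evaluation set.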
NeurIPS'25 Spotlight! A New Autonomous Driving Paradigm, FSDrive: VLA + World Model in Tandem (Alibaba & Xi'an Jiaotong University)
自动驾驶之心 · 2025-09-21 23:32
Core Insights
- The article discusses a spatio-temporal Chain-of-Thought (CoT) reasoning method for Vision-Language Models (VLMs) in autonomous driving, emphasizing the need for visual reasoning rather than symbolic logic [1][4][24]
- It introduces a unified pre-training paradigm that enhances the visual generation capability of VLMs while maintaining their semantic understanding [6][24]

Summary by Sections

Introduction
- Multi-modal large language models (MLLMs) have shown exceptional performance in knowledge and reasoning, leading to their application in autonomous driving [4]
- The end-to-end Vision-Language-Action (VLA) model simplifies the system architecture and minimizes information loss by generating vehicle control commands directly from visual observations and language instructions [4]

Methodology
- The spatio-temporal CoT method allows VLMs to visualize and plan trajectories by generating unified image frames that predict future states, incorporating spatial and temporal relationships [5][11]
- The proposed method integrates visual cues and physical constraints to guide the model's attention toward drivable areas and key objects, enhancing trajectory planning [5][16]

Pre-training Paradigm
- A new pre-training approach combines visual understanding and generation, allowing VLMs to predict future frames while adhering to physical laws [6][12]
- The gradual image-generation scheme has the model first predict coarse-grained visual cues before generating detailed future frames, maintaining physical realism (a simplified inference sketch follows this summary) [15][24]

Experimental Results
- Extensive experiments validate the effectiveness of the FSDrive framework in trajectory planning, future-frame generation, and scene understanding, demonstrating its advance toward visual reasoning in autonomous driving [11][24]

Conclusion
- FSDrive establishes an end-to-end visual reasoning pipeline that unifies future scene generation and perception results, effectively bridging the semantic gap caused by cross-modal conversion [24]
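The coarse-to-fine spatio-temporal CoT described above can be pictured as a two-stage generation loop followed by trajectory decoding. The sketch below is a hypothetical reading of that flow, not the authors' code: the stub model and every method name (`generate_tokens`, `decode_trajectory`) stand in for what a concrete MLLM implementation would provide.

```python
# Hypothetical coarse-to-fine spatio-temporal CoT inference loop (a sketch).
# One unified MLLM first "imagines" coarse perception cues (lane lines, 3D
# boxes), then a detailed future frame, and finally decodes the ego
# trajectory conditioned on both imagined results.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CoTResult:
    coarse_tokens: List[int]            # visual tokens of the coarse perception map
    future_tokens: List[int]            # visual tokens of the detailed future frame
    trajectory: List[Tuple[float, float]]  # planned (x, y) waypoints


class StubMLLM:
    """Stand-in for a unified MLLM that emits both text and visual tokens."""

    def generate_tokens(self, context: List[int], prompt: str) -> List[int]:
        # A real model would autoregressively sample visual tokens here.
        return [hash((prompt, i)) % 8192 for i in range(16)]

    def decode_trajectory(self, context: List[int]) -> List[Tuple[float, float]]:
        # A real model would emit numeric waypoints as text tokens.
        return [(0.9 * (i + 1), 0.0) for i in range(6)]


def spatio_temporal_cot(mllm, context: List[int]) -> CoTResult:
    # Coarse stage: lane lines + 3D boxes, injecting layout/physical constraints.
    coarse = mllm.generate_tokens(context, prompt="<coarse_future>")
    # Fine stage: detailed future frame conditioned on the coarse map
    # (the "world model" role).
    future = mllm.generate_tokens(context + coarse, prompt="<future_frame>")
    # Planning: trajectory from observation + imagined future
    # (the "inverse dynamics model" role).
    traj = mllm.decode_trajectory(context + coarse + future)
    return CoTResult(coarse, future, traj)


if __name__ == "__main__":
    result = spatio_temporal_cot(StubMLLM(), context=list(range(32)))
    print(len(result.coarse_tokens), len(result.future_tokens), result.trajectory[:2])
```

Keeping the coarse stage first means the detailed future frame is generated under explicit layout constraints, which is how the approach injects physical plausibility before planning.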