Unified Pre-training for Visual Generation and Understanding
NeurIPS'25 Spotlight! A New Paradigm for Autonomous Driving, FSDrive: VLA and a World Model Working in Tandem (Alibaba & Xi'an Jiaotong University)
自动驾驶之心· 2025-09-21 23:32
Core Insights
- The article presents a spatio-temporal Chain-of-Thought (CoT) reasoning method that lets Vision-Language Models (VLMs) in autonomous driving reason visually rather than through symbolic logic alone [1][4][24]
- It also introduces a unified pre-training paradigm that adds visual generation capability to VLMs while preserving their semantic understanding [6][24]

Summary by Sections

Introduction
- Multi-modal large language models (MLLMs) have shown strong knowledge and reasoning ability, which has driven their adoption in autonomous driving [4]
- The end-to-end Vision-Language-Action (VLA) model simplifies the system architecture and reduces information loss by generating vehicle control commands directly from visual observations and language instructions [4]

Methodology
- The spatio-temporal CoT lets the VLM "think in images": it generates a unified image frame that predicts the future scene, encoding both spatial structure and temporal evolution, and uses that frame as the intermediate reasoning step for trajectory planning [5][11] (a minimal sketch of this two-stage flow follows the summary)
- The method injects visual cues and physical constraints that steer the model's attention toward drivable areas and key objects, improving trajectory planning [5][16]

Pre-training Paradigm
- A new pre-training approach unifies visual understanding and generation, enabling the VLM to predict future frames that respect physical laws [6][12] (a toy version of the shared objective is sketched after the summary)
- Generation proceeds progressively: the model first predicts coarse-grained visual cues and only then produces the detailed future frame, which helps keep the prediction physically plausible [15][24]

Experimental Results
- Extensive experiments validate the effectiveness of the FSDrive framework on trajectory planning, future-frame generation, and scene understanding, marking a step toward visual reasoning in autonomous driving [11][24]

Conclusion
- FSDrive establishes an end-to-end visual reasoning pipeline that unifies future scene generation with perception results, bridging the semantic gap introduced by cross-modal conversion [24]
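
Below is a minimal, hypothetical sketch of the two-stage spatio-temporal CoT flow summarized above. The class and function names (FSDriveVLMStub, generate_future_frame, plan_trajectory) are illustrative placeholders rather than the authors' released API; the sketch only fixes the reasoning order: imagine a unified future frame first, then plan the trajectory conditioned on it.

```python
# Hypothetical sketch of spatio-temporal CoT inference: (1) generate a unified
# future image frame as the intermediate reasoning step, (2) plan the trajectory
# conditioned on both the observed history and the imagined future.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    """Placeholder for an RGB frame encoded as discrete visual tokens."""
    tokens: List[int]


class FSDriveVLMStub:
    """Hypothetical stand-in for a VLM with a unified text/visual vocabulary."""

    def generate_future_frame(self, history: List[Frame], instruction: str) -> Frame:
        # Would autoregressively emit visual tokens for the predicted future
        # scene (drivable area, key objects). Stubbed here.
        return Frame(tokens=[0] * 256)

    def plan_trajectory(self, history: List[Frame], future: Frame,
                        instruction: str) -> List[Tuple[float, float]]:
        # Would decode (x, y) waypoints conditioned on observed and imagined frames.
        return [(0.0, float(t)) for t in range(6)]


def drive_step(model: FSDriveVLMStub, history: List[Frame], instruction: str):
    # Spatio-temporal CoT: imagine the future scene first, then plan within it.
    future = model.generate_future_frame(history, instruction)
    return model.plan_trajectory(history, future, instruction)


if __name__ == "__main__":
    waypoints = drive_step(FSDriveVLMStub(), [Frame(tokens=[0] * 256)], "drive straight")
    print(waypoints)
```

The point of the ordering is that the planner never conditions on the raw history alone; it always sees an explicitly imagined future scene, which is what distinguishes this visual CoT from purely textual reasoning.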
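
The unified pre-training idea can likewise be illustrated with a toy objective: one next-token loss shared by understanding sequences (text) and generation sequences (visual tokens), with coarse cue tokens ordered before the fine future-frame tokens. The vocabulary sizes, the GRU stand-in for the transformer, and the random data below are assumptions for illustration only, not the paper's configuration.

```python
# Toy sketch of a unified understanding + generation objective with
# coarse-to-fine ordering of visual tokens. Sizes and model are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, VISUAL_VOCAB = 1000, 512          # assumed vocabulary sizes
VOCAB = TEXT_VOCAB + VISUAL_VOCAB             # one unified vocabulary


class TinyUnifiedLM(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # stand-in for a transformer
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)


def build_generation_sequence(history, coarse_cues, future_frame):
    # Coarse-to-fine ordering: cue tokens (e.g. lane/object sketches) come
    # before the detailed future-frame tokens; all are offset into the
    # visual part of the unified vocabulary.
    return torch.cat([history, coarse_cues, future_frame]) + TEXT_VOCAB


def next_token_loss(model, seq: torch.Tensor) -> torch.Tensor:
    # Standard autoregressive next-token prediction over the whole sequence.
    logits = model(seq[:-1].unsqueeze(0))
    return F.cross_entropy(logits.squeeze(0), seq[1:])


model = TinyUnifiedLM()
understanding = torch.randint(0, TEXT_VOCAB, (32,))      # VQA-style text sequence
generation = build_generation_sequence(torch.randint(0, VISUAL_VOCAB, (16,)),
                                        torch.randint(0, VISUAL_VOCAB, (8,)),
                                        torch.randint(0, VISUAL_VOCAB, (16,)))
loss = next_token_loss(model, understanding) + next_token_loss(model, generation)
loss.backward()
print(float(loss))
```

Training both sequence types against the same head is what lets one vocabulary serve understanding and generation at once, while the coarse-before-fine token ordering mirrors the progressive generation described in the summary.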