SGDrive
In Search of the Optimal World Model! SGDrive: A Hierarchical World Cognition Framework That Upgrades VLA (Li Auto, Fudan University, et al.)
自动驾驶之心· 2026-01-14 00:48
Core Insights
- The article discusses SGDrive, a framework that integrates structured, hierarchical world knowledge into Vision-Language Models (VLMs) to improve the safety and reliability of autonomous driving [3][52].

Group 1: Background and Motivation
- End-to-end (E2E) autonomous driving has advanced considerably, evolving from UniAD to SparseDrive, but existing methods often lack explicit causal reasoning and high-level scene understanding [6][12].
- The emergence of Large Language Models (LLMs) and VLMs has prompted researchers to bring their rich prior knowledge and complex reasoning capabilities into driving tasks, to address these shortcomings of traditional E2E methods [6][12].

Group 2: SGDrive Framework
- SGDrive proposes a hierarchical world cognition framework that decomposes driving understanding into a scene-agent-goal structure, mirroring how human drivers reason about the road [3][15].
- The framework strengthens the VLM's 3D spatial perception by explicitly activating its ability to perceive and represent structured world knowledge, which is crucial for trajectory generation and collision avoidance [3][15].

Group 3: Methodology
- The approach is formulated as two complementary sub-problems: extracting representative world knowledge and predicting future world states [16].
- A set of special query tokens is introduced to steer the model's attention toward driving-relevant knowledge and to predict how that knowledge evolves over time [17][20].

Group 4: Experimental Results
- SGDrive achieved state-of-the-art (SOTA) performance on the NAVSIM benchmark, surpassing both larger general-purpose VLMs and previous leading driving VLM methods, which demonstrates the effectiveness of hierarchical world knowledge learning [40][41].
- The model outperformed existing methods on key collision-related metrics, supporting the hypothesis that explicitly predicting spatiotemporal layouts and dynamic agent interactions improves safety [40][41].
Group 5: Ablation Studies
- Ablation studies indicate that the hierarchical world representation significantly improves the model's understanding of the 3D driving environment, leading to more accurate trajectory predictions [42].
- The structured attention mechanism prevents information leakage and cross-category noise between the different kinds of world knowledge, yielding cleaner, more task-specific embeddings [45].
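
The summary does not spell out how the special query tokens and the structured attention mechanism are implemented. The sketch below is one plausible reading, not the authors' code. It assumes a token layout of `[context | scene queries | agent queries | goal queries]`, and it assumes each query group may attend only to the context tokens and to its own group, so that no information passes between categories. The function names, group sizes, and this masking rule are all hypothetical.

```python
import numpy as np

def build_structured_mask(n_context, group_sizes):
    """Build a boolean mask; mask[i, j] is True if token i may attend to token j.

    Assumed (hypothetical) layout: context tokens first, then one block of
    query tokens per category, in the order given by `group_sizes`.
    """
    total = n_context + sum(group_sizes.values())
    mask = np.zeros((total, total), dtype=bool)
    # Context tokens attend only to each other, never to query tokens.
    mask[:n_context, :n_context] = True
    spans = {}
    start = n_context
    for name, size in group_sizes.items():
        end = start + size
        mask[start:end, :n_context] = True   # queries read the context
        mask[start:end, start:end] = True    # queries read their own group
        spans[name] = (start, end)
        start = end
    return mask, spans

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # block disallowed pairs
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy usage: 4 context tokens plus scene/agent/goal query tokens.
mask, spans = build_structured_mask(4, {"scene": 2, "agent": 3, "goal": 1})
rng = np.random.default_rng(0)
x = rng.normal(size=(mask.shape[0], 8))  # stand-in token embeddings
out, weights = masked_attention(x, x, x, mask)
```

Under this masking rule, the attention weight from any scene query to any agent or goal query is exactly zero, which is one simple way to obtain the "no cross-category noise" property described in the ablation bullet above.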