The spatial understanding capability of VLA models is still far from fully tapped! A new attempt with OccVLA (Shanghai Qi Zhi Institute, Tsinghua, SJTU, et al.)
自动驾驶之心·2025-09-15 23:33

Core Insights
- The article discusses the limitations of existing multimodal large language models (MLLMs) in robust 3D spatial understanding, which is crucial for autonomous driving [3][4]
- It introduces OccVLA, a novel framework that integrates 3D occupancy representation into a unified multimodal reasoning process, enhancing the model's ability to learn fine-grained spatial structures from 2D visual inputs [3][9]

Group 1: Introduction and Challenges
- Recent advances in end-to-end autonomous driving have highlighted the gap between 2D and 3D perception, which limits the wider application of vision-language models (VLMs) in complex driving scenarios [4][5]
- Two main challenges are identified: the difficulty of constructing usable and effective 3D representations without expensive manual annotations, and the lack of large-scale 3D vision-language pre-training, which leads to the loss of fine-grained spatial details [5][8]

Group 2: OccVLA Framework
- OccVLA is designed to perform occupancy prediction, vision-language reasoning, and action generation simultaneously, addressing the sparsity of the occupancy representation and strengthening 3D understanding [9][18]
- The framework employs a cross-attention mechanism through which occupancy tokens receive visual features from the VLM's intermediate layers, allowing occupancy to be integrated into the reasoning process without additional computational overhead at inference (a minimal sketch of this idea follows the summary) [9][20]

Group 3: Performance and Contributions
- OccVLA has demonstrated superior performance across perception and planning tasks, achieving state-of-the-art results on the nuScenes dataset for trajectory planning and 3D visual question answering [10][11]
- The main contributions include the OccVLA framework itself, a cross-modal attention design that allows the occupancy prediction branch to be skipped during inference, and competitive results on trajectory planning tasks [11][36]

Group 4: Experimental Results
- The experiments use the nuScenes dataset, with 700 training scenes and 150 validation scenes, to evaluate 3D localization, target querying, and relational comparison [35][36]
- OccVLA's motion planning was compared against several baselines; with only camera input and occupancy information as supervision, it achieves the best performance among the compared methods, outperforming models that rely on more complex input data [37][38]

Group 5: Visual Question Answering
- The model was evaluated on the challenging NuScenes-QA benchmark, demonstrating that it learns 3D understanding from purely visual input and surpasses larger models that depend on LiDAR data or explicit ground-truth occupancy [41][42]
- The results indicate that OccVLA effectively leverages occupancy supervision to strengthen its 3D reasoning capabilities in autonomous driving scenarios [41][45]
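To make the Group 2 description concrete, below is a minimal PyTorch sketch of the general idea as summarized here: learnable occupancy query tokens cross-attend to intermediate VLM hidden states, an auxiliary head produces occupancy logits for supervision during training, and the whole branch is simply not called at inference. This is not the authors' implementation; the module names, token counts, and dimensions are illustrative assumptions.

```python
# Hedged sketch (not the paper's code): occupancy queries cross-attending to
# intermediate VLM features, with an auxiliary occupancy head used only for
# training supervision. All shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn


class OccupancyBranch(nn.Module):
    def __init__(self, hidden_dim=1024, num_queries=256, num_classes=17):
        super().__init__()
        # Learnable occupancy query tokens (count is illustrative).
        self.occ_queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        # Cross-attention: queries attend to intermediate VLM visual features.
        self.cross_attn = nn.MultiheadAttention(
            hidden_dim, num_heads=8, batch_first=True)
        # Lightweight head mapping each query to occupancy class logits,
        # supervised by pseudo occupancy labels during training only.
        self.occ_head = nn.Linear(hidden_dim, num_classes)

    def forward(self, vlm_hidden_states):
        # vlm_hidden_states: (B, N_tokens, hidden_dim) taken from an
        # intermediate layer of the VLM.
        b = vlm_hidden_states.size(0)
        q = self.occ_queries.unsqueeze(0).expand(b, -1, -1)
        occ_tokens, _ = self.cross_attn(q, vlm_hidden_states, vlm_hidden_states)
        occ_logits = self.occ_head(occ_tokens)
        return occ_tokens, occ_logits


# Training: the occupancy logits feed an auxiliary loss alongside the VLM's
# language/planning loss. Inference: the branch is skipped entirely, so the
# VLM answers questions or plans trajectories with no extra decoding cost.
if __name__ == "__main__":
    branch = OccupancyBranch()
    hidden = torch.randn(2, 576, 1024)   # stand-in for intermediate VLM features
    tokens, logits = branch(hidden)       # called only during training
    print(tokens.shape, logits.shape)     # (2, 256, 1024) (2, 256, 17)
```

The design choice this sketch illustrates is that occupancy acts as implicit supervision rather than an explicit input: the 3D signal shapes the VLM's intermediate features during training, so nothing occupancy-related needs to be computed at inference time.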