HKU-Led DrivePI: A Spatially Intelligent 4D MLLM Unifying Understanding, Perception, Prediction, and Planning for Autonomous Driving
自动驾驶之心·2025-12-22 09:20

Core Viewpoint
- DrivePI is introduced as a novel unified spatially aware 4D multimodal large language model (MLLM) framework that integrates coarse-grained language understanding with fine-grained 3D perception, bridging the gap between vision-based and VLA paradigms in autonomous driving [2][38].

Group 1: Project Overview
- DrivePI is led by The University of Hong Kong, with contributions from companies such as Huawei and universities including Tianjin University and Huazhong University of Science and Technology [2].
- The model performs spatial understanding, 3D perception, prediction, and planning through end-to-end optimization, demonstrating its capability to handle complex autonomous driving scenarios [4][6].

Group 2: Technical Innovations
- DrivePI adopts a multimodal perception approach, fusing LiDAR with camera images to enhance spatial understanding and provide accurate 3D geometric information [11].
- The model generates intermediate fine-grained 3D perception and prediction representations, ensuring reliable spatial awareness and improving the interpretability and safety of autonomous driving systems [11].
- A rich data engine seamlessly integrates 3D occupancy and flow representations into natural-language scene descriptions, allowing the model to understand complex spatiotemporal dynamics [11].

Group 3: Performance Metrics
- DrivePI outperforms existing VLA models, achieving 2.5% higher average accuracy on nuScenes-QA than OpenDriveVLA-7B and cutting the collision rate by roughly 70%, from 0.37% to 0.11% [5][16].
- In 3D occupancy and flow prediction, DrivePI achieves 49.3% OccScore and 49.3% RayIoU, surpassing the FB-OCC method by 10.3 percentage points [15][21].
- The model reduces L2 trajectory-planning error by 32% compared to VAD, demonstrating its effectiveness in planning tasks [16].
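The relative improvements quoted above can be sanity-checked with simple arithmetic. The sketch below only verifies the percentages from the figures reported in this summary; it does not reproduce any benchmark.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when a metric drops from `before` to `after`."""
    return (before - after) / before

# Collision rate falling from 0.37% to 0.11% is about a 70% reduction.
collision_cut = relative_reduction(0.37, 0.11)
print(f"collision rate reduction: {collision_cut:.1%}")  # ~70.3%
```

The same formula applied to the L2 planning error should recover the reported 32% reduction versus VAD, given the absolute errors from the paper.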
Group 4: Data Engine and Annotation
- The data engine for DrivePI operates in three main stages, focusing on generating diverse question-answer pairs for 4D spatial understanding and planning reasoning [12][18].
- Scene-understanding annotations are generated to avoid confusion when distinguishing different views, enhancing the model's ability to interpret various perspectives [18].

Group 5: Ablation Studies and Insights
- Ablation studies indicate that combining text and visual heads improves performance across most tasks, demonstrating the effectiveness of unifying text understanding with 3D perception, prediction, and planning [23].
- Experiments with different text-data scales reveal significant improvements in occupancy-state prediction accuracy as the training data size increases [26].

Group 6: Future Prospects
- DrivePI is expected to inspire future research directions in autonomous driving by enhancing the interpretability and decision-making capabilities of systems through language reasoning and detailed 3D outputs [38].
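To make the data engine's occupancy-and-flow-to-language idea concrete, here is a minimal hypothetical sketch of one annotation step: summarizing a voxelized occupancy grid plus a per-voxel flow field as scene sentences. The class names, grid shape, and speed threshold are illustrative assumptions, not DrivePI's actual pipeline.

```python
import numpy as np

# Illustrative class IDs; DrivePI's real taxonomy may differ.
CLASS_NAMES = {0: "free", 1: "car", 2: "pedestrian"}

def describe_scene(occupancy: np.ndarray, flow: np.ndarray) -> list[str]:
    """Emit one sentence per occupied class, noting whether it is moving."""
    sentences = []
    for cls_id, name in CLASS_NAMES.items():
        if cls_id == 0:  # skip free space
            continue
        mask = occupancy == cls_id
        if not mask.any():
            continue
        # Mean flow magnitude over the class's voxels decides moving vs static
        # (0.5 m/s is an assumed threshold).
        speed = float(np.linalg.norm(flow[mask], axis=-1).mean())
        state = "moving" if speed > 0.5 else "static"
        sentences.append(f"There is a {state} {name} in the scene.")
    return sentences

# Toy 4x4x1 grid: one moving car, one static pedestrian.
occ = np.zeros((4, 4, 1), dtype=int)
occ[0, 0, 0] = 1
occ[3, 3, 0] = 2
flw = np.zeros((4, 4, 1, 3))
flw[0, 0, 0] = [2.0, 0.0, 0.0]
print(describe_scene(occ, flw))
```

Descriptions produced this way could then be templated into the kind of 4D question-answer pairs the three-stage engine generates.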