Fudan's Latest LMAD: Toward Explainable End-to-End VLMs
自动驾驶之心 · 2025-08-19 23:32
Core Viewpoint
The article discusses the LMAD framework, which significantly enhances the reasoning performance of vision-language models (VLMs) in autonomous driving by addressing existing limitations in scene understanding and spatial perception [2][3].

Existing Method Limitations
Current VLM-based autonomous driving methods face two key issues: fragmented scene understanding, which relies on intermediate results and fails to capture relationships between traffic elements, and weak spatial and motion perception, which leads to accumulated errors during inference [4].

Innovations of LMAD
The LMAD framework introduces several core innovations:
- A Preliminary Interaction (PI) mechanism that models initial relationships among traffic participants, reducing the learning complexity for VLMs [6].
- A task-specific expert structure using parallel LoRA (P-LoRA) modules that focuses the VLM on specific tasks such as perception, prediction, and planning [6].
- End-to-end system integration that incorporates prior knowledge from end-to-end driving systems to enrich spatial and motion information for improved reasoning [6].

Overall Framework
LMAD integrates an end-to-end driving pipeline with a vision-language model and consists of three main components: the VLM itself for image and text token processing, a PI encoder for multi-view image handling, and P-LoRA modules for task-specific knowledge integration [8][10].

Key Module Design
- The PI encoder addresses redundancy in multi-view image processing by employing decoupled queries and an alternating attention mechanism (see the attention sketch after this summary) [12][15].
- The P-LoRA design attaches multiple parallel branches, one per driving task, enhancing adaptability (a minimal layer sketch follows below) [16].

Training Strategy
The training strategy has two stages: single-branch fine-tuning, where only the language branch is adjusted, followed by joint training, which optimizes text generation and the end-to-end driving tasks simultaneously (illustrated in the loss sketch at the end) [18].

Experimental Results
- On the DriveLM benchmark, LMAD significantly improved the performance of baseline VLMs, with accuracy increases of 3.44% for LLaMA-Adapter and 3.89% for GPT [20].
- On the nuScenes-QA test, LMAD achieved an overall accuracy improvement of 2.57% over the baseline [25].

Ablation Studies
- Ablations confirmed the effectiveness of the PI encoder, P-LoRA, and end-to-end tokens, with the full configuration yielding the highest final score of 57.17 [28].
- The task-oriented P-LoRA design outperformed other configurations across the reported metrics [28].

Qualitative Analysis
- In perception tasks, LMAD accurately identified key targets, although it struggled with less salient signs [34].
- In prediction tasks, LMAD effectively informed subsequent planning despite discrepancies between predicted and actual targets [34].
- In planning tasks, LMAD produced driving behaviors consistent with the current environment by leveraging historical context [34].
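To make the PI encoder's "decoupled query plus alternating attention" idea concrete, here is a minimal PyTorch sketch of one such block. The class name, the per-view averaging step, and the intra-view-then-cross-view schedule are illustrative assumptions; the article only states that the PI encoder uses decoupled queries and alternating attention over multi-view images.

```python
import torch
import torch.nn as nn

class PIEncoderBlock(nn.Module):
    """One block of a preliminary-interaction encoder: learnable queries
    alternate between attending inside each camera view and attending
    across all views jointly. Hypothetical sketch, not the paper's code."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.intra_view = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, view_feats: torch.Tensor) -> torch.Tensor:
        # queries: (B, Q, C); view_feats: (B, V, N, C) multi-view image tokens
        B, V, N, C = view_feats.shape
        # Step 1: each query attends within every view separately, then the
        # per-view results are averaged, keeping per-view structure instead
        # of flattening all tokens at once.
        per_view = []
        for v in range(V):
            attn_out, _ = self.intra_view(queries, view_feats[:, v], view_feats[:, v])
            per_view.append(attn_out)
        queries = self.norm1(queries + torch.stack(per_view, dim=1).mean(dim=1))
        # Step 2: the same queries attend over all views jointly, modeling
        # cross-view relationships among traffic elements.
        flat = view_feats.reshape(B, V * N, C)
        attn_out, _ = self.cross_view(queries, flat, flat)
        return self.norm2(queries + attn_out)

# Example shapes: 6 camera views of 196 tokens each, 32 interaction queries.
block = PIEncoderBlock(dim=256, num_heads=8)
out = block(torch.randn(2, 32, 256), torch.randn(2, 6, 196, 256))  # (2, 32, 256)
```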
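The parallel-LoRA structure also lends itself to a compact sketch: a frozen linear projection plus one low-rank adapter pair per driving task, with the forward pass routed by a task id. Everything here (class name, rank, scaling, the routing-by-integer interface) is an assumption for illustration; the summary only specifies parallel task-specific LoRA branches on a shared backbone.

```python
import torch
import torch.nn as nn

class ParallelLoRALinear(nn.Module):
    """A frozen linear layer augmented with parallel LoRA branches, one per
    driving task (e.g. perception / prediction / planning). Hypothetical
    sketch; the actual P-LoRA wiring in LMAD may differ."""
    def __init__(self, in_features: int, out_features: int,
                 num_tasks: int = 3, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.scaling = alpha / rank
        # One low-rank (A, B) pair per task; B starts at zero so each branch
        # initially contributes nothing, as in standard LoRA.
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, in_features) * 0.01) for _ in range(num_tasks)])
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_features, rank)) for _ in range(num_tasks)])

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Route the token stream through the branch matching the current task.
        delta = (x @ self.lora_A[task_id].T) @ self.lora_B[task_id].T
        return self.base(x) + self.scaling * delta

layer = ParallelLoRALinear(512, 512, num_tasks=3)
out = layer(torch.randn(2, 16, 512), task_id=2)  # e.g. the planning branch
```

Keeping the branches parallel rather than stacked means each task's adapter can be trained or swapped independently while the shared backbone remains untouched.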
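Finally, the joint stage of the training strategy can be read as a weighted multi-task objective. The loss names and the weight lambda_e2e below are placeholders, not values or terms from the paper.

```python
import torch

def joint_training_loss(lm_loss: torch.Tensor,
                        e2e_losses: dict,
                        lambda_e2e: float = 1.0) -> torch.Tensor:
    """Stage 2 (joint training): optimize text generation together with the
    end-to-end driving tasks. The weighting scheme is an assumption."""
    return lm_loss + lambda_e2e * sum(e2e_losses.values())

# Stage 1 (single-branch fine-tuning) would instead back-propagate lm_loss
# alone, with only the language-branch (P-LoRA) parameters left trainable.
```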