Some Recent Reflections on Working on VLA
自动驾驶之心 · 2025-12-11 00:05
Core Insights

The article discusses the challenges and advances in Vision-Language Models (VLMs) for autonomous driving, focusing on hallucination, 3D spatial understanding, and inference speed [3].

Group 1: Challenges in VLMs

- Hallucination shows up in two ways: generating non-existent information and failing to perceive information that is actually present; dynamic perception techniques can mitigate both [3].
- Weak 3D spatial understanding stems from pre-training tasks that are predominantly 2D; adding spatial localization tasks during training helps [3].
- Inference speed is a bottleneck; candidate remedies include KV caching, visual token compression, and mixed-data training (a token-compression sketch follows this summary) [3].

Group 2: Learning Paradigms and Model Improvements

- The learning paradigm should shift from imitation learning (SFT) toward preference learning (DPO, GRPO), and training all tasks simultaneously yields better results than sequential single-task training (a DPO loss sketch appears below) [3].
- Mixing pre-training data back into fine-tuning is a simple and effective guard against catastrophic forgetting in the foundation model [3].
- Richer supervisory signals lead to better representations, obtained by attaching auxiliary task heads to the VLM (see the sketch below) [3].

Group 3: Interaction and Evaluation

- Current VLMs exhibit weak interaction between the vision and language streams, which limits their usefulness as base models; strengthening this interaction is crucial [3].
- The trajectory output format is flexible, and several approaches give satisfactory results, though industry favors diffusion heads for their speed (a toy diffusion head closes this summary) [3].
- Evaluation remains difficult because training and testing conditions are inconsistent; objectives and data distributions need better alignment [3].
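On the speed point in Group 1, below is a minimal sketch of visual token compression, assuming a ViT-style encoder that emits a [B, N, D] token sequence. The class name, stride, and projection layer are illustrative, not from the article; the idea is simply that fewer visual tokens mean a shorter sequence for the LLM and a smaller KV cache.

```python
import torch
import torch.nn as nn

class TokenPoolCompressor(nn.Module):
    """Compress N visual tokens to N // stride tokens by local average
    pooling, shrinking the sequence the LLM (and its KV cache) must handle.
    Hypothetical module for illustration only."""
    def __init__(self, dim: int, stride: int = 4):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(dim, dim)  # light projection after pooling

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = tokens.shape
        n = (n // self.stride) * self.stride           # drop ragged tail
        pooled = tokens[:, :n].reshape(b, -1, self.stride, d).mean(dim=2)
        return self.proj(pooled)

x = torch.randn(2, 576, 1024)              # e.g. a 24x24 patch grid
print(TokenPoolCompressor(1024)(x).shape)  # -> torch.Size([2, 144, 1024])
```

Real systems often use learned mergers or attention-based resamplers instead of plain pooling; the compression ratio is a latency/accuracy trade-off to tune.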
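For the SFT-to-preference-learning shift in Group 2, here is the standard DPO objective as a self-contained function. The article only names DPO/GRPO; the per-sequence log-probability inputs and beta value below are the usual formulation, not details it specifies.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: widen the policy's log-prob margin between preferred and
    rejected outputs, relative to a frozen reference model."""
    pi_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_margin - ref_margin)).mean()
```

For driving, "chosen" vs. "rejected" would be preference-ranked trajectories or answers; the reference model anchors the policy so it does not drift far from its SFT starting point.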
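The auxiliary-head bullet in Group 2 can be made concrete with a small wrapper. Everything here is a hypothetical sketch: the backbone stand-in, the choice of depth and detection heads, and the loss weights are assumptions layered on the article's one-line suggestion.

```python
import torch
import torch.nn as nn

class VLMWithAuxHeads(nn.Module):
    """Wrap a VLM trunk (assumed to return [B, T, D] hidden states) with
    extra supervised heads so richer signals shape the shared features."""
    def __init__(self, backbone: nn.Module, dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.depth_head = nn.Linear(dim, 1)          # per-token depth
        self.det_head = nn.Linear(dim, num_classes)  # per-token class logits

    def forward(self, x: torch.Tensor) -> dict:
        hidden = self.backbone(x)                    # [B, T, D]
        return {
            "hidden": hidden,
            "depth": self.depth_head(hidden),
            "det_logits": self.det_head(hidden),
        }

backbone = nn.Sequential(nn.Linear(32, 64))   # stand-in for a real VLM trunk
model = VLMWithAuxHeads(backbone, dim=64, num_classes=10)
out = model(torch.randn(2, 16, 32))           # fake token features
print(out["depth"].shape, out["det_logits"].shape)
```

Training would sum the language-modeling loss with weighted auxiliary losses, e.g. `lm_loss + w1 * depth_loss + w2 * det_loss`, where the weights are tuning knobs.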
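Finally, on the Group 3 note that industry prefers diffusion heads for trajectory output: the toy below shows the shape of the idea, a head that iteratively denoises waypoints conditioned on a VLM feature vector. The network, step count, and the crude Euler-style update are all simplifications of a real few-step diffusion sampler, which is where the speed advantage comes from.

```python
import torch
import torch.nn as nn

class DiffusionTrajHead(nn.Module):
    """Toy diffusion head: denoise a [B, H, 2] trajectory (H waypoints,
    x/y) from Gaussian noise, conditioned on a feature vector. A sketch,
    not a production sampler."""
    def __init__(self, horizon: int, cond_dim: int, steps: int = 10):
        super().__init__()
        self.horizon, self.steps = horizon, steps
        self.net = nn.Sequential(
            nn.Linear(horizon * 2 + cond_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),
        )

    @torch.no_grad()
    def sample(self, cond: torch.Tensor) -> torch.Tensor:
        b = cond.shape[0]
        traj = torch.randn(b, self.horizon * 2)       # start from noise
        for t in reversed(range(self.steps)):
            t_emb = torch.full((b, 1), t / self.steps)
            eps = self.net(torch.cat([traj, cond, t_emb], dim=-1))
            traj = traj - eps / self.steps            # crude denoising step
        return traj.view(b, self.horizon, 2)

head = DiffusionTrajHead(horizon=8, cond_dim=128)
print(head.sample(torch.randn(4, 128)).shape)  # -> torch.Size([4, 8, 2])
```

A ten-step (or fewer) sampler like this runs in a handful of small forward passes, which is why diffusion heads can beat autoregressive token-by-token trajectory decoding on latency.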