Core Insights
- The article emphasizes the transformative potential of Vision-Language Models (VLMs) in enhancing the perception and cognitive capabilities of autonomous driving systems, enabling them to not only "see" but also "understand" complex driving environments [2][3].

Group 1: VLM Applications in Autonomous Driving
- VLMs can surpass traditional visual models by integrating camera images or video streams to comprehend semantic information in traffic scenes, such as recognizing complex scenarios like "a pedestrian waving to cross the street" [6].
- VLMs convert intricate visual scenes into clear natural language descriptions, enhancing the interpretability of decisions made by autonomous systems; this aids debugging and increases trust among passengers and regulators [6].
- VLMs are crucial for natural language interaction in future smart cabins, allowing passengers to communicate intentions to the vehicle through spoken commands [6].

Group 2: Scenario Generation and Testing
- CrashAgent is a multi-agent framework that uses multi-modal large language models to convert accident reports into structured scenarios for simulation environments, addressing the long-tail distribution issue in existing datasets [7].
- CurricuVLM is a personalized curriculum learning framework that leverages VLMs to analyze agent behavior and dynamically generate tailored training scenarios, improving safety in autonomous driving [13].
- TRACE is a framework that generates key test cases from real accident reports, significantly improving the efficiency of defect detection in autonomous driving systems [17].

Group 3: Out-of-Distribution (OOD) Scenario Generation
- A framework built on large language models generates diverse OOD driving scenarios, addressing the sparsity of such scenarios in urban driving datasets [21][22].
- The article also discusses a method to automatically convert real-world driving videos into detailed simulation scenarios, enhancing the testing of autonomous driving systems [26].

Group 4: Enhancing Safety and Robustness
- WEDGE is a synthetic dataset created with generative vision-language models, aimed at improving the robustness of perception systems in extreme weather conditions [39][40].
- LKAlert is a predictive alert system that uses VLMs to forecast potential lane-keeping assist (LKA) risks, enhancing driver situational awareness and trust [54][55].

Group 5: Advancements in Decision-Making Frameworks
- The CBR-LLM framework combines semantic scene understanding with case retrieval to enhance decision-making in complex driving scenarios, improving accuracy and reasoning consistency [44][45].
- ORION is a holistic end-to-end autonomous driving framework that integrates visual-language instructed action generation, achieving superior performance in closed-loop evaluations [69][70].
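The scenario-generation work summarized in Groups 2 and 3 shares a common step: turning unstructured accident text into machine-readable scenario parameters that a simulator can consume. A minimal sketch of that step follows, with simple keyword matching standing in for the multi-modal LLM that a CrashAgent-style pipeline would actually use; every field name and matching rule here is an assumption for illustration, not the paper's schema.

```python
# Illustrative sketch only (NOT the CrashAgent implementation): map salient
# phrases in a free-text accident report to structured scenario fields.
from dataclasses import dataclass, field

@dataclass
class Scenario:
    weather: str = "clear"
    actors: list = field(default_factory=list)
    event: str = "unknown"

# Assumed vocabularies; a real pipeline would delegate this to an LLM.
WEATHER_TERMS = {"rain": "rainy", "fog": "foggy", "snow": "snowy"}
ACTOR_TERMS = {"pedestrian", "cyclist", "truck", "vehicle"}
EVENT_TERMS = {"rear-end": "rear_end_collision",
               "ran a red light": "red_light_violation",
               "crossed": "jaywalking"}

def report_to_scenario(report: str) -> Scenario:
    """Extract weather, actors, and event type from an accident report."""
    text = report.lower()
    sc = Scenario()
    for term, label in WEATHER_TERMS.items():
        if term in text:
            sc.weather = label
    for actor in sorted(ACTOR_TERMS):
        if actor in text:
            sc.actors.append(actor)
    for phrase, label in EVENT_TERMS.items():
        if phrase in text:
            sc.event = label
    return sc

demo = report_to_scenario(
    "In heavy rain, a pedestrian crossed mid-block and the vehicle braked late.")
print(demo.weather, demo.actors, demo.event)
# → rainy ['pedestrian', 'vehicle'] jaywalking
```

The resulting `Scenario` record is the kind of structured artifact that can then be replayed in a simulator, which is what lets long-tail accident cases become repeatable test cases.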
Large-model approaches for autonomous driving: an overview of Vision-Language Model (VLM) work, for mass production and research
自动驾驶之心·2025-08-06 23:34