This Week's Selected Autonomous Driving Papers: End-to-End, VLA, Perception, Decision-Making, and More
自动驾驶之心· 2025-08-20 03:28
Core Viewpoint
- The article emphasizes recent advancements in autonomous driving research, highlighting innovative approaches and frameworks that enhance the capabilities of autonomous systems in dynamic environments [2][4].

Group 1: End-to-End Autonomous Driving
- The article discusses several notable papers focusing on end-to-end autonomous driving, including GMF-Drive, ME³-BEV, SpaRC-AD, IRL-VLA, and EvaDrive, which utilize techniques such as gated fusion, deep reinforcement learning, and evolutionary adversarial strategies [8][10].

Group 2: Perception and VLM
- The VISTA paper introduces a vision-language model for predicting driver attention in dynamic environments, showcasing the integration of visual and language processing for improved situational awareness [7].
- The article also mentions safety-critical perception work, such as a progressive BEV perception survey and the CBDES MoE model for functional module decoupling [10].

Group 3: Simulation Testing
- The article highlights the ReconDreamer-RL framework, which enhances reinforcement learning through diffusion-based scene reconstruction, indicating a trend toward more sophisticated simulation testing methodologies [11].

Group 4: Datasets
- The STRIDE-QA dataset is introduced as a large-scale visual question answering resource aimed at spatiotemporal reasoning in urban driving scenarios, reflecting the growing need for comprehensive datasets in autonomous driving research [12].
Fine-Tuning an Autonomous Driving VLM Based on the Open-Source Qwen2.5-VL
自动驾驶之心· 2025-08-08 16:04
Core Viewpoint
- The article discusses advancements in autonomous driving technology, focusing on the LLaMA Factory framework and the Qwen2.5-VL model, which enhance the capabilities of vision-language-action models for autonomous driving applications [4][5].

Group 1: LLaMA Factory Overview
- LLaMA Factory is an open-source, low-code framework for fine-tuning large models, popular in the open-source community with over 40,000 stars on GitHub [3].
- The framework integrates widely used fine-tuning techniques, making it suitable for developing autonomous driving assistants that can interpret traffic conditions through natural language [3].

Group 2: Qwen2.5-VL Model
- The Qwen2.5-VL model serves as the foundational model for the project, achieving significant breakthroughs in visual recognition, object localization, document parsing, and long-video understanding [4].
- It is offered in three model sizes; the flagship Qwen2.5-VL-72B performs comparably to advanced models such as GPT-4o and Claude 3.5 Sonnet, while the smaller versions excel in resource-constrained environments [4].

Group 3: CoVLA Dataset
- The CoVLA dataset, comprising 10,000 real driving scenes and over 80 hours of video, is used for training and evaluating vision-language-action models [5].
- This dataset surpasses existing datasets in scale and annotation richness, providing a comprehensive platform for developing safer and more reliable autonomous driving systems [5].

Group 4: Model Training and Testing
- Instructions for downloading and installing LLaMA Factory and the Qwen2.5-VL model are provided, including commands for setting up the environment and testing the model (a minimal inference sketch follows this summary) [6][7].
- The article details the process of fine-tuning the model using the SwanLab tool for visual tracking of the training process, emphasizing the importance of adjusting parameters to avoid out-of-memory issues [11][17].
- After training, the fine-tuned model demonstrates improved response quality in dialogue scenarios related to autonomous driving risks compared to the original model [19].
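As a concrete reference for the setup-and-testing step in Group 4, here is a minimal inference sketch using the Hugging Face transformers integration of Qwen2.5-VL. The 3B variant, the image path, and the prompt are placeholder assumptions for a quick smoke test before fine-tuning; the article's actual walkthrough goes through LLaMA Factory.

```python
# Minimal smoke test for Qwen2.5-VL via Hugging Face transformers.
# Assumes: a recent transformers release with Qwen2.5-VL support and the
# qwen-vl-utils helper package; "front_camera.jpg" is a placeholder image.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # smaller variant for limited GPU memory
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "front_camera.jpg"},
        {"type": "text", "text": "Describe the traffic situation and any risks for the ego vehicle."},
    ],
}]

# Build the chat prompt and collect the vision inputs referenced in `messages`.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The fine-tuning stage described in the article is then driven through LLaMA Factory's CLI (e.g., `llamafactory-cli train <config>.yaml`) with SwanLab enabled in the training configuration for experiment tracking; lowering the batch size, image resolution, or LoRA rank in that configuration is the usual lever for the memory issues the article warns about.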
SJTU & KargoBot's FastDrive: Structured Labels Make End-to-End Large Models Faster and Stronger
自动驾驶之心· 2025-06-23 11:34
Core Viewpoint
- The integration of human-like reasoning capabilities into end-to-end autonomous driving systems is a cutting-edge research area, with a focus on vision-language models (VLMs) [1].

Group 1: Structured Dataset and Model
- A structured dataset called NuScenes-S has been introduced; it focuses on key elements closely related to driving decisions, eliminating redundant information and improving reasoning efficiency [4][5].
- The FastDrive model, with 0.9 billion parameters, mimics human reasoning strategies and aligns effectively with end-to-end autonomous driving frameworks [4][5].

Group 2: Dataset Description
- The NuScenes-S dataset provides a comprehensive view of driving scenarios, addressing issues often overlooked in existing datasets. It includes key elements such as weather, traffic conditions, driving areas, traffic lights, traffic signs, road conditions, lane markings, and time [7][8].
- Dataset construction involved annotating scene information with both GPT and human input, refining the results through comparison and optimization [9].

Group 3: FastDrive Algorithm Model
- The FastDrive model follows the "ViT-Adapter-LLM" architecture, using a Vision Transformer for visual feature extraction and a token-packing module to enhance inference speed (a simplified architectural sketch follows this summary) [18][19].
- The model employs a large language model (LLM) to generate scene descriptions, identify key objects, predict future states, and make driving decisions in a chain-of-reasoning manner [19].

Group 4: Experimental Results
- Experiments on the NuScenes-S dataset, which contains 102,000 question-answer pairs, demonstrated that FastDrive achieves competitive performance in scene-understanding tasks [21].
- FastDrive showed strong results across perception, prediction, and decision-making tasks, outperforming other models [25].
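To make the "ViT-Adapter-LLM" description in Group 3 more concrete, below is a heavily simplified, hypothetical PyTorch sketch of such a pipeline: a ViT-style encoder produces patch tokens, a token-packing module compresses them into a small fixed set of query tokens (the mechanism credited with speeding up inference), and an adapter projects the packed tokens into the LLM's embedding space. All module names, dimensions, and the cross-attention pooling choice are illustrative assumptions, not FastDrive's actual implementation.

```python
# Illustrative sketch of a ViT-Adapter-LLM pipeline with token packing.
# NOT the FastDrive implementation: sizes, module names, and the use of
# learned-query cross-attention for "token packing" are assumptions.
import torch
import torch.nn as nn


class TokenPacker(nn.Module):
    """Compress N visual tokens into a small fixed set via learned-query cross-attention."""

    def __init__(self, vis_dim: int, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, vis_dim) -> packed: (B, num_queries, vis_dim)
        q = self.queries.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        packed, _ = self.attn(q, vis_tokens, vis_tokens)
        return packed


class VisionLanguagePipeline(nn.Module):
    """ViT encoder -> token packing -> adapter projection into LLM embedding space."""

    def __init__(self, vis_dim: int = 768, llm_dim: int = 2048):
        super().__init__()
        # Stand-in ViT encoder; a real system would use a pretrained Vision Transformer.
        enc_layer = nn.TransformerEncoderLayer(vis_dim, nhead=8, batch_first=True)
        self.vit = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.packer = TokenPacker(vis_dim, num_queries=32)
        self.adapter = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                     nn.Linear(llm_dim, llm_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.vit(patch_tokens)   # (B, N, vis_dim) visual features
        packed = self.packer(feats)      # (B, 32, vis_dim) far fewer tokens
        return self.adapter(packed)      # (B, 32, llm_dim) LLM-ready embeddings


if __name__ == "__main__":
    # A multi-camera rig would concatenate patches from all views; one view used here.
    dummy_patches = torch.randn(1, 196, 768)
    visual_prefix = VisionLanguagePipeline()(dummy_patches)
    # `visual_prefix` would be prepended to the LLM's text embeddings so the model can
    # describe the scene, identify key objects, and output a driving decision.
    print(visual_prefix.shape)  # torch.Size([1, 32, 2048])
```

The efficiency argument the article points to is visible in the shapes: only 32 packed tokens, rather than hundreds of raw patch tokens per camera, enter the language model, which shortens the LLM's prefill stage and hence inference time.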