Vision-Language Model

Autonomous Driving Weekly Paper Selection: End-to-End, VLA, Perception, Decision-Making, and More
自动驾驶之心· 2025-08-20 03:28
Core Viewpoint
- The article surveys recent advances in autonomous driving research, highlighting innovative approaches and frameworks that strengthen autonomous systems in dynamic environments [2][4].

Group 1: End-to-End Autonomous Driving
- Several notable end-to-end papers are covered, including GMF-Drive, ME³-BEV, SpaRC-AD, IRL-VLA, and EvaDrive, which draw on techniques such as gated sensor fusion (a generic sketch follows this digest), deep reinforcement learning, and evolutionary adversarial strategies [8][10].

Group 2: Perception and VLM
- The VISTA paper introduces a vision-language model for predicting driver attention in dynamic environments, integrating visual and language processing for improved situational awareness [7].
- Safety-critical perception work also appears, including a survey of progressive BEV perception and the CBDES MoE model for decoupling functional modules [10].

Group 3: Simulation Testing
- The ReconDreamer-RL framework strengthens reinforcement learning with diffusion-based scene reconstruction, pointing toward more sophisticated simulation-testing methodologies [11].

Group 4: Datasets
- The STRIDE-QA dataset is introduced as a large-scale visual question answering resource for spatiotemporal reasoning in urban driving scenarios, reflecting the growing need for comprehensive datasets in autonomous driving research [12].
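The digest only name-checks gated fusion for GMF-Drive. As a rough illustration of the generic pattern the term refers to (not the paper's actual architecture), a minimal PyTorch-style gated fusion block that blends two sensor feature streams might look like this:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal gated fusion of two feature streams (e.g., camera and LiDAR).

    Illustrative sketch only: GMF-Drive's actual gated-fusion design is not
    reproduced here; this is the generic mechanism the term describes.
    """

    def __init__(self, dim: int):
        super().__init__()
        # The gate inspects both streams and decides, per channel,
        # how much of each stream to keep.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([feat_a, feat_b], dim=-1))
        # Convex combination of the two streams, weighted by the learned gate.
        return g * feat_a + (1.0 - g) * feat_b

# Usage: fuse hypothetical 256-dim camera and LiDAR features for 4 tokens.
fusion = GatedFusion(dim=256)
camera_feat = torch.randn(4, 256)
lidar_feat = torch.randn(4, 256)
fused = fusion(camera_feat, lidar_feat)  # shape: (4, 256)
```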
Fine-Tuning an Autonomous Driving VLM Based on the Open-Source Qwen2.5-VL
自动驾驶之心· 2025-08-08 16:04
Core Viewpoint
- The article covers advances in autonomous driving technology, focusing on the LLaMA Factory framework and the Qwen2.5-VL model, which together enable fine-tuning vision-language-action models for autonomous driving applications [4][5].

Group 1: LLaMA Factory Overview
- LLaMA Factory is an open-source, low-code framework for fine-tuning large models; it is popular in the open-source community, with over 40,000 stars on GitHub [3].
- The framework integrates widely used fine-tuning techniques, making it well suited to building autonomous driving assistants that interpret traffic conditions through natural language [3].

Group 2: Qwen2.5-VL Model
- Qwen2.5-VL serves as the project's foundation model, with strong results in visual recognition, object localization, document parsing, and long-video understanding [4].
- It ships in three sizes: the flagship Qwen2.5-VL-72B performs comparably to advanced models such as GPT-4o and Claude 3.5 Sonnet, while the smaller versions suit resource-constrained environments [4].

Group 3: CoVLA Dataset
- The CoVLA dataset, comprising 10,000 real driving scenes and over 80 hours of video, is used to train and evaluate vision-language-action models [5].
- It surpasses existing datasets in scale and annotation richness, providing a comprehensive platform for developing safer, more reliable autonomous driving systems [5].

Group 4: Model Training and Testing
- The article gives instructions for downloading and installing LLaMA Factory and the Qwen2.5-VL model, including commands for setting up the environment and testing the model (see the inference sketch after this digest) [6][7].
- Fine-tuning is tracked visually with the SwanLab tool, with emphasis on adjusting parameters to avoid out-of-memory issues [11][17].
- After training, the fine-tuned model gives higher-quality responses than the original in dialogue about autonomous driving risks [19].
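The setup and smoke-test steps are only summarized above. As a minimal sketch of checking a downloaded Qwen2.5-VL checkpoint before fine-tuning, assuming a recent transformers release that ships Qwen2_5_VLForConditionalGeneration (the image path, model size, and prompt are placeholders, not the article's exact commands):

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Assumed checkpoint; the smaller sizes suit resource-constrained GPUs.
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Hypothetical driving-scene frame; replace with a real image path.
image = Image.open("front_camera.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the traffic scene and list any driving risks."},
    ],
}]

# Build the chat prompt, run generation, and decode only the new tokens.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

A response that coherently describes the scene is a reasonable sanity check that the environment is set up correctly before launching a LLaMA Factory fine-tuning run on CoVLA-style data.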
SJTU & KargoBot's FastDrive: Structured Labels Make End-to-End Large Models Faster and Stronger
自动驾驶之心· 2025-06-23 11:34
Paper title: Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving
Paper authors: Hao Jiang, Chuan Hu, Yukang Shi, Yuan He, Ke Wang, Xi Zhang, Zhipeng Zhang
Paper link: https://www.arxiv.org/pdf/2506.05442
Author: Hao Jiang | Source: 深蓝AI

Introduction

Integrating human-like reasoning into end-to-end autonomous driving systems has recently become a frontier research area, and vision-language-model-based approaches in particular have drawn wide attention from both industry and academia. Existing VLM training paradigms depend heavily on datasets with free-form text annotations, as shown in Figure 1(a). Although such descriptions capture rich semantic information, two structurally different sentences that express nearly the same meaning increase the model's learning ...
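To make the free-form vs. structured contrast concrete, here is a minimal sketch of the two annotation styles; the field names and values are invented for illustration and are not FastDrive's actual label schema:

```python
# Free-form annotation: rich, but many differently-worded sentences
# can describe the same situation, which the text argues adds noise
# to what the model must learn.
free_form_caption = (
    "A pedestrian is crossing ahead while the light turns red, "
    "so the ego vehicle should slow down and stop."
)

# Structured annotation (hypothetical schema): a fixed vocabulary of
# keys and values removes the variation between equivalent phrasings.
structured_label = {
    "traffic_light": "red",
    "key_object": {"type": "pedestrian", "position": "front", "motion": "crossing"},
    "ego_action": "decelerate_and_stop",
}
```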