Core Insights
- The article walks through using the LLaMA Factory framework to fine-tune a large vision-language model for autonomous driving, using a small dataset of 400 images on a single NVIDIA RTX 3090 GPU with 24 GB of memory [1][2].

Group 1: LLaMA Factory Overview
- LLaMA Factory is an open-source, low-code framework for fine-tuning large models that has become popular in the open-source community, with over 40,000 stars on GitHub [1].
- The framework integrates widely used fine-tuning techniques and is designed to make it easy to train models for visual-language tasks in autonomous driving scenarios [1].

Group 2: Qwen2.5-VL Model
- The Qwen2.5-VL model serves as the base model for the project, bringing significant advances in visual recognition, object localization, document parsing, and long-video understanding [2].
- It is offered in three sizes: the flagship Qwen2.5-VL-72B performs comparably to advanced models such as GPT-4o and Claude 3.5 Sonnet, while the smaller versions are well suited to resource-constrained environments [2].

Group 3: CoVLA Dataset
- The CoVLA dataset, comprising 10,000 real driving scenes and over 80 hours of video, is used for training and evaluating vision-language-action models [3].
- It surpasses existing datasets in scale and annotation richness, providing a comprehensive platform for developing safer and more reliable autonomous driving systems [3].

Group 4: Model and Dataset Installation
- Instructions are provided for downloading and installing LLaMA Factory and the Qwen2.5-VL model, including commands for cloning the repository and installing the necessary dependencies (sketched below) [4][5][6].
- The CoVLA dataset can also be downloaded from Hugging Face, with settings that speed up the download (sketched below) [8][9].

Group 5: Fine-tuning Process
- The fine-tuning run is tracked visually with the SwanLab tool; commands for installing and setting it up are provided (sketched below) [14].
- After configuring the parameters and launching the fine-tuning task, training logs are displayed and the fine-tuned model is saved for later use (a configuration sketch follows below) [17][20].

Group 6: Model Testing and Evaluation
- After training, the fine-tuned model is tested through a web UI, where users can ask questions about autonomous driving risks and receive more relevant answers than the original model gives (sketched below) [22].
- The original model, while informative, may give less relevant responses, highlighting the benefit of fine-tuning for specific applications [22].
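The installation step in Group 4 can be reproduced with the commands below, which follow the public LLaMA Factory README; treat this as a minimal sketch rather than the article's exact sequence (the choice of the `torch` and `metrics` extras is an assumption).

```bash
# Clone LLaMA Factory and install it in editable mode.
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

# Sanity check: the CLI entry point should now be on PATH.
llamafactory-cli version
```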
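For the model and dataset downloads, a hedged sketch using `huggingface-cli`: the 7B Instruct checkpoint is an assumption (the article does not state which of the three Qwen2.5-VL sizes was used, but 7B with LoRA fits a 24 GB card), the mirror endpoint and `hf_transfer` backend are one common way to accelerate downloads, and the CoVLA repository id is left as a placeholder.

```bash
# Optional download acceleration: hf_transfer backend plus a mirror endpoint.
pip install -U "huggingface_hub[hf_transfer]"
export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_ENDPOINT=https://hf-mirror.com   # assumption: any reachable HF mirror works

# Base model (size choice is an assumption; see note above).
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct \
  --local-dir models/Qwen2.5-VL-7B-Instruct

# CoVLA dataset; replace <covla-repo-id> with the actual dataset repository id.
huggingface-cli download <covla-repo-id> --repo-type dataset \
  --local-dir data/CoVLA
```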
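LLaMA Factory reads custom datasets from `data/dataset_info.json`. The sketch below shows a registry entry of the kind the fine-tuning step needs, assuming the 400 CoVLA frames have been converted to the ShareGPT-style multimodal format used by the repository's `mllm_demo` example; the dataset name `covla_sft` and the file name are placeholders introduced here.

```bash
# Entry to merge into LLaMA-Factory/data/dataset_info.json (printed as a heredoc
# for readability; edit the JSON file directly rather than piping this in).
cat <<'EOF'
"covla_sft": {
  "file_name": "covla_sft.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  }
}
EOF
```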
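Group 5 mentions SwanLab for visual tracking of the run; a minimal setup sketch (the login step assumes you already have a SwanLab account and API key).

```bash
# Install SwanLab and authenticate so training runs appear in its dashboard.
pip install swanlab
swanlab login   # paste the API key from your SwanLab account when prompted
```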
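The fine-tuning launch in Group 5 could look like the sketch below: a LoRA SFT configuration followed by the train command. The keys mirror the Qwen2-VL LoRA examples shipped in `LLaMA-Factory/examples`; the hyperparameters, paths, template name, and SwanLab keys are illustrative assumptions, not the article's exact settings.

```bash
# Write an illustrative LoRA SFT config, then launch training.
cat > qwen2_5vl_covla_lora_sft.yaml <<'EOF'
model_name_or_path: models/Qwen2.5-VL-7B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: covla_sft            # name registered in data/dataset_info.json
template: qwen2_vl            # assumption: Qwen2.5-VL reuses the qwen2_vl chat template
cutoff_len: 2048
output_dir: saves/qwen2_5vl-covla-lora
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
logging_steps: 10
save_steps: 100
# SwanLab tracker keys (assumed from LLaMA Factory's experiment-tracking support).
use_swanlab: true
swanlab_project: covla-vlm-sft
EOF

llamafactory-cli train qwen2_5vl_covla_lora_sft.yaml
```

With these settings the LoRA adapter and training logs land in `output_dir`, which is what the later testing step loads.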
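Group 6 describes testing through a web UI; a sketch using LLaMA Factory's chat interface to load the base model together with the saved adapter (paths and template names as assumed above).

```bash
# Launch a Gradio chat UI serving the base model plus the fine-tuned LoRA adapter.
llamafactory-cli webchat \
  --model_name_or_path models/Qwen2.5-VL-7B-Instruct \
  --adapter_name_or_path saves/qwen2_5vl-covla-lora \
  --template qwen2_vl
```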
SFT of an Autonomous Driving VLM Based on Qwen2.5-VL
自动驾驶之心·2025-07-29 00:52