This Week's Selected Autonomous Driving Papers: End-to-End, VLA, Perception, Decision-Making, and More
自动驾驶之心· 2025-08-20 03:28
Core Viewpoint
- The article emphasizes recent advancements in autonomous driving research, highlighting innovative approaches and frameworks that enhance the capabilities of autonomous systems in dynamic environments [2][4].

Group 1: End-to-End Autonomous Driving
- The article discusses several notable papers focusing on end-to-end autonomous driving, including GMF-Drive, ME³-BEV, SpaRC-AD, IRL-VLA, and EvaDrive, which utilize techniques such as gated fusion, deep reinforcement learning, and evolutionary adversarial strategies [8][10].

Group 2: Perception and VLM
- The VISTA paper introduces a vision-language model for predicting driver attention in dynamic environments, showcasing the integration of visual and language processing for improved situational awareness [7].
- The article also mentions safety-critical perception work, such as a progressive BEV perception survey and the CBDES MoE model for functional module decoupling [10].

Group 3: Simulation Testing
- The article highlights the ReconDreamer-RL framework, which enhances reinforcement learning through diffusion-based scene reconstruction, indicating a trend toward more sophisticated simulation testing methodologies [11].

Group 4: Datasets
- The STRIDE-QA dataset is introduced as a large-scale visual question answering resource aimed at spatiotemporal reasoning in urban driving scenarios, reflecting the growing need for comprehensive datasets in autonomous driving research [12].
Fine-Tuning an Autonomous Driving VLM Based on the Open-Source Qwen2.5-VL
自动驾驶之心· 2025-08-08 16:04
Core Viewpoint
- The article discusses advancements in autonomous driving technology, focusing on the LLaMA Factory framework and the Qwen2.5-VL model, which enhance the capabilities of vision-language-action models for autonomous driving applications [4][5].

Group 1: LLaMA Factory Overview
- LLaMA Factory is an open-source, low-code framework for fine-tuning large models, popular in the open-source community with over 40,000 stars on GitHub [3].
- The framework integrates widely used fine-tuning techniques, making it suitable for developing autonomous driving assistants that can interpret traffic conditions through natural language [3].

Group 2: Qwen2.5-VL Model
- The Qwen2.5-VL model serves as the foundational model for the project, achieving significant breakthroughs in visual recognition, object localization, document parsing, and long-video understanding [4].
- It is offered in three model sizes; the flagship Qwen2.5-VL-72B performs comparably to advanced models such as GPT-4o and Claude 3.5 Sonnet, while the smaller versions excel in resource-constrained environments [4].

Group 3: CoVLA Dataset
- The CoVLA dataset, comprising 10,000 real driving scenes and over 80 hours of video, is used for training and evaluating vision-language-action models [5].
- This dataset surpasses existing datasets in scale and annotation richness, providing a comprehensive platform for developing safer and more reliable autonomous driving systems [5].

Group 4: Model Training and Testing
- Instructions for downloading and installing LLaMA Factory and the Qwen2.5-VL model are provided, including commands for setting up the environment and testing the model (a minimal inference sketch follows this summary) [6][7].
- The article details the process of fine-tuning the model using the SwanLab tool for visual tracking of the training process, emphasizing the importance of adjusting parameters to avoid out-of-memory issues [11][17].
- After training, the fine-tuned model demonstrates improved response quality in dialogue scenarios related to autonomous driving risks compared to the original model [19].
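As a concrete reference for the setup-and-testing step in Group 4, here is a minimal inference sketch using the Hugging Face transformers integration of Qwen2.5-VL. The 3B variant, the image path, and the prompt are placeholder assumptions for a quick smoke test before fine-tuning; the article's actual walkthrough goes through LLaMA Factory.

```python
# Minimal smoke test for Qwen2.5-VL via Hugging Face transformers.
# Assumes: a recent transformers release with Qwen2.5-VL support and the
# qwen-vl-utils helper package; "front_camera.jpg" is a placeholder image.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # smaller variant for limited GPU memory
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "front_camera.jpg"},
        {"type": "text", "text": "Describe the traffic situation and any risks for the ego vehicle."},
    ],
}]

# Build the chat prompt and collect the vision inputs referenced in `messages`.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The fine-tuning stage described in the article is then driven through LLaMA Factory's CLI (e.g., `llamafactory-cli train <config>.yaml`) with SwanLab enabled in the training configuration for experiment tracking; lowering the batch size, image resolution, or LoRA rank in that configuration is the usual lever for the memory issues the article warns about.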
SJTU & KargoBot's FastDrive: Structured Labels Make End-to-End Large Models Faster and Stronger
自动驾驶之心· 2025-06-23 11:34
Core Viewpoint
- The integration of human-like reasoning capabilities into end-to-end autonomous driving systems is a cutting-edge research area, with a focus on vision-language models (VLMs) [1].

Group 1: Structured Dataset and Model
- A structured dataset called NuScenes-S has been introduced; it focuses on key elements closely related to driving decisions, eliminating redundant information and improving reasoning efficiency [4][5].
- The FastDrive model, with 0.9 billion parameters, mimics human reasoning strategies and aligns effectively with end-to-end autonomous driving frameworks [4][5].

Group 2: Dataset Description
- The NuScenes-S dataset provides a comprehensive view of driving scenarios, addressing issues often overlooked in existing datasets. It includes key elements such as weather, traffic conditions, driving areas, traffic lights, traffic signs, road conditions, lane markings, and time [7][8].
- Dataset construction involved annotating scene information with both GPT and human input, refining the results through comparison and optimization [9].

Group 3: FastDrive Algorithm Model
- The FastDrive model follows the "ViT-Adapter-LLM" architecture, using a Vision Transformer for visual feature extraction and a token-packing module to enhance inference speed (a simplified architectural sketch follows this summary) [18][19].
- The model employs a large language model (LLM) to generate scene descriptions, identify key objects, predict future states, and make driving decisions in a chain-of-reasoning manner [19].

Group 4: Experimental Results
- Experiments on the NuScenes-S dataset, which contains 102,000 question-answer pairs, demonstrated that FastDrive achieves competitive performance in scene-understanding tasks [21].
- FastDrive showed strong results across perception, prediction, and decision-making tasks, outperforming other models [25].
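To make the "ViT-Adapter-LLM" description in Group 3 more concrete, below is a heavily simplified, hypothetical PyTorch sketch of such a pipeline: a ViT-style encoder produces patch tokens, a token-packing module compresses them into a small fixed set of query tokens (the mechanism credited with speeding up inference), and an adapter projects the packed tokens into the LLM's embedding space. All module names, dimensions, and the cross-attention pooling choice are illustrative assumptions, not FastDrive's actual implementation.

```python
# Illustrative sketch of a ViT-Adapter-LLM pipeline with token packing.
# NOT the FastDrive implementation: sizes, module names, and the use of
# learned-query cross-attention for "token packing" are assumptions.
import torch
import torch.nn as nn


class TokenPacker(nn.Module):
    """Compress N visual tokens into a small fixed set via learned-query cross-attention."""

    def __init__(self, vis_dim: int, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, vis_dim) -> packed: (B, num_queries, vis_dim)
        q = self.queries.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        packed, _ = self.attn(q, vis_tokens, vis_tokens)
        return packed


class VisionLanguagePipeline(nn.Module):
    """ViT encoder -> token packing -> adapter projection into LLM embedding space."""

    def __init__(self, vis_dim: int = 768, llm_dim: int = 2048):
        super().__init__()
        # Stand-in ViT encoder; a real system would use a pretrained Vision Transformer.
        enc_layer = nn.TransformerEncoderLayer(vis_dim, nhead=8, batch_first=True)
        self.vit = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.packer = TokenPacker(vis_dim, num_queries=32)
        self.adapter = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                     nn.Linear(llm_dim, llm_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.vit(patch_tokens)   # (B, N, vis_dim) visual features
        packed = self.packer(feats)      # (B, 32, vis_dim) far fewer tokens
        return self.adapter(packed)      # (B, 32, llm_dim) LLM-ready embeddings


if __name__ == "__main__":
    # A multi-camera rig would concatenate patches from all views; one view used here.
    dummy_patches = torch.randn(1, 196, 768)
    visual_prefix = VisionLanguagePipeline()(dummy_patches)
    # `visual_prefix` would be prepended to the LLM's text embeddings so the model can
    # describe the scene, identify key objects, and output a driving decision.
    print(visual_prefix.shape)  # torch.Size([1, 32, 2048])
```

The efficiency argument the article points to is visible in the shapes: only 32 packed tokens, rather than hundreds of raw patch tokens per camera, enter the language model, which shortens the LLM's prefill stage and hence inference time.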