Li Auto argues that in VLA, language input affects action accuracy more than visual input
理想TOP2· 2025-08-16 12:11
Core Viewpoint
- The article discusses the release of DriveAction, a benchmark for evaluating Vision-Language-Action (VLA) models, emphasizing that both visual and language inputs are needed for accurate action prediction [1][3].

Summary by Sections

DriveAction Overview
- DriveAction is the first action-driven benchmark specifically designed for VLA models, containing 16,185 question-answer pairs generated from 2,610 driving scenarios [3].
- The dataset is derived from real-world driving data collected from mass-produced assisted-driving vehicles [3].

Model Performance Evaluation
- The experiments indicate that even the most advanced Vision-Language Models (VLMs) require guidance from both visual and language inputs for accurate action prediction: average accuracy drops by 3.3% without visual input, 4.1% without language input, and 8.0% when both are absent [3][6].
- In the comprehensive evaluation, all models achieved their highest accuracy in the full V-L-A mode and their lowest in the no-information mode (A) [6]; a sketch of this ablation protocol follows this summary.

Specific Task Performance
- Metrics for specific tasks such as navigation, efficiency, and dynamic/static tasks reveal varying strengths across models [8].
- For instance, GPT-4o scored 66.8 on navigation-related visual questions, 75.2 on language questions, and 78.2 on execution questions, highlighting the models' differing capabilities [8].

Stability Analysis
- Stability was assessed by repeating each setting three times and reporting means and standard deviations; GPT-4.1 mini and Gemini 2.5 Pro showed strong stability, with standard deviations below 0.3 [9].
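To make the ablation-and-stability protocol above concrete, here is a minimal Python sketch: it scores a model under the four input modes (full V-L-A, no visual input, no language input, action-only), repeats each setting three times, and reports mean accuracy with standard deviation. The `model.predict` interface, the field names in `qa_pairs`, and the mode labels are illustrative assumptions, not the DriveAction reference implementation.

```python
# Hedged sketch of the ablation + stability protocol described above.
# Assumes a hypothetical model.predict(...) API and a list of QA dicts.
from statistics import mean, stdev

MODES = {
    "V-L-A": {"use_vision": True,  "use_language": True},   # full input
    "L-A":   {"use_vision": False, "use_language": True},   # no visual input
    "V-A":   {"use_vision": True,  "use_language": False},  # no language input
    "A":     {"use_vision": False, "use_language": False},  # no information
}

def run_once(model, qa_pairs, use_vision, use_language):
    """Accuracy (%) over multiple-choice QA pairs for one input configuration."""
    correct = 0
    for item in qa_pairs:
        answer = model.predict(
            image=item["image"] if use_vision else None,                # visual input
            instruction=item["instruction"] if use_language else None,  # language input
            question=item["question"],
        )
        correct += int(answer == item["gold"])
    return 100.0 * correct / len(qa_pairs)

def evaluate(model, qa_pairs, repeats=3):
    """Mean and standard deviation of accuracy per mode, as in the stability analysis."""
    report = {}
    for name, flags in MODES.items():
        scores = [run_once(model, qa_pairs, **flags) for _ in range(repeats)]
        report[name] = (mean(scores), stdev(scores))
    return report
```

Comparing the V-L-A row of such a report against the L-A, V-A, and A rows reproduces the kind of accuracy-drop analysis summarized above.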
Deploying end-to-end VLA for autonomous driving: how should the algorithms be designed?
自动驾驶之心· 2025-06-22 14:09
Core Insights
- The article discusses the rapid advances in end-to-end autonomous driving, focusing on Vision-Language-Action (VLA) models and their applications in the industry [2][3].

Group 1: VLA Model Developments
- AutoVLA, a new VLA model that integrates reasoning and action generation for end-to-end autonomous driving, shows promising results in semantic reasoning and trajectory planning [3][4].
- ReCogDrive addresses performance issues in rare and long-tail scenarios with a three-stage training framework that combines vision-language models with diffusion planners [7][9].
- Impromptu VLA introduces a dataset aimed at improving VLA models' performance in unstructured extreme conditions, demonstrating significant improvements on established benchmarks [14][24].

Group 2: Experimental Results
- AutoVLA achieved competitive performance across scenarios, with its best-of-N method reaching a PDMS score of 92.12, indicating its effectiveness in planning and execution (the best-of-N idea is sketched after this summary) [5].
- ReCogDrive set a new state-of-the-art PDMS of 89.6 on the NAVSIM benchmark, showcasing robust and safe driving trajectories [9][10].
- OpenDriveVLA demonstrated superior results in open-loop trajectory planning and driving-related question answering, outperforming previous methods on the nuScenes dataset [28][32].

Group 3: Industry Trends
- Major automotive manufacturers such as Li Auto, Xiaomi, and XPeng are investing heavily in VLA research and development, indicating a competitive landscape in autonomous driving technology [2][3].
- The integration of large language models (LLMs) with VLA frameworks is becoming a focal point for enhancing decision-making in autonomous vehicles, as seen in models like ORION and VLM-RL [33][39].
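The best-of-N selection credited to AutoVLA above can be illustrated with a short sketch: sample N candidate trajectories from the planner and keep the one with the highest driving score (a PDMS-style metric in that evaluation). The planner and scoring callables below are hypothetical placeholders, not the authors' code.

```python
# Hedged sketch of best-of-N trajectory selection; the planner sampler and the
# scoring function are placeholders standing in for a real planner rollout and
# a PDMS-style driving-metric scorer.
from typing import Callable, List, Sequence, Tuple

Trajectory = Sequence[Tuple[float, float]]  # planned (x, y) waypoints

def best_of_n(
    sample_trajectory: Callable[[], Trajectory],  # one stochastic planner rollout
    score: Callable[[Trajectory], float],         # higher is better (e.g. PDMS-like)
    n: int = 8,
) -> Trajectory:
    """Draw n candidate trajectories and return the highest-scoring one."""
    candidates: List[Trajectory] = [sample_trajectory() for _ in range(n)]
    return max(candidates, key=score)
```

Larger N trades extra planner compute for a better chance of finding a high-scoring trajectory, which is the usual motivation for best-of-N sampling.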