Core Viewpoint
- The article discusses the release of DriveAction, a benchmark for evaluating Vision-Language-Action (VLA) models, emphasizing that both visual and language inputs are needed for accurate action prediction [1][3].

Summary by Sections

DriveAction Overview
- DriveAction is the first action-driven benchmark designed specifically for VLA models, containing 16,185 question-answer pairs generated from 2,610 driving scenarios [3].
- The dataset is derived from real-world driving data collected from mass-produced vehicles with assisted-driving features [3].

Model Performance Evaluation
- The experiments indicate that even the most advanced Vision-Language Models (VLMs) require guidance from both visual and language inputs for accurate action prediction: average accuracy drops by 3.3% without visual input, by 4.1% without language input, and by 8.0% when both are absent [3][6].
- Across the comprehensive evaluation modes, all models achieved their highest accuracy in the full V-L-A mode and their lowest in the no-information mode (A) [6].

Specific Task Performance
- Performance metrics are reported for specific tasks such as navigation, efficiency, and dynamic/static tasks, showing varying strengths across models [8].
- For instance, GPT-4o scored 66.8 on navigation-related vision questions, 75.2 on language questions, and 78.2 on execution questions, highlighting the models' diverse capabilities [8].

Stability Analysis
- Stability was assessed by repeating each setting three times and computing the mean and standard deviation; GPT-4.1 mini and Gemini 2.5 Pro showed strong stability, with standard deviations below 0.3 [9]. A minimal sketch of this evaluation protocol follows below.
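To make the ablation and stability setup concrete, here is a minimal Python sketch, under stated assumptions: the `model.predict` interface, the `qa_pairs` field names, and the mode labels are hypothetical illustrations, not DriveAction's actual evaluation harness, which the article does not show. It scores a model under the four input modes (V-L-A, L-A, V-A, A) and reports the mean and standard deviation over three repeated runs.

```python
import statistics

# Hypothetical ablation modes following the article's description: full
# V-L-A input, language-only (no vision), vision-only (no language),
# and the no-information mode (A).
MODES = {
    "V-L-A": {"use_vision": True,  "use_language": True},
    "L-A":   {"use_vision": False, "use_language": True},
    "V-A":   {"use_vision": True,  "use_language": False},
    "A":     {"use_vision": False, "use_language": False},
}


def accuracy(model, qa_pairs, use_vision, use_language):
    """Percentage of QA pairs answered correctly under one input mode.

    `model.predict` is a hypothetical interface that accepts an optional
    camera image and an optional language instruction and returns the
    chosen answer label.
    """
    correct = 0
    for item in qa_pairs:
        pred = model.predict(
            image=item["image"] if use_vision else None,
            instruction=item["instruction"] if use_language else None,
            question=item["question"],
            options=item["options"],
        )
        correct += int(pred == item["answer"])
    return 100.0 * correct / len(qa_pairs)


def evaluate(model, qa_pairs, repeats=3):
    """Run each mode `repeats` times and report mean and standard deviation,
    mirroring the stability analysis described in the article."""
    results = {}
    for mode, flags in MODES.items():
        scores = [accuracy(model, qa_pairs, **flags) for _ in range(repeats)]
        results[mode] = (statistics.mean(scores), statistics.stdev(scores))
    return results
```

Under this kind of protocol, the article's headline numbers correspond to the gap between the V-L-A mode and each ablated mode, while the standard deviation across the repeated runs is what the stability analysis reports.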
Li Auto (理想) holds that, for VLA, language input has a greater impact on action accuracy than visual input.