VLA-Arena: An Open-Source Benchmark Framework for the Systematic Evaluation of VLAs
具身智能之心 · 2025-12-31 00:50

Research Background and Motivation
- Vision-Language-Action models (VLAs) are rapidly evolving into general-purpose robot policies, demonstrating capabilities such as cross-embodiment generalization, dexterous manipulation, and instruction following. However, there is little quantitative understanding of their capability boundaries, limitations, and failure modes, and existing benchmarks suffer from three core deficiencies [1][4].

Core Design: Structured Tasks and Benchmark Framework
- The VLA-Arena framework is proposed to address these issues, aiming, through systematic task design, to precisely characterize the capability frontiers and failure mechanisms of VLA models [1][4].
- The benchmark comprises 170 tasks organized along four dimensions, spanning difficulty levels L0 to L2 [6].

Key Components and Technical Details
- The framework extends the Behavior Domain Definition Language (BDDL) into the Constraint Behavior Domain Definition Language (CBDDL), built around two core enhancements [6][7]; a conceptual sketch of the constraint idea is given at the end of this article.
- The VLA-Arena-S/M/L datasets are provided, organized by task level (L0/L1) and trajectory count (10/30/50 per task), and are constructed from human demonstrations with preprocessing steps that ensure reproducibility [8].

Experimental Design and Main Findings
- The experiments evaluate models from two architectural paradigms, autoregressive models and continuous action-generation models, using success rate (SR) and cumulative cost (CC) as evaluation metrics [12][13]; a minimal sketch of these two metrics also follows at the end of this article.
- Key findings:
  1. Models show a strong tendency to memorize rather than generalize, with performance dropping sharply on L1 and L2 tasks [14].
  2. Robustness is asymmetric: models are generally resilient to language perturbations but vulnerable to visual disturbances [15].
  3. There is a trade-off between safety and performance, and models struggle to integrate safety constraints effectively [16].
  4. The ability to handle distractors varies: static distractors pose greater challenges than dynamic ones, and models fail on long-horizon tasks [19].
  5. Increasing data diversity can improve near-distribution performance but may harm far-distribution generalization [17].

Comparison with the LIBERO Benchmark
- VLA-Arena tasks demand deeper language understanding than LIBERO's: when instructions are withheld, performance on LIBERO declines far less, indicating that VLA-Arena is a more stringent test of semantic grounding in realistic scenarios [22].
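The article does not reproduce CBDDL's actual syntax, so the following is only a conceptual sketch, written in Python rather than the declarative language itself, of what a constraint-aware task specification adds on top of a BDDL-style goal: besides a goal predicate, a task declares cost-bearing constraints that are checked at every simulation step. All names here (`Constraint`, `TaskSpec`, `mug_tilt_deg`, the example task) are hypothetical illustrations, not CBDDL constructs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-ins for CBDDL concepts; the real CBDDL is a
# declarative extension of BDDL, not Python code.
State = Dict[str, float]            # simplified simulator state snapshot
Predicate = Callable[[State], bool]

@dataclass
class Constraint:
    """A cost-bearing condition checked at every step of a rollout."""
    name: str
    violated: Predicate             # True when the constraint is broken
    cost: float = 1.0               # cost charged per violating step

@dataclass
class TaskSpec:
    """Goal predicate (as in BDDL) plus constraints (the constraint extension)."""
    goal: Predicate
    constraints: List[Constraint] = field(default_factory=list)

# Illustrative task: place the mug on the shelf without tilting it too far.
mug_task = TaskSpec(
    goal=lambda s: s["mug_on_shelf"] > 0.5,
    constraints=[
        Constraint(
            name="keep_mug_upright",
            violated=lambda s: s["mug_tilt_deg"] > 30.0,
            cost=1.0,
        ),
    ],
)

def step_cost(task: TaskSpec, state: State) -> float:
    """Sum the costs of all constraints violated in the given state."""
    return sum(c.cost for c in task.constraints if c.violated(state))

# A tilted mug incurs cost even if the goal is eventually reached.
print(step_cost(mug_task, {"mug_on_shelf": 0.0, "mug_tilt_deg": 45.0}))  # 1.0
```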
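Success rate and cumulative cost are only named in the article, not defined, so the sketch below shows one plausible way to aggregate them over evaluation episodes, assuming each episode records a success flag and per-step constraint costs; the episode format and the per-episode averaging of CC are assumptions, not the paper's exact protocol.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EpisodeResult:
    success: bool            # did the rollout satisfy the task goal?
    step_costs: List[float]  # per-step constraint-violation costs

def success_rate(episodes: List[EpisodeResult]) -> float:
    """SR: fraction of evaluation episodes that reach the goal."""
    return sum(e.success for e in episodes) / len(episodes)

def cumulative_cost(episodes: List[EpisodeResult]) -> float:
    """CC: average total constraint cost accumulated per episode."""
    return sum(sum(e.step_costs) for e in episodes) / len(episodes)

# Toy rollouts: one success with a late violation, one failure with more cost.
results = [
    EpisodeResult(success=True,  step_costs=[0.0, 0.0, 1.0]),
    EpisodeResult(success=False, step_costs=[0.0, 2.0, 1.0]),
]
print(f"SR = {success_rate(results):.2f}, CC = {cumulative_cost(results):.2f}")
```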