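Are VLAs Failing Across the Board? Prof. Xipeng Qiu's Team at Fudan University and the Shanghai Innovation Institute Proposes LIBERO-Plus, Revealing the Truth About VLA Fragility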

Core Insights
- The article discusses a robustness analysis of Vision-Language-Action (VLA) models, revealing significant generalization deficiencies despite high scores under ideal conditions [2][4][6]
- The LIBERO-Plus framework is introduced to systematically evaluate VLA models across multiple perturbation dimensions, highlighting the gap between surface performance and actual generalization capability [4][6][33]

Group 1: Motivation and Contributions
- VLA models have achieved impressive success rates on benchmarks like LIBERO, but existing evaluation methods fail to assess stability and reliability under real-world variations [4][6]
- LIBERO-Plus evaluates models along seven perturbation dimensions: object placement, camera angle, robot initial pose, language instructions, lighting conditions, background textures, and sensor noise [4][6]
- The framework provides a detailed analysis of VLA models' generalization performance through systematic perturbation [4][6]

Group 2: Performance Analysis
- The analysis reveals that VLA models are broadly vulnerable to perturbations, with performance declining across all dimensions [13][32]
- Models are most sensitive to changes in camera perspective and robot initial state, indicating a need for high-level spatial and proprioceptive understanding [13][32]
- Language perturbations lead to the smallest average performance drop (-25.3%), suggesting a surprising level of robustness that warrants further investigation [15][17]

Group 3: Findings on Model Behavior
- Some models maintain performance even with empty language inputs, indicating a tendency to ignore the language modality and behave more like vision-action (VA) models [16][19]
- VLA models struggle with cross-object instruction following, relying on fixed vision-action mappings rather than fully leveraging language signals [19][20]
- The models demonstrate remarkable adaptability to background changes while showing limited sensitivity to lighting variations, raising questions about the representations they learn [20][27]

Group 4: Combination Generalization
- The concept of a "combination generalization gap" is introduced: different perturbations interact negatively, so their combined effect exceeds the independent effects of the single perturbations [29][32]
- The analysis indicates that current VLA models cannot effectively handle complex multi-dimensional perturbations because their representations are entangled [32]

Group 5: LIBERO-Plus Benchmark
- The LIBERO-Plus benchmark consists of 10,030 tasks designed to evaluate model performance under various perturbations, constructed using perturbation augmentation strategies [33][36]
- Its features include comprehensive coverage of the seven perturbation dimensions and fine-grained difficulty levels [36]
- Models trained with the enhanced data achieved an average success rate of 79.6% on LIBERO-Plus, significantly outperforming baseline models [38]
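
The combination generalization gap described in Group 4 can be illustrated with a small numerical sketch. The code below is not the formula used by the LIBERO-Plus authors; it assumes one simple independence baseline, in which each perturbation scales the clean success rate by its own single-perturbation ratio, and reports how far the observed joint success rate falls from that baseline. The function names and all success rates are hypothetical values chosen for illustration only.

```python
def independent_baseline(clean_rate, single_rates):
    """Success rate expected under combined perturbations IF their effects
    were independent (an assumed baseline, not the paper's definition).

    clean_rate   -- success rate with no perturbation, in (0, 1]
    single_rates -- success rates with each perturbation applied alone
    """
    if not 0.0 < clean_rate <= 1.0:
        raise ValueError("clean_rate must be in (0, 1]")
    expected = clean_rate
    for rate in single_rates:
        # Each perturbation keeps only a fraction rate / clean_rate of successes.
        expected *= rate / clean_rate
    return expected


def combination_gap(clean_rate, single_rates, observed_joint_rate):
    """Observed joint success rate minus the independence baseline.

    A negative value means the perturbations interact negatively:
    together they hurt more than their separate effects would predict.
    """
    return observed_joint_rate - independent_baseline(clean_rate, single_rates)


# Hypothetical numbers, for illustration only (not taken from the benchmark).
clean = 0.9              # success rate on unperturbed tasks
singles = [0.6, 0.75]    # e.g. camera-only and lighting-only success rates
observed = 0.3           # success rate with both perturbations applied

gap = combination_gap(clean, singles, observed)
print(f"independence baseline: {independent_baseline(clean, singles):.3f}")
print(f"combination gap:       {gap:+.3f}")
```

With these made-up inputs the gap is negative, which is the pattern the article attributes to entangled representations: the joint failure rate is worse than the single-perturbation results alone would suggest.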