A New Paradigm for 3D VLA: BridgeVLA, a CVPR Champion Solution, Improves Real-Robot Performance by 32%

Core Viewpoint
- The article discusses BridgeVLA, a model developed by the Institute of Automation, Chinese Academy of Sciences, which projects 3D inputs into 2D images for action prediction and achieves high performance and data efficiency in 3D robot manipulation learning [4][6].

Group 1: Model Performance
- BridgeVLA achieves a task success rate of 96.8% with only 3 trajectories in the basic setting and outperforms baseline models across a range of generalization settings, with a 32% improvement in real-robot performance [6][25].
- On the simulation benchmarks RLBench, COLOSSEUM, and GemBench, BridgeVLA outperforms mainstream 3D robot manipulation baselines, achieving an 88.2% success rate on RLBench, a 7.3% improvement on COLOSSEUM, and a 50% success rate on GemBench [20][25].

Group 2: Model Design and Training
- BridgeVLA's training process consists of two phases: 2D heatmap pre-training, which enhances spatial perception, and 3D action fine-tuning, which learns specific robot manipulation policies [15][17].
- In the pre-training phase, the model predicts a probability heatmap of the target object's location from a textual instruction, which strengthens its spatial awareness [16][25].

Group 3: Generalization and Data Efficiency
- BridgeVLA generalizes well and handles disturbances such as unseen objects, varied lighting conditions, and new object types, thanks to the rich visual and linguistic prior knowledge embedded in the pre-trained multimodal model [20][25].
- The model is highly data-efficient: it achieves nearly the same performance with only 3 trajectories as with 10 trajectories, which makes it suitable for deployment on real robotic systems [25][26].
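
The two ideas summarized above, projecting a 3D input into a 2D image and describing a target location as a probability heatmap over that image, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`project_top_view`, `gaussian_heatmap`, `argmax_2d`), the single top-down orthographic view, the grid size, and the Gaussian form of the heatmap are all assumptions chosen for illustration.

```python
import math


def project_top_view(points, bounds, size):
    """Orthographically project (x, y, z) points onto a size x size top-view image.

    Each pixel keeps the largest z that falls into it (a simple depth buffer);
    pixels that receive no point stay None. Hypothetical helper for illustration.
    """
    (xmin, xmax), (ymin, ymax) = bounds
    image = [[None] * size for _ in range(size)]
    for x, y, z in points:
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue  # drop points outside the workspace
        col = min(int((x - xmin) / (xmax - xmin) * size), size - 1)
        row = min(int((y - ymin) / (ymax - ymin) * size), size - 1)
        if image[row][col] is None or z > image[row][col]:
            image[row][col] = z
    return image


def gaussian_heatmap(center, size, sigma=1.5):
    """Build a size x size heatmap that peaks at `center` (row, col) and sums to 1."""
    cr, cc = center
    raw = [
        [math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2 * sigma ** 2)) for c in range(size)]
        for r in range(size)
    ]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def argmax_2d(heatmap):
    """Return the (row, col) of the largest heatmap value."""
    _, row, col = max(
        (v, r, c) for r, row_vals in enumerate(heatmap) for c, v in enumerate(row_vals)
    )
    return row, col
```

In this sketch a supervision target for a pixel location would be `gaussian_heatmap(target_pixel, size)`, and a predicted heatmap is turned back into a pixel location with `argmax_2d`. How BridgeVLA chooses its views, builds its targets, and converts heatmap peaks into 3D actions is described in the cited work, not here.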