CAS & ByteDance propose BridgeVLA! It takes first place at a CVPR 2025 workshop~
自动驾驶之心·2025-06-28 13:34

Core Viewpoint
- The article introduces BridgeVLA, a new paradigm for 3D Vision-Language-Action (VLA) models that improves data efficiency and manipulation performance in robotic tasks [3][21].

Group 1: Introduction of BridgeVLA
- BridgeVLA combines the strengths of existing 2D and 3D models by aligning inputs and outputs in a unified 2D space, thereby bridging the gap between Vision-Language Models (VLMs) and VLA models [5][21].
- The model achieves an 88.2% average success rate on the RLBench benchmark, outperforming all existing baseline methods [14][19].

Group 2: Pre-training and Fine-tuning
- In pre-training, the VLM learns to predict 2D heatmaps from image-text pairs that name a target object, strengthening its object-detection grounding [8][10].
- In fine-tuning, BridgeVLA predicts actions from point clouds and instruction text, keeping the inputs and outputs in the same 2D heatmap format as pre-training to preserve consistency (see the sketch after this summary) [11][12].

Group 3: Experimental Results
- On RLBench, BridgeVLA raises the average success rate from 81.4% to 88.2%, with the largest gains on high-precision tasks [14][15].
- On the COLOSSEUM benchmark, the model stays robust under various perturbations, raising the average success rate from 56.7% to 64.0% [16][19].

Group 4: Real-World Testing
- In real-robot evaluations, BridgeVLA outperforms the strongest baseline, RVT-2, in six of seven settings, and is notably robust to visual disturbances [18][19].
- The model retains its pre-training knowledge after fine-tuning, indicating effective learning and generalization [19].

Group 5: Future Directions
- Future work will explore more diverse pre-training tasks to broaden the model's general visual understanding, and will consider more expressive action-decoding methods to improve policy performance [21].
- The authors also plan to tackle long-horizon tasks by using large language models (LLMs) for task decomposition [21].
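To make the "unified 2D space" idea concrete, below is a minimal sketch of the two mechanisms summarized above: a Gaussian 2D heatmap as the pre-training supervision target, and recovery of a 3D end-effector translation from per-view heatmaps at fine-tuning time. Everything here is an illustrative assumption on our part, not the authors' code: the view names, workspace bounds, render resolution, and the two-votes-per-axis averaging are all hypothetical, and a full manipulation policy would additionally predict rotation and gripper state, which this sketch omits.

```python
"""Illustrative sketch of a unified 2D heatmap interface (assumptions ours,
not BridgeVLA's actual implementation): three orthographic views of the
point cloud, a 224x224 render, a fixed metric workspace, and translation
recovery by averaging each axis's two per-view votes."""
import numpy as np

H = W = 224                                  # assumed render resolution
WORKSPACE = np.array([[-0.5, 0.5],           # x extent in meters (assumed)
                      [-0.5, 0.5],           # y extent
                      [ 0.0, 1.0]])          # z extent
# Which workspace axes each view's (column, row) image axes correspond to.
VIEW_AXES = {"top": (0, 1), "front": (0, 2), "side": (1, 2)}


def gaussian_heatmap(center_uv, sigma=4.0, shape=(H, W)):
    """2D Gaussian target centered at pixel (u, v).

    Stands in for the pre-training label: the VLM is trained to place
    probability mass on the object named in the instruction text.
    """
    u, v = center_uv
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    hm = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
    return hm / hm.sum()                     # normalized like a distribution


def pixel_to_metric(pix, axis, size):
    """Map a pixel index on one image axis to meters on a workspace axis."""
    lo, hi = WORKSPACE[axis]
    return lo + (pix / (size - 1)) * (hi - lo)


def heatmaps_to_translation(heatmaps):
    """Recover a 3D translation from one predicted heatmap per view.

    Each orthographic view constrains two of the three workspace axes, so
    every axis receives exactly two votes; we average them.
    """
    votes = {0: [], 1: [], 2: []}
    for view, hm in heatmaps.items():
        row, col = np.unravel_index(np.argmax(hm), hm.shape)
        ax_u, ax_v = VIEW_AXES[view]
        votes[ax_u].append(pixel_to_metric(col, ax_u, W))
        votes[ax_v].append(pixel_to_metric(row, ax_v, H))
    return np.array([np.mean(votes[a]) for a in range(3)])


if __name__ == "__main__":
    # Toy check: fire all three views at the same pixel and read out a point.
    preds = {v: gaussian_heatmap((100, 120)) for v in VIEW_AXES}
    print(heatmaps_to_translation(preds))    # a 3-vector inside WORKSPACE
```

With such an interface, pre-training (image + text → heatmap) and fine-tuning (rendered point-cloud views + text → heatmaps → action) share one input/output format, which is the consistency the summary credits for the model's data efficiency.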