Core Viewpoint

- The BridgeVLA model marks a significant advance in 3D Vision-Language-Action (VLA) paradigms, achieving high success rates in robotic manipulation while remaining highly sample-efficient [4][6][22].

Group 1: Model Development

- BridgeVLA combines the strengths of 2D and 3D VLA models, aiming for both data efficiency and strong manipulation performance [2][3].
- Its core idea is to align the inputs and outputs of the Vision-Language Model (VLM) and the VLA policy in a unified 2D space, avoiding conventional 3D encoding schemes [6][7].
- The output head is changed from next-token prediction to heatmap prediction, which preserves the spatial structure of the observation and keeps inputs and outputs aligned in 2D (a minimal sketch of such a head follows this summary) [7][10].

Group 2: Training Methodology

- A novel, scalable pre-training recipe is introduced in which the model learns to predict 2D heatmaps from image-target-text pairs, strengthening its object-grounding ability (see the target-rendering sketch below) [8][10].
- At inference, a coarse-to-fine, multi-stage prediction scheme iteratively crops and re-processes the point cloud around intermediate estimates to refine them (see the cropping sketch below) [12].

Group 3: Experimental Results

- On RLBench, BridgeVLA raises the average success rate from 81.4% to 88.2%, outperforming existing baseline methods [14].
- On the COLOSSEUM benchmark, it stays robust under a wide range of perturbations, improving the average success rate from 56.7% to 64.0% [16].
- It achieves the highest average success rate on GemBench, excelling in particular in the L2 and L3 generalization settings [17].

Group 4: Real-World Application

- In real-robot evaluations, BridgeVLA outperforms the strongest baseline, RVT-2, in six of seven tested settings, demonstrating robustness to visual distractors [19][20].
- The 2D heatmap pre-training proves crucial for understanding language semantics and for generalizing to novel object-skill combinations [20].

Group 5: Future Directions

- Future work will explore more diverse pre-training tasks to strengthen the model's visual understanding [22].
- The authors also plan to integrate more expressive action-decoding schemes and to use large language models for task decomposition, targeting complex long-horizon tasks [22].
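To make the heatmap-prediction idea concrete, here is a minimal sketch of what a heatmap decoding head could look like: patch-level features from the vision backbone are projected to a single-channel logit map, upsampled to pixel resolution, and normalized with a softmax over all spatial locations. All module names, feature dimensions, and the upsampling factor are illustrative assumptions, not taken from the BridgeVLA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeatmapHead(nn.Module):
    """Decode a dense 2D heatmap over image pixels instead of emitting
    discrete action tokens. Shapes and names are illustrative."""

    def __init__(self, feat_dim: int = 768, upscale: int = 14):
        super().__init__()
        # A 1x1 conv collapses the feature channels to one logit per patch.
        self.to_logits = nn.Conv2d(feat_dim, 1, kernel_size=1)
        self.upscale = upscale  # e.g. the ViT patch size, to reach pixel resolution

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) patch-level features from the vision backbone
        logits = self.to_logits(feats)  # (B, 1, H, W)
        logits = F.interpolate(
            logits, scale_factor=self.upscale, mode="bilinear", align_corners=False
        )
        b, _, h, w = logits.shape
        # Softmax over all pixels yields a probability map over image locations.
        return logits.flatten(1).softmax(dim=-1).view(b, 1, h, w)

def argmax_2d(heatmap: torch.Tensor) -> torch.Tensor:
    """Return the (row, col) of the most likely pixel for each batch element."""
    b, _, h, w = heatmap.shape
    idx = heatmap.flatten(1).argmax(dim=-1)
    return torch.stack((torch.div(idx, w, rounding_mode="floor"), idx % w), dim=-1)

if __name__ == "__main__":
    head = HeatmapHead()
    feats = torch.randn(2, 768, 16, 16)  # dummy backbone features
    hm = head(feats)                     # (2, 1, 224, 224)
    print(argmax_2d(hm))                 # predicted pixel coordinates, shape (2, 2)
```

Because the target is a distribution over pixels rather than a token sequence, the 2D structure of the observation is preserved end to end, which is the point of moving away from next-token prediction.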
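For the pre-training stage, each image-target-text pair has to be turned into a dense supervision signal. A common way to do this, and the assumption made in this sketch, is to render a 2D Gaussian centered on the grounded object location and train the predicted heatmap against it with a pixel-wise cross-entropy; the sigma default and function names here are hypothetical.

```python
import torch

def gaussian_heatmap(h: int, w: int, center: tuple[int, int],
                     sigma: float = 8.0) -> torch.Tensor:
    """Render a normalized 2D Gaussian centered on the grounded object.

    h, w:    heatmap resolution
    center:  (row, col) of the object referred to by the target text
    sigma:   spread of the Gaussian in pixels (illustrative default)
    """
    ys = torch.arange(h, dtype=torch.float32).unsqueeze(1)  # (H, 1)
    xs = torch.arange(w, dtype=torch.float32).unsqueeze(0)  # (1, W)
    cy, cx = center
    g = torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return g / g.sum()  # normalize into a valid probability map

def heatmap_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Pixel-wise cross-entropy between predicted and target probability maps."""
    return -(target * pred.clamp_min(1e-9).log()).sum(dim=(-2, -1)).mean()
```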
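The coarse-to-fine refinement can be pictured as repeatedly cropping the point cloud around the current 3D estimate and re-processing the zoomed-in region in the next prediction pass. The sketch below shows only the cropping step; the radius schedule, array layout, and the commented-out helper functions are assumptions, not BridgeVLA's actual pipeline.

```python
import numpy as np

def crop_around_estimate(points: np.ndarray, coarse_xyz: np.ndarray,
                         radius: float) -> np.ndarray:
    """Keep only the points within `radius` meters of the coarse prediction,
    so the next stage sees a zoomed-in view and predicts a sharper heatmap.

    points:     (N, 3+) array, XYZ in the first three columns
    coarse_xyz: (3,) current 3D position estimate
    """
    dist = np.linalg.norm(points[:, :3] - coarse_xyz, axis=1)
    return points[dist < radius]

# Illustrative two-stage loop: predict on the full cloud, then on a crop.
# for radius in (0.30, 0.10):                 # coarse stage, then fine stage
#     views = render_views(cloud)             # hypothetical renderer
#     coarse_xyz = predict_3d_point(views)    # back-project heatmap peaks to 3D
#     cloud = crop_around_estimate(cloud, coarse_xyz, radius)
```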
A New Paradigm for 3D VLA! The Chinese Academy of Sciences & ByteDance Seed Propose BridgeVLA, Winning First Place in a CVPR 2025 Workshop Challenge!
机器之心·2025-06-24 01:46