Chinese Academy of Sciences & ByteDance propose BridgeVLA, winner of a CVPR 2025 workshop championship
自动驾驶之心· 2025-06-28 13:34
Core Viewpoint
- The article introduces BridgeVLA, a new paradigm for 3D Vision-Language-Action (VLA) models that improves data efficiency and manipulation performance in robotic tasks [3][21].

Group 1: Introduction of BridgeVLA
- BridgeVLA combines the strengths of existing 2D and 3D models by aligning inputs and outputs in a unified 2D space, bridging the gap between Vision-Language Models (VLMs) and VLA policies [5][21].
- The model achieves a success rate of 88.2% on the RLBench benchmark, outperforming all existing baseline methods [14][19].

Group 2: Pre-training and Fine-tuning
- In the pre-training phase, the VLM is taught to predict 2D heatmaps from image-target text pairs, strengthening its object detection capabilities (a minimal sketch of this heatmap-prediction objective follows this summary) [8][10].
- During fine-tuning, BridgeVLA predicts actions from point clouds and instruction text, keeping the input format aligned with the pre-training phase for consistency [11][12].

Group 3: Experimental Results
- On RLBench, BridgeVLA raises the average success rate from 81.4% to 88.2%, with particularly strong gains on high-precision tasks [14][15].
- On the COLOSSEUM benchmark, the model remains robust under various perturbations, lifting the average success rate from 56.7% to 64.0% [16][19].

Group 4: Real-World Testing
- In real-world evaluations, BridgeVLA outperforms the strongest baseline, RVT-2, in six out of seven settings, demonstrating robustness against visual disturbances [18][19].
- The model's ability to retain pre-training knowledge after fine-tuning indicates effective learning and generalization [19].

Group 5: Future Directions
- Future research will explore more diverse pre-training tasks to broaden the model's general visual understanding and consider more expressive action decoding methods to improve policy performance [21].
- The authors also plan to tackle long-horizon tasks by using large language models (LLMs) for task decomposition [21].
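To make the heatmap pre-training objective concrete, below is a minimal PyTorch sketch of heatmap-based grounding: a model fuses image features with a text embedding and is trained to place a 2D heatmap peak on the referred object. The tiny convolutional backbone, feature sizes, and single-pixel target used here are illustrative assumptions standing in for BridgeVLA's actual VLM backbone and training recipe, which the summary does not specify.

```python
# Minimal sketch of heatmap-based grounding pre-training (illustrative only; the layer
# sizes, backbone, and target construction are assumptions, not BridgeVLA's actual design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeatmapGrounder(nn.Module):
    """Toy model: fuse image features with a text embedding and predict a per-pixel heatmap."""
    def __init__(self, text_dim=64, feat_dim=64):
        super().__init__()
        # Tiny convolutional "backbone" standing in for a VLM vision tower.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # Project the instruction embedding into the same feature space.
        self.text_proj = nn.Linear(text_dim, feat_dim)

    def forward(self, image, text_emb):
        feats = self.backbone(image)                      # (B, C, H, W)
        query = self.text_proj(text_emb)                  # (B, C)
        # Dot-product similarity between every pixel feature and the text query.
        logits = torch.einsum("bchw,bc->bhw", feats, query)
        return logits                                     # unnormalized heatmap logits

def heatmap_loss(logits, target_heatmap):
    """Cross-entropy between the predicted spatial distribution and a target heatmap
    (e.g. a peak placed on the annotated object location)."""
    b, h, w = logits.shape
    log_probs = F.log_softmax(logits.view(b, -1), dim=-1)
    target = target_heatmap.view(b, -1)
    target = target / target.sum(dim=-1, keepdim=True)    # normalize to a distribution
    return -(target * log_probs).sum(dim=-1).mean()

# Usage with random tensors in place of real image / text / location triples.
model = HeatmapGrounder()
image = torch.rand(2, 3, 32, 32)
text_emb = torch.rand(2, 64)
target = torch.zeros(2, 32, 32)
target[:, 16, 16] = 1.0                                   # pretend the object sits at the center
loss = heatmap_loss(model(image, text_emb), target)
loss.backward()
```

Because the output lives on the image grid rather than in a token vocabulary, the same heatmap head can be reused unchanged when the model later predicts action locations during fine-tuning, which is the alignment the article emphasizes.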
A new paradigm for 3D VLA! Chinese Academy of Sciences & ByteDance Seed propose BridgeVLA, winner of a CVPR 2025 workshop championship
机器之心· 2025-06-24 01:46
Core Viewpoint
- The BridgeVLA model represents a significant advance in 3D Vision-Language-Action (VLA) paradigms, achieving high success rates in robotic manipulation through data-efficient training and effective manipulation strategies [4][6][22].

Group 1: Model Development
- BridgeVLA combines the strengths of 2D and 3D VLA models, aiming for both data efficiency and strong performance in robotic manipulation [2][3].
- Its core idea is to align the inputs and outputs of the Vision-Language Model (VLM) and the VLA policy in a unified 2D space, avoiding traditional 3D encoding schemes [6][7].
- The output is shifted from next-token prediction to heatmap prediction, which better exploits spatial structure and keeps inputs and outputs aligned in 2D space [7][10].

Group 2: Training Methodology
- A scalable pre-training method is introduced in which the model learns to predict 2D heatmaps from image-target text pairs, strengthening its object detection capabilities [8][10].
- The model adopts a coarse-to-fine, multi-stage prediction scheme that refines its estimates by iteratively re-processing the point cloud (see the sketch after this summary) [12].

Group 3: Experimental Results
- On RLBench, BridgeVLA raises the average success rate from 81.4% to 88.2%, outperforming existing baseline methods [14].
- On the COLOSSEUM benchmark, it remains robust under various perturbations, lifting the average success rate from 56.7% to 64.0% [16].
- In the GemBench evaluation, it achieves the highest average success rate, excelling particularly in the L2 and L3 settings [17].

Group 4: Real-World Application
- In real-world evaluations, BridgeVLA outperforms the strongest baseline, RVT-2, in six out of seven tested settings, demonstrating robustness to visual disturbances [19][20].
- Pre-training on 2D heatmaps proves crucial for understanding language semantics and generalizing to novel object-skill combinations [20].

Group 5: Future Directions
- Future research will explore more diverse pre-training tasks to enhance the model's visual understanding capabilities [22].
- The authors also plan to integrate more expressive action decoding methods and use large language models for task decomposition to improve performance on complex long-horizon tasks [22].
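The coarse-to-fine, multi-stage prediction mentioned above can be illustrated with a generic zoom-in scheme: localize a peak in a low-resolution view of the point cloud, then re-render a small window around that estimate at the same pixel resolution (hence finer metric resolution) and localize again. The NumPy sketch below uses a single top-down orthographic view and point density as a stand-in for the model's predicted heatmap; the workspace bounds, resolutions, and two-stage setup are assumptions, not BridgeVLA's actual multi-view pipeline.

```python
# Illustrative sketch of coarse-to-fine localization over a point cloud rendered into a
# top-down orthographic view. Point density stands in for a learned heatmap; all numbers
# (workspace bounds, resolutions, window size) are assumed for the example.
import numpy as np

def render_topdown(points, bounds, res):
    """Accumulate 3D points into a res x res top-down density image over the given XY bounds."""
    (x0, x1), (y0, y1) = bounds
    u = ((points[:, 0] - x0) / (x1 - x0) * (res - 1)).astype(int)
    v = ((points[:, 1] - y0) / (y1 - y0) * (res - 1)).astype(int)
    valid = (u >= 0) & (u < res) & (v >= 0) & (v < res)
    img = np.zeros((res, res))
    np.add.at(img, (v[valid], u[valid]), 1.0)
    return img

def peak_to_xy(heatmap, bounds):
    """Map the heatmap argmax back to continuous XY coordinates in the workspace."""
    (x0, x1), (y0, y1) = bounds
    res = heatmap.shape[0]
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.array([x0 + u / (res - 1) * (x1 - x0),
                     y0 + v / (res - 1) * (y1 - y0)])

# Fake scene: uniform clutter plus a dense cluster around the "target" at (0.31, -0.12).
rng = np.random.default_rng(0)
points = rng.uniform([-0.5, -0.5, 0.0], [0.5, 0.5, 0.3], size=(2000, 3))
points[:50] = rng.normal([0.31, -0.12, 0.1], 0.005, size=(50, 3))

# Stage 1 (coarse): full workspace at low resolution.
bounds = ((-0.5, 0.5), (-0.5, 0.5))
coarse_xy = peak_to_xy(render_topdown(points, bounds, res=32), bounds)

# Stage 2 (fine): zoom into a small window around the coarse estimate and re-render at the
# same pixel resolution, which yields a finer metric resolution for the refined estimate.
w = 0.1
fine_bounds = ((coarse_xy[0] - w, coarse_xy[0] + w), (coarse_xy[1] - w, coarse_xy[1] + w))
fine_xy = peak_to_xy(render_topdown(points, fine_bounds, res=32), fine_bounds)
print("coarse:", coarse_xy, "fine:", fine_xy)
```

The design choice this illustrates is that each stage only ever predicts a 2D peak, so the same heatmap-style output used in pre-training can drive progressively more precise 3D localization without a separate 3D decoding head.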