中科第五纪 and a team from the Institute of Automation, Chinese Academy of Sciences, launch BridgeV2W, teaching robots to "rehearse the future"
机器人大讲堂·2026-02-12 09:15

Core Insights
- The article discusses BridgeV2W, a system designed to enhance robots' predictive capabilities by letting them simulate actions before execution, bridging the gap between video generation models and embodied world models [1][20].

Group 1: Challenges in Embodied World Models
- Current embodied world models face three main challenges. First, there is a language barrier between robot actions (joint angles and positions) and video generation models (pixels), which makes actions hard to understand and predict [3][4].
- Second, the appearance of the same action can vary significantly across camera angles, so prediction quality drops when the viewpoint changes [3].
- Third, different robot structures require distinct model architectures, making a unified world model hard to build [4].

Group 2: Innovations of BridgeV2W
- BridgeV2W introduces the concept of an "Embodiment Mask," which renders robot actions as binary silhouettes in images, effectively mapping action coordinates into pixel space [5][6].
- The mask is integrated as a conditional signal into pre-trained video generation models, enhancing their ability to understand robot actions while preserving strong visual priors, which addresses all three challenges above [6].

Group 3: Experimental Validation
- The research team validated BridgeV2W across varied settings, including different robot platforms and unseen viewpoints, demonstrating its robustness and adaptability [7][8].
- On DROID, one of the largest real-world robot manipulation datasets, BridgeV2W outperformed state-of-the-art methods on key metrics such as PSNR and SSIM, particularly in unseen-viewpoint tests [8][10].

Group 4: Practical Applications
- BridgeV2W can train on vast amounts of unannotated human video, enabling scalable and effective world-model training without extensive calibration [14][15].
- The system can evaluate strategies inside the world model without real robots, significantly reducing the cost of strategy iteration [14].
- It can also plan actions from target images, closing the loop from visual goals to physical actions [14].

Group 5: Future Prospects
- The article suggests that the capabilities demonstrated by BridgeV2W are just the beginning, with further gains expected as video generation models and training data scale up [21][22].
- Combining video generation models with embodiment masks offers a promising path toward scalable, accurate robotic world models, paving the way for general embodied intelligence [17][19].
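The summary does not detail how the Embodiment Mask is rendered, but the core idea of Group 2 is to map robot action coordinates into pixel space as a binary silhouette. A minimal sketch of that idea, projecting robot link keypoints through a pinhole camera and rasterizing a mask (the function name, disc rasterization, and interfaces are illustrative assumptions, not the authors' pipeline):

```python
import numpy as np

def render_embodiment_mask(keypoints_3d, K, T_cam, hw=(240, 320), radius=6):
    """Rasterize robot link keypoints (world frame) as a binary silhouette.

    keypoints_3d: (N, 3) link/joint positions in world coordinates
    K: (3, 3) camera intrinsics; T_cam: (4, 4) world-to-camera extrinsics
    Hypothetical helper illustrating the 'embodiment mask' idea, not the
    authors' actual rendering pipeline (which would use full link meshes).
    """
    h, w = hw
    mask = np.zeros((h, w), dtype=np.uint8)
    pts = np.hstack([keypoints_3d, np.ones((len(keypoints_3d), 1))])
    cam = (T_cam @ pts.T).T[:, :3]               # world -> camera frame
    cam = cam[cam[:, 2] > 0]                     # keep points in front of camera
    uv = (K @ cam.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)    # perspective divide -> pixels
    yy, xx = np.mgrid[0:h, 0:w]
    for u, v in uv:                              # stamp a filled disc per point
        mask |= ((xx - u) ** 2 + (yy - v) ** 2 <= radius ** 2).astype(np.uint8)
    return mask
```

Because the mask lives in pixel space, the same conditioning works for any robot morphology and any camera pose, which is what lets one video model serve as a unified world model.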
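Group 3 cites PSNR and SSIM as the key prediction-quality metrics. PSNR is simply a log-scale function of the mean squared error between predicted and ground-truth frames; a minimal reference implementation is below (SSIM, a windowed structural comparison, is more involved and omitted here):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two frames, in dB.

    Higher is better; identical frames give +inf. `max_val` is the
    maximum possible pixel value (1.0 for normalized floats, 255 for uint8).
    """
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Unseen-viewpoint tests are where pixel-space conditioning should pay off most, since a mask rendered from the new camera pose carries the viewpoint change that a coordinate-space action vector cannot.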
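Group 4's claim that strategies can be evaluated "in the world model without real robots" is the standard model-based evaluation loop: roll the policy out against the model's imagined transitions and score the final state. A generic sketch, where `policy`, `model_step`, and `score` are hypothetical interfaces (BridgeV2W's actual API is not described in the summary):

```python
def evaluate_policy_in_model(policy, model_step, score, init_obs, horizon=20):
    """Roll a policy out inside a learned world model and score the result.

    policy(obs) -> action; model_step(obs, action) -> predicted next obs;
    score(obs) -> float, higher is better. All interfaces are hypothetical:
    this sketches model-based policy evaluation, not BridgeV2W's real API.
    """
    obs = init_obs
    for _ in range(horizon):
        obs = model_step(obs, policy(obs))  # imagined transition, no robot
    return score(obs)
```

Planning from a target image fits the same loop run in reverse: search over action sequences for the one whose imagined final frame best matches the goal image.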
