YOTO: A Framework for Dual-Arm Collaborative Dexterous Manipulation
Ahead of Tesla: A Chinese Team Uses Video Learning to Teach Robots Manipulation
机器人大讲堂· 2025-09-28 00:30
Core Insights
- Tesla's decision to train its Optimus robot on videos of employee operations signals a shift in embodied-intelligence learning paradigms, away from traditional motion-capture pipelines [1]
- The Chinese team at Kuawei Intelligent has already fielded a similar approach with its YOTO (You Only Teach Once) framework, training dual-arm robots from as little as 30 seconds of video and achieving strong generalization without large volumes of real-robot data [1][2]

Video Learning Framework
- The upgraded video-learning framework lets dual-arm robots autonomously recognize the state of task objects and reach a 95% task success rate even under random disturbances [2]
- Video learning translates the spatiotemporal behavior patterns and semantic intentions exhibited in human demonstrations into executable manipulation policies, sharply reducing reliance on manual teaching or expensive teleoperation data [2][3]

Challenges in Video Learning
- Video learning faces inherent challenges: embodiment differences between human and robot, missing physical-interaction information, perception noise, and difficulty maintaining long-horizon consistency and learning phase-structured policies [3][4]
- Recent research addresses these challenges through large-scale video pre-training and unsupervised video distillation, aiming at generalizable visual-action representations [3][4]

Solutions to Core Deficiencies
- To handle embodiment differences and long-horizon consistency, the team distills human demonstrations into semantic keyframe sequences and motion masks, making motion retargeting more stable and easier to correct (a toy keyframe-extraction sketch follows this list) [5]
- The framework employs demonstration-driven rapid example proliferation and lightweight visual-alignment modules to establish reliable correspondences between visual observation and real execution, significantly improving task success under dynamic disturbances [7][11]
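The digest does not specify how YOTO selects keyframes, but a common heuristic in imitation learning is to treat low-motion segments of a demonstration (pauses around grasps and releases) as semantic boundaries. The sketch below is purely illustrative under that assumption; the function name, thresholds, and synthetic data are all hypothetical, not YOTO's actual criterion.

```python
"""Toy keyframe extraction from a demonstration trajectory.

A minimal sketch, assuming keyframes coincide with low-motion
segments of the demonstrator's wrist track; the real YOTO
criterion is not given in this digest.
"""
import numpy as np

def extract_keyframes(positions: np.ndarray,
                      vel_thresh: float = 0.01,
                      min_gap: int = 10) -> list[int]:
    """Return candidate keyframe indices for a (T, 3) wrist-position track."""
    # Frame-to-frame speed of the demonstrator's wrist.
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    keyframes = [0]  # always keep the first frame
    for t, v in enumerate(speed, start=1):
        # Near-zero speed suggests a semantic pause (e.g., object contact);
        # min_gap suppresses duplicate hits within the same pause.
        if v < vel_thresh and t - keyframes[-1] >= min_gap:
            keyframes.append(t)
    if keyframes[-1] != len(positions) - 1:
        keyframes.append(len(positions) - 1)  # always keep the last frame
    return keyframes

# Usage: a synthetic 30-second demonstration at 30 fps (900 frames).
demo = np.cumsum(np.random.randn(900, 3) * 0.005, axis=0)
print(extract_keyframes(demo))
```

Reducing a demonstration to such a sparse keyframe sequence is what makes retargeting across embodiments tractable: only the keyframes need to be mapped to robot poses, with motion in between left to interpolation or a learned policy.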
Integration with Large Models
- The framework complements the trend of using large models for semantic guidance, pairing multimodal large models for robust perception with keyframe and diffusion policies for action representation and generation [8]
- This dual approach mirrors industry trends: companies such as Google and Tesla are exploring the integration of large multimodal models with robot control to improve cross-task generalization [8][9]

Data Pyramid Concept
- Sources of video imitation-learning samples are stratified into a data pyramid: a base of vast unlabeled internet video, a middle layer of semi-structured human demonstration data, and a top layer of verified real-robot data [9][11]
- The Kuawei Intelligent video-learning framework is designed to mine the lower and middle layers for rapid acquisition of semantic and spatiotemporal priors, yielding a closed-loop system that is efficient, scalable, and verifiable [11]

Sim2Real and Robustness
- Combined with Sim2Real techniques, the video-learning framework gives the VLA model strong generalization, exceeding a 95% task success rate in home-service scenarios with only a small number of real-robot samples [12][14]
- The dual-arm robot shows high robustness and adaptability, autonomously deciding which arm to use based on proximity to the task object (a minimal sketch appears at the end of this digest), underscoring the model's potential for intelligent, scalable deployment across varied environments [15][17]

Future of Embodied Intelligence
- This technology is poised to redefine the development path of industrial intelligence, moving toward an era of "co-creation by everyone" (全民共创) in which robots learn from everyday demonstrations, broadening their applicability across industries [19]
- The success of the Kuawei Intelligent video-learning framework shows that video is not merely a data carrier but a universal language through which robots understand the world, enabling knowledge transfer across time and space [19]
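Returning to the arm-selection behavior noted under Sim2Real and Robustness: the report says the robot picks whichever arm is closer to the task object. A minimal sketch of that decision rule follows, assuming hypothetical arm-base coordinates and an object position supplied by the perception stack; the actual system's selection logic is not detailed in the digest.

```python
"""Toy proximity-based arm selection for a dual-arm robot.

A minimal sketch of the behavior described above; the arm-base
positions are illustrative assumptions, not the robot's real geometry.
"""
import numpy as np

# Hypothetical shoulder/base positions of the two arms in the robot frame (meters).
LEFT_BASE = np.array([0.0, 0.25, 0.4])
RIGHT_BASE = np.array([0.0, -0.25, 0.4])

def select_arm(object_pos: np.ndarray) -> str:
    """Pick the arm whose base is closer to the detected task object."""
    d_left = np.linalg.norm(object_pos - LEFT_BASE)
    d_right = np.linalg.norm(object_pos - RIGHT_BASE)
    return "left" if d_left <= d_right else "right"

# Usage: an object detected 30 cm to the robot's right.
print(select_arm(np.array([0.5, -0.3, 0.1])))  # -> "right"
```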