Human2Robot
AAAI 2026 Oral | Can robots "learn by watching people"? A single demonstration is enough to learn a new task!
具身智能之心 · 2025-12-12 01:22
Core Insights
- The article discusses a novel approach to robot learning from human demonstration, emphasizing fine-grained action alignment between human and robot movements [3][4][8].
- The proposed method, Human2Robot, combines a new dataset (H&R) with a two-stage framework, enabling one-shot generalization to new tasks [3][4][9].

Summary by Sections

Introduction
- Existing methods rely on coarse alignment of human-robot video pairs, which often fails to capture the fine-grained actions needed for task generalization [3][8].

Methodology
- A new dataset, H&R, consisting of 2,600 synchronized human and robot action videos, is introduced to support learning of human-to-robot action correspondences [9].
- The Human2Robot framework consists of two stages: a Video Prediction Model (VPM) and an Action Decoder (a minimal interface sketch follows this summary) [12][16].

Video Prediction Model (VPM)
- The VPM generates robot action videos conditioned on human demonstrations, allowing the model to learn detailed action dynamics [13][14].
- The model captures key information about the robot's embodiment and human hand movements through a Spatial UNet and a Spatial-Temporal UNet [15].

Action Decoder
- The Action Decoder translates the generated video features into concrete robot actions, enabling real-time task execution without requiring continuous video input [16][20].

Experimental Results
- Human2Robot outperforms existing baselines, improving success rates by 10-20% across a range of tasks, demonstrating the value of conditioning on detailed human video [20][27].
- With KNN-based retrieval of demonstrations, Human2Robot still performs well even without a live demonstration as input, indicating robust task execution (a retrieval sketch also follows this summary) [20][27].

Generalization Capability
- Human2Robot generalizes across new positions and object instances, which the authors attribute to the clear action correspondences established by the H&R dataset [27].

Ablation Studies
- Relying solely on human video input without the video generation stage leads to poor performance, confirming that the VPM is necessary for reliable action mapping [25][26].
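To make the two-stage structure concrete, below is a minimal, hypothetical sketch of the interface described above: a video-prediction stage that turns a single human demonstration plus the current robot view into predicted robot-video features, and an action decoder that maps those features to an action chunk. The module names, dimensions, and the simple CNN/GRU/MLP internals are illustrative assumptions only; the actual Human2Robot model uses diffusion-style Spatial and Spatial-Temporal UNets, which are not reproduced here.

```python
# Sketch of the VPM -> Action Decoder pipeline (all names and internals are hypothetical).
import torch
import torch.nn as nn

class VideoPredictionModel(nn.Module):
    """Toy stand-in for the VPM: maps human-demo frames plus the current robot
    observation to a short sequence of predicted robot-video features."""
    def __init__(self, feat_dim=256, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, feat_dim * horizon)

    def forward(self, human_video, robot_obs):
        # human_video: (B, T, 3, H, W); robot_obs: (B, 3, H, W)
        B, T = human_video.shape[:2]
        demo_feats = self.frame_encoder(human_video.flatten(0, 1)).view(B, T, -1)
        _, h = self.temporal(demo_feats)           # summarize the demonstration
        obs_feat = self.frame_encoder(robot_obs)   # encode the current robot view
        fused = h[-1] + obs_feat
        # Predicted robot-video features for the next `horizon` steps
        return self.head(fused).view(B, self.horizon, -1)

class ActionDecoder(nn.Module):
    """Toy stand-in for the action decoder: predicted video features -> actions."""
    def __init__(self, feat_dim=256, action_dim=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim))

    def forward(self, video_feats):
        # video_feats: (B, horizon, feat_dim) -> (B, horizon, action_dim)
        return self.mlp(video_feats)

if __name__ == "__main__":
    vpm, decoder = VideoPredictionModel(), ActionDecoder()
    human_video = torch.randn(1, 16, 3, 64, 64)   # one human demonstration
    robot_obs = torch.randn(1, 3, 64, 64)         # current robot camera frame
    actions = decoder(vpm(human_video, robot_obs))
    print(actions.shape)  # torch.Size([1, 8, 7])
```

The key design point the summary emphasizes is the separation of concerns: the first stage learns how a human motion should look when executed by the robot, while the second stage only has to read actions off the predicted robot video, which is why execution does not need a continuous human video stream.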
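The KNN variant mentioned in the experimental results can likewise be sketched as a retrieval step: pick the stored human demonstration whose scene embedding is closest to the current observation, so the VPM can be conditioned on a retrieved demo instead of a freshly recorded one. The embedding function, demo-bank structure, and distance metric below are placeholders, not the paper's implementation.

```python
# Sketch of KNN demonstration retrieval (embedding and data are placeholders).
import numpy as np

def embed(frame: np.ndarray) -> np.ndarray:
    """Placeholder scene embedding; a real system would use a learned encoder."""
    return frame.mean(axis=(0, 1))  # mean color as a stand-in feature

def knn_retrieve(query_frame, demo_bank, k=1):
    """Return the k stored demos whose scene embeddings are closest to the query."""
    q = embed(query_frame)
    dists = [np.linalg.norm(embed(d["first_frame"]) - q) for d in demo_bank]
    order = np.argsort(dists)[:k]
    return [demo_bank[i] for i in order]

# Usage: build a bank of human demos offline, then retrieve one at test time.
demo_bank = [
    {"name": "pick_cup", "first_frame": np.random.rand(64, 64, 3), "video": None},
    {"name": "open_drawer", "first_frame": np.random.rand(64, 64, 3), "video": None},
]
current_scene = np.random.rand(64, 64, 3)
retrieved = knn_retrieve(current_scene, demo_bank, k=1)
print(retrieved[0]["name"])  # nearest demonstration to condition the VPM on
```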