Manual2Skill

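RSS 2025 | Learning Complex Robot Manipulation Tasks from Manuals: Lin Shao's Team at NUS Proposes Manual2Skill, a New Framework for Learning Robotic Assembly Skills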
机器之心 (Jiqizhixin) · 2025-05-29 04:53
Core Viewpoint - The article presents Manual2Skill, a framework that uses Vision-Language Models (VLMs) to let robots autonomously understand visual manuals and carry out complex furniture assembly tasks, bridging the gap between abstract instructions and physical execution [3][35].

Summary by Sections

Research Background - Furniture assembly is a complex long-horizon task that requires a robot to understand part relationships, estimate poses, and generate feasible actions. Existing methods often rely on imitation learning or reinforcement learning, which need large datasets and substantial computational resources, limiting their applicability in real-world scenarios [6][35].

Manual2Skill Framework - Manual2Skill consists of three core phases:

1. **Hierarchical Assembly Graph Generation**: Converts human-readable manuals into executable task plans, using VLMs to generate a hierarchical assembly graph that encodes the relationships between furniture parts [10][14].
2. **Step-by-Step Pose Estimation**: Predicts the 6D poses of all parts involved in each assembly step, allowing precise physical alignment. This step-wise formulation improves learning of basic connection methods that carry across different furniture shapes [12][13].
3. **Robot Assembly Action Generation and Execution**: Translates the predicted poses into real-world robot actions, using heuristic grasping strategies and robust motion planning algorithms for part manipulation [18][35].

Experimental Results and Analysis - The framework was tested on a variety of IKEA furniture in both simulation and real-world environments, demonstrating robustness and effectiveness. The hierarchical assembly graph generation method outperformed baseline methods, especially on furniture of simple to medium complexity [20][29][35].

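
To make the first phase of the framework more concrete, the hierarchical assembly representation can be pictured as a tree whose leaves are atomic parts and whose internal nodes are subassemblies. The sketch below is only an illustration under that assumption, not the authors' implementation: the `AssemblyNode` class, the `assembly_steps` helper, and the stool part names are all hypothetical. It shows how a post-order traversal of such a tree yields an ordered list of assembly steps in which every subassembly is built before the step that uses it.

```python
from dataclasses import dataclass, field


@dataclass
class AssemblyNode:
    """Hypothetical node of a hierarchical assembly structure.

    Leaves are atomic parts; internal nodes are subassemblies
    obtained by joining their children in one manual step.
    """
    name: str
    children: list["AssemblyNode"] = field(default_factory=list)

    def is_part(self) -> bool:
        # A node with no children is an atomic part.
        return not self.children


def assembly_steps(root: AssemblyNode) -> list[tuple[str, list[str]]]:
    """Return (result, inputs) pairs in an executable order.

    Post-order traversal: children are visited before their parent,
    so each subassembly exists before it is consumed.
    """
    steps: list[tuple[str, list[str]]] = []

    def visit(node: AssemblyNode) -> None:
        if node.is_part():
            return  # atomic parts need no assembly step
        for child in node.children:
            visit(child)
        steps.append((node.name, [c.name for c in node.children]))

    visit(root)
    return steps


# Illustrative (made-up) stool: a leg frame is built first,
# then joined with the seat.
leg_frame = AssemblyNode(
    "leg_frame",
    [AssemblyNode("leg_1"), AssemblyNode("leg_2"), AssemblyNode("crossbar")],
)
stool = AssemblyNode("stool", [AssemblyNode("seat"), leg_frame])

steps = assembly_steps(stool)
for result, inputs in steps:
    print(f"join {inputs} -> {result}")
```

In a pipeline like the one summarized here, each such step would then be handed to the pose estimation phase, which predicts the 6D poses of the parts listed as inputs; that hand-off is not modeled in this sketch.
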
Conclusion and Outlook - Manual2Skill represents a new paradigm in robot learning: robots learn complex manipulation skills from human-designed manuals, which significantly reduces the cost and complexity of skill acquisition. The framework captures the underlying structure and logic of the operations, enabling effective generalization across different configurations and conditions [35].