Institute of Automation, Chinese Academy of Sciences: A Vision-Tactile-Language-Action Model Approach and Dataset Creation
具身智能之心 · 2025-07-30 00:02
Core Viewpoint
- The article presents a Vision-Tactile-Language-Action (VTLA) model designed to improve robot manipulation in contact-rich scenarios by integrating visual and tactile inputs with language instructions [2].

Group 1: Model Development
- The VTLA framework addresses the gap in applying vision-language models (VLMs) to language-conditioned robotic manipulation, particularly beyond visually dominated tasks [2].
- A low-cost multimodal dataset of visual-tactile-action-instruction pairs was built in a simulated environment, targeting fingertip insertion tasks; a minimal sketch of such a sample appears after this summary [2].

Group 2: Performance and Results
- The VTLA model achieved a success rate above 90% on unseen hole types, significantly outperforming traditional imitation learning methods and existing multimodal baselines [2].
- Real-world peg-in-hole assembly experiments further validated the model, demonstrating strong simulation-to-reality (Sim2Real) transfer [2].
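To make the dataset and model interface described above more concrete, below is a minimal Python sketch of a visual-tactile-action-instruction sample and a VTLA-style policy call. The class names, field names, tensor shapes, and the 6-DoF delta-pose action format are illustrative assumptions, not the authors' released code or data schema.

```python
# Hypothetical sketch of a visual-tactile-action-instruction sample and a
# VTLA-style policy interface. All names, shapes, and the action format are
# assumptions for illustration; they are not the paper's released API.
from dataclasses import dataclass
import numpy as np


@dataclass
class VTLASample:
    rgb_image: np.ndarray      # (H, W, 3) camera frame rendered in simulation
    tactile_image: np.ndarray  # (Ht, Wt, 3) fingertip tactile sensor reading
    instruction: str           # language instruction for the insertion task
    action: np.ndarray         # (6,) delta pose [dx, dy, dz, droll, dpitch, dyaw]


class VTLAPolicy:
    """Toy placeholder for a vision-tactile-language-action model.

    A real VTLA model would encode the visual and tactile images with a
    vision-language backbone conditioned on the instruction and decode an
    action; here a zero action is returned just to show the interface.
    """

    def predict(self, rgb: np.ndarray, tactile: np.ndarray,
                instruction: str) -> np.ndarray:
        # Placeholder inference: a trained model would fuse the three modalities.
        return np.zeros(6, dtype=np.float32)


if __name__ == "__main__":
    sample = VTLASample(
        rgb_image=np.zeros((224, 224, 3), dtype=np.uint8),
        tactile_image=np.zeros((128, 128, 3), dtype=np.uint8),
        instruction="Insert the peg into the triangular hole.",
        action=np.zeros(6, dtype=np.float32),
    )
    policy = VTLAPolicy()
    delta_pose = policy.predict(sample.rgb_image, sample.tactile_image,
                                sample.instruction)
    print("Predicted delta pose:", delta_pose)
```

In this reading, each simulated trial contributes one such paired record, and the policy consumes the same three input modalities at deployment time, which is what enables the language-conditioned insertion behavior summarized above.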