ICRA 2026 | NUS team led by Lin Shao proposes Goal-VLA: a generative large model acting as a "world model" for zero-shot robotic manipulation
机器之心· 2026-03-30 03:00
Core Insights
- The article discusses the Goal-VLA framework developed by a team from the National University of Singapore, which addresses the challenge of generalization in robotic manipulation by utilizing an object-centric world model, without task-specific fine-tuning or paired action data [3][31].

Group 1: Framework Overview
- Goal-VLA employs a decoupled hierarchical framework that connects high-level semantic reasoning with low-level action control through object goal-state representations [8][31].
- The system operates from natural language instructions and single-view RGB-D images, eliminating the need for pre-scanned maps or known object meshes [8][31].

Group 2: Execution Process
- Goal-VLA executes in three key stages:
  1. **Natural Language Processing**: converts user instructions into detailed visual goals using a text-based VLM and an iterative "Reflection-through-Synthesis" mechanism that checks the physical and semantic feasibility of the generated goal images [12][31].
  2. **Spatial Grounding**: lifts 2D visual goals into precise 3D spatial transformations by extracting pixel-level semantic features and establishing pixel matches between the initial and target frames [14][18].
  3. **Low-level Policy**: converts object goal poses into executable actions, ensuring collision-free trajectories for task execution [18][22].

Group 3: Experimental Results
- In simulations using the RLBench environment, Goal-VLA achieved an average success rate of 59.9% across eight complex tasks, significantly outperforming the MOKA model's 26.0% [21].
- Real-world tests with the UFACTORY X-ARM 7 robotic arm demonstrated a 60% average success rate across four challenging tasks, showcasing the framework's ability to provide precise spatial guidance for complex operations [22][23].
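The iterative "Reflection-through-Synthesis" mechanism in stage 1 above is, at its core, a generate-then-critique loop. The sketch below is a minimal, hypothetical rendering of that control flow; `generate` and `critique` are assumed interfaces standing in for the paper's image-generation model and feasibility-checking VLM, not the actual Goal-VLA API.

```python
def reflect_through_synthesis(instruction, generate, critique, max_iters=3):
    """Generate a candidate goal, ask a critic whether it is physically and
    semantically feasible, and regenerate with the critic's feedback until
    it passes or the iteration budget is exhausted.

    `generate(instruction, feedback)` and `critique(goal)` are hypothetical
    callables: the critic returns (is_feasible, feedback_text).
    """
    goal, feedback = None, None
    for _ in range(max_iters):
        goal = generate(instruction, feedback)   # synthesize a goal image
        ok, feedback = critique(goal)            # reflect on its feasibility
        if ok:
            return goal                          # accepted goal state
    return goal  # best-effort result after max_iters rounds
```

The article's ablation (up to three iterations) corresponds to `max_iters=3`; the loop degrades gracefully to a single generation pass when the first candidate already passes the critic.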
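The spatial-grounding stage above turns pixel matches between the initial and target frames into a 3D object transformation. With RGB-D input, matched pixels can be back-projected to 3D point pairs, and the rigid transform between the two point sets is then recoverable in closed form. The sketch below uses the standard Kabsch/SVD alignment as an illustration of that step; it is not the paper's exact implementation.

```python
import numpy as np

def rigid_transform(P, Q):
    """Estimate rotation R and translation t with Q ~= P @ R.T + t,
    where P and Q are (N, 3) arrays of matched 3D points
    (Kabsch method via SVD of the cross-covariance matrix)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t
```

Given the matches, this yields the object goal pose that the low-level policy consumes; in practice a RANSAC wrapper around such an estimator is common to reject bad pixel correspondences.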
Group 4: Performance Analysis
- An ablation study showed that enhancing input prompts alone increased the success rate by 27.5%, while the complete "Reflection-through-Synthesis" cycle, with up to three iterations, raised the base success rate from 40.0% to 88.8% [24].