Open World Mobile Manipulation
A breakthrough in open-world mobile manipulation: the first multimodal agent for indoor mobile grasping debuts, with the fine-tuned model reaching 90% zero-shot action accuracy in real-world environments
机器之心· 2025-06-20 11:59
Core Insights
- The article presents "OWMM-Agent," a multimodal agent architecture designed for Open World Mobile Manipulation (OWMM) that unifies global scene understanding, robot state tracking, and multimodal action generation [1][5].

Background
- Traditional mobile manipulation robots struggle with open-ended tasks in dynamic environments because they depend on pre-built 3D reconstructions or semantic maps, which are time-consuming to produce and inefficient to maintain [5].
- The OWMM task poses three main challenges: global scene reasoning, embodied decision-making, and system integration to derive low-level control targets from a VLM base model [5].

Multimodal Agent Architecture
- The OWMM problem is modeled as a multi-round, multi-image reasoning and grounding task, allowing the multimodal large model to perform end-to-end perception, reasoning, decision-making, and state updating (a minimal illustrative sketch follows this summary) [6].
- To mitigate the "hallucination" issue in VLM models, the team designed a data synthesis scheme based on the Habitat simulation platform, incorporating long-term environmental memory and transient state memory [8][9].

Experimental Validation
- In simulated environments, the OWMM-VLM model showed clear advantages, and it achieved a 90% zero-shot action generation success rate in real-world tests with the Fetch robot [12].
- The model successfully executed tasks such as moving a soy milk box from a desk to a conference table, demonstrating strong generalization [12].

Future Outlook
- The research establishes that a VLM fine-tuned on large-scale simulated data can serve as a universal foundation model for open-world mobile manipulation [14].
- OWMM-Agent lays the groundwork for general-purpose household robots, potentially enabling voice-commanded home assistance in the near future [15].
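The summary above describes the agent loop only at a high level. As a rough illustration of how a multi-round, multi-image loop with long-term scene memory and transient state memory might be organized, here is a minimal Python sketch; the data structures, the `call_vlm` wrapper, and the action schema are illustrative assumptions and are not taken from the OWMM-Agent paper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SceneMemory:
    """Long-term environmental memory: snapshots of the global scene
    (e.g. views collected once when the robot first surveys the space)."""
    global_views: List[bytes] = field(default_factory=list)


@dataclass
class TransientState:
    """Transient state memory: the latest egocentric observation plus a
    short textual summary of what the agent has done so far."""
    current_view: bytes = b""
    history_summary: str = ""


def call_vlm(images: List[bytes], prompt: str) -> Dict[str, Any]:
    """Hypothetical wrapper around a fine-tuned multimodal model.
    Assumed to return a grounded decision such as
    {"action": "navigate", "target_image_idx": 2, "target_point": [412, 305]}."""
    raise NotImplementedError("plug in a concrete VLM endpoint here")


def owmm_step(task: str, memory: SceneMemory, state: TransientState) -> Dict[str, Any]:
    """One round of the multi-round, multi-image loop:
    perception -> reasoning -> grounded action -> state update."""
    prompt = (
        f"Task: {task}\n"
        f"History: {state.history_summary}\n"
        "Given the global scene images and the current observation, "
        "choose the next action (navigate / pick / place) and ground it "
        "to a point in one of the provided images."
    )
    # The model sees the long-term scene views plus the current observation.
    decision = call_vlm(memory.global_views + [state.current_view], prompt)
    # State update: record the chosen action so the next round can reason
    # over the full interaction history.
    state.history_summary += f" -> {decision.get('action', 'unknown')}"
    return decision
```

In this sketch the long-term memory supplies global context for scene-level reasoning, while the transient state carries the most recent observation and an action history, matching the division of memory described in the article; everything else (field names, prompt wording, return format) is placeholder detail.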