A New Framework for Embodied Scenarios! Embodied-Reasoner: Tackling Complex Embodied Interaction Tasks
具身智能之心·2025-06-21 12:06

Core Viewpoint
The article presents the Embodied-Reasoner framework, which extends deep reasoning capabilities to embodied interactive tasks and addresses challenges unique to that setting, such as multimodal interaction and diverse reasoning patterns [3][7][19].

Group 1: Research Background
- Recent deep reasoning models, such as OpenAI's o1, have shown exceptional capability on mathematical and programming tasks through large-scale reinforcement learning [7].
- However, how effective these models are in embodied domains that require continuous interaction with an environment has not been fully explored [7].
- The research aims to extend deep reasoning capabilities to embodied interactive tasks, tackling challenges such as multimodal interaction and diverse reasoning patterns [7].

Group 2: Embodied Interaction Task Design
- The embodied task targets high-level planning and reasoning, focusing on searching for hidden objects in unknown rooms rather than on low-level motion control [8].
- The task environment is built on the AI2-THOR simulator and features 120 unique indoor scenes and 2,100 objects (a rough interaction-loop sketch follows this summary) [8].
- Four common task types were designed: Search, Manipulate, Transport, and Composite [8].

Group 3: Data Engine and Training Strategy
- A data engine was developed to synthesize diverse reasoning processes, presenting embodied reasoning trajectories in an observe-think-act format (a trajectory-record sketch also follows this summary) [3].
- A three-stage iterative training process was introduced, consisting of imitation learning, rejection sampling tuning, and reflection tuning, which strengthens the model's interaction, exploration, and reflection capabilities [3][19].
- The training corpus contains 9,390 synthesized task instructions with corresponding observe-think-act trajectories, covering 107 indoor scenes and 2,100 interactive objects [12][16].

Group 4: Experimental Results
- The model shows clear advantages over existing advanced models, particularly on complex long-horizon tasks, exhibiting more consistent reasoning and more efficient search behavior [3][18].
- In real-world experiments, Embodied-Reasoner achieved a success rate of 56.7% across 30 tasks, outperforming OpenAI's o1 and o3-mini [17].
- Its success rate exceeds those of OpenAI o1, OpenAI o3-mini, and Claude-3.7-Sonnet-thinking by 9%, 24%, and 13%, respectively [18].

Group 5: Conclusion and Future Work
- The research successfully extends the deep reasoning paradigm to embodied interactive tasks, demonstrating stronger interaction and reasoning capabilities, especially on complex long-horizon tasks [19].
- Future work may apply the model to a wider variety of embodied tasks and improve its generalization and adaptability in real-world environments [19].
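As a rough illustration of the interaction loop on top of the AI2-THOR simulator mentioned in Group 2, the Python sketch below wraps the simulator's step API in an observe-think-act cycle. The `propose_step` function is a hypothetical placeholder for a reasoning model such as Embodied-Reasoner (the article does not specify the model interface); the simulator calls shown (`Controller`, `step`, `event.frame`, `event.metadata`) are standard AI2-THOR APIs.

```python
# A minimal observe-think-act loop over AI2-THOR (`pip install ai2thor`).
# `propose_step` is a hypothetical stand-in for the reasoning model.
from ai2thor.controller import Controller


def propose_step(instruction, frame, history):
    """Placeholder: a real model would reason over the current image and the
    interaction history, returning a thought string and a simulator action."""
    return "Scan the room to locate the target object.", {"action": "RotateRight", "degrees": 90}


def run_episode(instruction, scene="FloorPlan10", max_steps=20):
    controller = Controller(scene=scene, width=640, height=480)
    event = controller.step(action="Done")  # no-op step to obtain the initial observation
    history = []
    for _ in range(max_steps):
        thought, action = propose_step(instruction, event.frame, history)  # observe + think
        event = controller.step(**action)                                  # act
        history.append({"thought": thought, "action": action,
                        "success": event.metadata["lastActionSuccess"]})
        # A reflection-capable model could revise its plan here when an action fails.
    controller.stop()
    return history


if __name__ == "__main__":
    run_episode("Find the mug hidden in the kitchen and put it in the fridge.")
```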
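The observe-think-act trajectories synthesized by the data engine (Group 3) can be pictured as simple per-step records. This is only an illustrative sketch; the field names below are assumptions for exposition, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrajectoryStep:
    """One observe-think-act step; field names are illustrative."""
    observation: str  # e.g. a reference to the egocentric image rendered by the simulator
    thought: str      # synthesized reasoning text (situation analysis, planning, reflection, ...)
    action: str       # high-level action, e.g. "navigate to fridge" or "open fridge"


@dataclass
class EmbodiedTrajectory:
    """A full task trajectory of the kind such a data engine would emit."""
    instruction: str                       # natural-language task, e.g. "find the hidden mug"
    scene_id: str                          # indoor scene identifier, e.g. "FloorPlan10"
    steps: List[TrajectoryStep] = field(default_factory=list)
    success: bool = False                  # whether the task was ultimately completed
```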