ODYSSEY: Zhejiang University's Unified VLN+VLA Framework for Embodied AI
具身智能之心·2025-08-25 00:04

Core Insights
- The article presents the ODYSSEY framework, which integrates hierarchical task planning with terrain-adaptive whole-body control, achieving successful sim-to-real transfer and demonstrating strong generalization across diverse environments and long-horizon tasks [4][38].

Group 1: Research Background
- The framework addresses the limitations of existing mobile-manipulation research, particularly in dynamic and unstructured environments, by proposing a unified mobile manipulation framework that enables quadruped robots to execute long-horizon tasks [5].
- It introduces a hierarchical vision-language planner that decomposes long-horizon instructions into executable actions from egocentric perception, bridging the gap between egocentric perception and language-specified tasks [4][5]. (A hedged sketch of such a planner interface appears after this summary.)

Group 2: Methodology
- The whole-body control policy is defined as a single network that maps a stacked observation vector, combining multiple sensory inputs, to target actions [9]. (A minimal sketch of this mapping appears below.)
- Training proceeds in two stages: the first trains locomotion with the arm held as a static load; the second controls all joints and extends the reward function with an end-effector tracking term [11]. (See the staged-reward sketch below.)

Group 3: Performance Evaluation
- The framework was evaluated on a suite of long-horizon mobile manipulation tasks covering diverse indoor and outdoor scenes, with 246 indoor and 58 outdoor task variations in total [18][20].
- Experimental results show consistent overall improvements across all datasets and finer manipulation than the PerAct baseline, especially on unseen scenarios [17][29].

Group 4: Real-World Application
- ODYSSEY was tested on real-world tasks such as "navigate to grasp" and "grasp and place" with a variety of objects, demonstrating its potential for long-horizon mobile exploration and manipulation [36][37].
- Although it achieved overall success rates above 40% on all tasks, robust perception and high-precision control remain open challenges for seamless real-world deployment [37][38].
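To make the planner's role concrete, here is a minimal sketch of how a hierarchical vision-language planner can decompose a long-horizon instruction into primitive subtasks. The primitive names, the prompt format, and the `query_vlm` callable are all illustrative assumptions; the summary does not specify ODYSSEY's actual skill vocabulary or model interface.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical primitive set; the paper's actual skill vocabulary is not
# given in this summary, so these names are placeholders.
PRIMITIVES = ("navigate_to", "grasp", "place", "open", "push")

@dataclass
class Subtask:
    skill: str   # one of PRIMITIVES
    target: str  # object or location referenced in the instruction

def plan(instruction: str,
         egocentric_caption: str,
         query_vlm: Callable[[str], str]) -> List[Subtask]:
    """Decompose a long-horizon instruction into executable subtasks.

    `query_vlm` stands in for whatever vision-language model the planner
    calls; here it receives a text prompt (the egocentric view is assumed
    to be pre-captioned) and returns lines of the form "skill | target".
    """
    prompt = (
        "You control a quadruped robot with an arm.\n"
        f"Current egocentric view: {egocentric_caption}\n"
        f"Instruction: {instruction}\n"
        f"Reply with one step per line as '<skill> | <target>', "
        f"using only these skills: {', '.join(PRIMITIVES)}."
    )
    steps = []
    for line in query_vlm(prompt).strip().splitlines():
        skill, _, target = (p.strip() for p in line.partition("|"))
        if skill in PRIMITIVES:
            steps.append(Subtask(skill=skill, target=target))
    return steps

# Stubbed VLM call for a dry run:
if __name__ == "__main__":
    fake_vlm = lambda _: "navigate_to | kitchen table\ngrasp | red mug"
    for step in plan("bring me the red mug",
                     "a hallway leading to a kitchen", fake_vlm):
        print(step)
```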
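The single-network whole-body controller described in Group 2 can be pictured as a feed-forward policy that consumes a concatenation of proprioception, a base velocity command, and an end-effector target, and emits joint targets. All dimensions and layer sizes below are assumptions for illustration, not the paper's reported architecture.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the summary does not give the actual
# observation layout, so this split is an assumption.
PROPRIO_DIM = 48   # joint positions/velocities, base orientation, etc.
CMD_DIM     = 3    # base velocity command (vx, vy, yaw rate)
EE_DIM      = 7    # end-effector target pose (position + quaternion)
NUM_JOINTS  = 18   # e.g. 12 leg joints + 6 arm joints

class WholeBodyPolicy(nn.Module):
    """Single network mapping a stacked observation vector to target
    joint positions, as the summary describes for ODYSSEY's controller."""
    def __init__(self):
        super().__init__()
        obs_dim = PROPRIO_DIM + CMD_DIM + EE_DIM
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, NUM_JOINTS),  # joint-position targets
        )

    def forward(self, proprio, command, ee_target):
        # Stack all sensory inputs into one observation vector.
        obs = torch.cat([proprio, command, ee_target], dim=-1)
        return self.net(obs)

policy = WholeBodyPolicy()
action = policy(torch.zeros(1, PROPRIO_DIM),
                torch.zeros(1, CMD_DIM),
                torch.zeros(1, EE_DIM))
print(action.shape)  # torch.Size([1, 18])
```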
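The two-stage curriculum can likewise be sketched as a staged reward: stage 1 rewards base velocity tracking while the arm is held as a static load, and stage 2 adds an end-effector tracking term. The exponential kernels and weights are placeholders, not the paper's exact reward terms.

```python
import numpy as np

def reward(stage: int,
           base_vel, cmd_vel,          # measured vs. commanded base velocity
           ee_pos=None, ee_goal=None,  # end-effector state, stage 2 only
           w_vel=1.0, w_ee=2.0):
    """Two-stage reward schedule following the summary's description.

    Stage 1: only base velocity tracking is rewarded (arm is a static
    load). Stage 2: all joints are trained and an end-effector tracking
    term is added. Kernels and weights here are assumptions.
    """
    r = w_vel * np.exp(-np.sum((base_vel - cmd_vel) ** 2))
    if stage == 2:
        r += w_ee * np.exp(-np.sum((ee_pos - ee_goal) ** 2))
    return r

# Stage 1: locomotion only; stage 2: add end-effector tracking.
r1 = reward(1, np.array([0.4, 0.0]), np.array([0.5, 0.0]))
r2 = reward(2, np.array([0.4, 0.0]), np.array([0.5, 0.0]),
            ee_pos=np.array([0.3, 0.1, 0.5]),
            ee_goal=np.array([0.3, 0.0, 0.5]))
print(round(r1, 3), round(r2, 3))
```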