The NavA³ Framework: Understand Any Instruction, Navigate Anywhere, Find Any Target (Tsinghua University)
具身智能之心·2025-08-08 00:08

Core Insights
- The article introduces the concept of embodied navigation, emphasizing the gap between current research and the complex, open-ended navigation tasks that humans perform in real environments [3][4]
- A new long-range navigation task is proposed, requiring agents to understand high-level human instructions and navigate in real-world settings, leading to the development of a hierarchical framework called NavA³ [4][6]

Research Background and Motivation
- Embodied navigation is essential for agents to move and interact within physical environments, but existing studies focus on predefined object navigation or instruction following, which do not meet the nuanced demands of human navigation [3]

Key Contributions
- A challenging long-range navigation task is introduced, requiring agents to comprehend high-level human instructions and locate objects with complex spatial relationships in indoor environments [6]
- The NavA³ framework is designed to combine global and local strategies for understanding diverse high-level instructions, cross-region navigation, and object localization [11]
- A dataset of 1 million spatial-perception object-affordance samples is constructed to train the NaviAfford model, enabling it to understand complex spatial relationships and point at target objects precisely [11]

Methodology Framework: NavA³
- NavA³ employs a "global to local" hierarchical strategy, integrating semantic reasoning with precise spatial localization to tackle long-range navigation tasks [9]
- The global stage parses instructions and determines the target area using a Reasoning-VLM model, which translates high-level human instructions into executable navigation goals [12]
- The local stage handles exploration within the target area and precise object localization, using the NaviAfford model trained on the spatial-perception dataset [17]

Experimental Validation
- Experiments were conducted across five scenarios with 50 tasks, evaluating performance through navigation error (NE) and success rate (SR); NavA³ outperformed existing methods [22]
- NavA³ achieved an average success rate of 66.4%, significantly higher than the best baseline, MapNav, at 25.2% [23]

Ablation Studies
- Annotation quality had a significant impact: complete annotations improved success rates in specific areas by 28.0% and 36.0% [26]
- The Reasoning-VLM model showed a substantial increase in average success rate when using advanced reasoning capabilities compared to open-source models [27]

Qualitative Analysis
- NavA³ effectively understands spatial relationships and can navigate from complex instructions, showcasing adaptability across different robotic platforms [34]
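The "global to local" hierarchy described in the methodology section can be sketched as a two-stage pipeline: a global reasoner picks a target region from the instruction, then a local pointing model localizes the object inside that region. This is a minimal illustration only; the class names (`ReasoningVLM`, `NaviAfford`) and their interfaces are hypothetical stand-ins, not the paper's actual API.

```python
# Hedged sketch of a global-to-local navigation hierarchy in the spirit of
# NavA³. All names and logic here are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    waypoint: tuple  # (x, y) entry point for the region

class ReasoningVLM:
    """Global stage: map a high-level instruction to a target region."""
    def parse(self, instruction: str, regions: list) -> Region:
        # Placeholder: a real VLM would reason over scene semantics.
        for r in regions:
            if r.name in instruction.lower():
                return r
        return regions[0]

class NaviAfford:
    """Local stage: point at the target object inside the chosen region."""
    def point(self, instruction: str, region: Region) -> tuple:
        # Placeholder: a real model predicts an affordance point from images.
        return region.waypoint

def navigate(instruction: str, regions: list) -> tuple:
    region = ReasoningVLM().parse(instruction, regions)  # global: pick region
    return NaviAfford().point(instruction, region)       # local: locate object

rooms = [Region("kitchen", (4.0, 1.0)), Region("office", (0.0, 6.0))]
goal = navigate("bring me the mug next to the sink in the kitchen", rooms)
```

The key design point is the separation of concerns: the global stage never needs pixel-level precision, and the local stage never needs to reason across regions.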
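To make the spatial-perception affordance data concrete, one training sample might pair an image with a relational instruction and a target point. The schema below is purely an assumption for illustration; the summary does not specify the dataset's actual format.

```python
# Hedged sketch: one hypothetical spatial-perception affordance sample.
# Field names and values are assumptions, not the real dataset schema.
sample = {
    "image": "scene_000123.jpg",                       # hypothetical file name
    "instruction": "the cup to the left of the laptop",
    "relation": "left_of",                             # spatial relation label
    "point": [412, 288],                               # target pixel coordinates
}
```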
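The two evaluation metrics above, navigation error (NE) and success rate (SR), are straightforward to compute from trial outcomes. The sketch below assumes NE is the mean final distance to the goal and SR the fraction of trials ending within a success radius; the field layout and the 1 m threshold are assumptions, as the summary does not give the paper's exact protocol.

```python
import math

def evaluate(trials, success_radius=1.0):
    """Compute (NE, SR) from trials: a list of (final_pos, goal_pos) pairs,
    each position an (x, y) tuple in meters. Threshold is an assumption."""
    errors = [math.dist(final, goal) for final, goal in trials]
    ne = sum(errors) / len(errors)                               # mean error (m)
    sr = sum(e <= success_radius for e in errors) / len(errors)  # success rate
    return ne, sr

# Example: two trials, one ending within the success radius.
ne, sr = evaluate([((0.0, 0.5), (0.0, 0.0)), ((3.0, 0.0), (0.0, 0.0))])
```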