New from Shanghai Jiao Tong University! DyNaVLM: A Zero-Shot, End-to-End Navigation Framework

Core Viewpoint - The article discusses the development of DyNaVLM, a zero-shot, end-to-end navigation framework that integrates vision-language models (VLM) to enhance navigation capabilities in dynamic environments, overcoming limitations of traditional methods [4][5].

Group 1: Introduction and Optimization Goals
- Navigation is a fundamental capability in autonomous agents, requiring spatial reasoning, real-time decision-making, and adaptability to dynamic environments. Traditional methods face challenges in generalization and scalability due to their modular design [4].
- The advancement of VLMs offers new possibilities for navigation by integrating perception and reasoning within a single framework, although their application in embodied navigation is limited by spatial granularity and contextual reasoning capabilities [4].

Group 2: Core Innovations of DyNaVLM
- Dynamic Action Space Construction: DyNaVLM introduces a dynamic action space that allows robots to determine navigation goals based on visual information and language instructions, enhancing movement flexibility in complex environments [6].
- Collaborative Graph Memory Mechanism: Inspired by retrieval-augmented generation (RAG), this mechanism enhances memory management for better navigation performance [8].
- No-Training Deployment Mode: DyNaVLM can be deployed without task-specific fine-tuning, reducing deployment costs and improving generalization across different environments and tasks [8].

Group 3: System Architecture and Methodology
- Problem Formalization: The system takes inputs such as target descriptions and RGB-D observations to determine appropriate actions, maintaining a memory function to extract spatial features [11].
- Memory Manager: This component connects VLM and graph-structured memory, capturing spatial relationships and semantic object information [12].
- Action Proposer and Selector: The action proposer reduces the continuous search space to a set of discrete candidates, while the selector generates the final navigation action from those geometric candidates and the contextual memory [14][15].

Group 4: Experimental Evaluation
- Simulation Environment Evaluation: DyNaVLM achieved a success rate (SR) of 45.0% and a Success weighted by Path Length (SPL) score of 0.232 on ObjectNav benchmarks, outperforming previous VLM frameworks [19][22].
- Real-World Evaluation: DyNaVLM demonstrated superior performance in real-world settings, particularly in tasks requiring the identification of multiple targets, showcasing its robustness and efficiency in dynamic environments [27].
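
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how SR and SPL are commonly computed over a set of evaluation episodes. It uses the standard definition of SPL (each successful episode contributes the ratio of the shortest-path length to the longer of the shortest-path and actual path lengths, averaged over all episodes). The names `Episode`, `success_rate`, and `spl` are hypothetical and chosen for illustration; they are not taken from the DyNaVLM code, and the episode values are made-up numbers rather than results from the article.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One evaluation episode (hypothetical structure, for illustration only)."""
    success: bool         # whether the agent reached the target
    shortest_path: float  # shortest-path length from start to goal, in metres
    agent_path: float     # length of the path the agent actually travelled, in metres


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    if not episodes:
        raise ValueError("at least one episode is required")
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    Failed episodes contribute 0. Successful episodes contribute
    shortest_path / max(agent_path, shortest_path), which is at most 1.
    """
    if not episodes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for e in episodes:
        if e.shortest_path <= 0:
            raise ValueError("shortest_path must be positive")
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


if __name__ == "__main__":
    # Made-up episodes: two efficient successes, one inefficient success, one failure.
    episodes = [
        Episode(success=True, shortest_path=5.0, agent_path=10.0),
        Episode(success=True, shortest_path=4.0, agent_path=4.0),
        Episode(success=False, shortest_path=3.0, agent_path=7.0),
        Episode(success=True, shortest_path=6.0, agent_path=6.0),
    ]
    print(f"SR  = {success_rate(episodes):.3f}")
    print(f"SPL = {spl(episodes):.3f}")
```

Because SPL multiplies success by a path-efficiency ratio, it can never exceed SR. This is why a 45.0% success rate can sit alongside a much lower SPL of 0.232: the agent often reaches the target but by a route longer than the shortest one.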