Zero-Shot Transfer

VLFly: Vision-Language Navigation for Drones Based on Open-Vocabulary Goal Understanding
具身智能之心 · 2025-07-20 01:06
Core Viewpoint
- The article presents VLFly, a vision-language navigation (VLN) framework for drones that supports open-vocabulary goal understanding and zero-shot transfer without task-specific fine-tuning, navigating from natural language instructions and the visual information captured by the drone's monocular camera alone [8][19].

Research Background
- Vision-language navigation matters because it enables robots to execute complex tasks from natural language commands, with applications in home assistance, urban inspection, and environmental exploration [3].
- Existing methods fall short in interpreting high-level semantic intent and in integrating free-form natural language input [9].

Task Definition
- The drone VLN task is formulated as a partially observable Markov decision process (POMDP) consisting of a state space, an action space, an observation space, and state transition probabilities [5]; a conventional formulation matching this description is sketched at the end of this note.

Framework Composition
- VLFly consists of three modules: natural language understanding, cross-modal target localization, and navigable waypoint generation, bridging the gap between semantic instructions and continuous drone control commands [8].

Module Details
- **Instruction Encoding Module**: Converts natural language instructions into structured text prompts using the LLaMA language model [11].
- **Target Retrieval Module**: Uses the CLIP model to select, from a predefined pool, the image most semantically relevant to the text prompt [10] (see the retrieval sketch below).
- **Waypoint Planning Module**: Generates executable waypoint trajectories from the current observation and the retrieved goal image [12].

Experimental Setup
- The framework was evaluated in diverse simulated and real-world environments, demonstrating strong generalization and outperforming all baseline methods [8][18].
- Evaluation metrics were success rate (SR), oracle success rate (OS), success rate weighted by path length (SPL), and navigation error (NE) [12]; the standard SPL definition is reproduced below.

Experimental Results
- VLFly outperformed the baselines on all metrics, particularly in unseen environments, and performed robustly in both indoor and outdoor settings [18].
- The framework achieved a success rate of 83% on direct instructions and 70% on indirect instructions [18].

Conclusion and Future Work
- VLFly is a VLN framework designed specifically for drones, capable of navigating using only the visual information captured by its monocular camera [19].
- Future work includes expanding the waypoint-planning training dataset to support full 3D maneuvers and exploring vision-language models for dynamically identifying target candidates in open-world environments [19].
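The summary names the POMDP components without reproducing the formal tuple. A conventional formulation consistent with that description (the symbols below are standard notation, not taken from the paper) is:

```latex
\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \Omega, \mathcal{T} \rangle, \qquad
\mathcal{T}(s' \mid s, a) = P\!\left(s_{t+1} = s' \mid s_t = s,\; a_t = a\right),
```

where at each step the drone receives an observation $o_t \in \Omega$ (here, the monocular camera image) rather than the true state $s_t \in \mathcal{S}$, and selects an action $a_t \in \mathcal{A}$. A full POMDP definition also carries an observation function and a reward; the summary lists only the four components above.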
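The article does not show how the target retrieval module is implemented; the following is a minimal sketch of CLIP-based text-to-image retrieval using the open-source OpenAI `clip` package. The function name, prompt, and image pool are illustrative assumptions, not VLFly's actual code.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def retrieve_goal_image(prompt: str, image_paths: list[str]) -> str:
    """Return the path of the pool image most similar to the text prompt."""
    text = clip.tokenize([prompt]).to(device)
    images = torch.stack(
        [preprocess(Image.open(p).convert("RGB")) for p in image_paths]
    ).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(text)
        image_feats = model.encode_image(images)
        # Normalize, then rank candidates by cosine similarity to the prompt.
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        sims = (image_feats @ text_feat.T).squeeze(-1)
    return image_paths[sims.argmax().item()]

# Hypothetical usage: pick the goal image for a structured prompt.
# goal = retrieve_goal_image("a red backpack on a bench", pool_paths)
```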
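The article lists SPL among the metrics without defining it. The standard definition from the VLN literature (Anderson et al., 2018), which the paper presumably follows, is:

```latex
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i,\, \ell_i)},
```

where $N$ is the number of episodes, $S_i \in \{0,1\}$ indicates success in episode $i$, $\ell_i$ is the shortest-path distance from start to goal, and $p_i$ is the length of the path actually taken. Navigation error (NE) is the distance between the agent's final position and the goal.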