Vision-Language Navigation (VLN)
Shenzhen University team gets robots to understand instructions and navigate precisely, with a success rate of 72.5% and 40% faster inference
36Kr · 2025-12-10 07:00
Core Insights
- The article covers UNeMo, a new framework for vision-language navigation (VLN) developed by a team led by Professor Li Jianqiang of Shenzhen University in collaboration with Beijing Institute of Technology and Moscow University [1][3].

Group 1: Framework Overview
- UNeMo combines a multi-modal world model (MWM) with a hierarchical predictive feedback navigator (HPFN) to strengthen the navigation agent's decision-making: the agent predicts future visual states from its current visual features and the language instruction [6][12].
- The framework tackles the disconnect between language reasoning and visual navigation, a long-standing challenge in embodied AI [6][8].

Group 2: Performance Metrics
- UNeMo achieves a navigation success rate of 72.5% in unseen environments, outperforming the existing method NavGPT2 (71%) [15].
- Resource consumption drops sharply: GPU memory usage falls from 27GB to 12GB (a 56% reduction) and inference speed improves by 40% [15].

Group 3: Experimental Validation
- On the R2R dataset, UNeMo balances a lightweight configuration with high-performance decision-making, improving path efficiency (SPL) from 60% to 61.3% [15].
- UNeMo is especially strong on long-path navigation: the success rate rises by 5.6% on paths longer than 7 units, versus only 1.2% on shorter paths [17].

Group 4: Scalability and Adaptability
- The framework has been tested across multiple navigation baselines and datasets, demonstrating adaptability and scalability beyond LLM-based systems [20].
- UNeMo's collaborative training architecture lets it transfer to different types of navigation tasks, broadening its practical utility [20].
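The SR and SPL figures quoted above follow the standard VLN definitions: SR is the fraction of episodes where the agent stops successfully near the goal, and SPL weights each success by the ratio of the shortest-path length to the length of the path actually taken. A minimal sketch of both metrics (the `Episode` record is illustrative, not UNeMo's code):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool         # did the agent stop within the success radius?
    shortest_path: float  # geodesic distance from start to goal
    agent_path: float     # length of the path the agent actually took

def success_rate(episodes):
    """SR: fraction of episodes that end in success."""
    return sum(e.success for e in episodes) / len(episodes)

def spl(episodes):
    """SPL: each success weighted by shortest_path / max(agent_path, shortest_path)."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)
```

Under these definitions, a success rate that rises while SPL stays flat means the agent succeeds more often but takes longer detours to do so, which is why the digest reports both numbers.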
Shenzhen University team gets robots to understand instructions and navigate precisely, with a success rate of 72.5% and 40% faster inference | AAAI 2026
Sina Finance · 2025-12-10 06:52
Core Insights
- The UNeMo framework, developed by a team led by Professor Li Jianqiang of Shenzhen University, aims to advance vision-language navigation (VLN) for robots so they can understand commands and navigate accurately in unknown environments [1][18].

Group 1: Framework and Mechanism
- UNeMo uses a dual-module architecture combining a Multi-modal World Model (MWM) and a Hierarchical Predictive Feedback Navigator (HPFN) to bridge the disconnect between visual reasoning and navigation decision-making [5][20].
- The MWM predicts future visual states from the current visual features, the language instruction, and candidate navigation actions, overcoming the limitation of existing methods that reason only about the present [21][22].
- The HPFN uses a two-stage hierarchical mechanism that first generates coarse-grained candidate actions and then refines them against MWM predictions, keeping navigation robust in complex environments [24][26].

Group 2: Performance and Efficiency
- UNeMo delivers large efficiency gains over mainstream methods: GPU memory usage is reduced by 56% (from 27GB to 12GB) and inference speed increases by 40% (from 1.1 seconds to 0.7 seconds) [27][28].
- In unseen test environments, UNeMo reaches a navigation success rate (SR) of 72.5%, beating NavGPT2's 71% by 1.5 percentage points, and lifts path efficiency (SPL) from 60% to 61.3% [28][30].

Group 3: Robustness and Scalability
- UNeMo shows a marked advantage on long-path navigation: SR increases by 5.6% on paths longer than 7 units, versus only 1.2% on shorter paths [30][31].
- The framework's adaptability has been validated across multiple navigation baselines and datasets, demonstrating scalability beyond LLM-based systems [32][33].
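The predict-then-refine loop described above can be sketched as follows. Everything here is a toy illustration of the idea, not the authors' interfaces: `propose`, `predict`, and `refine_score` are hypothetical stand-ins, and the toy world model just shifts a scalar "visual state" by the chosen action.

```python
class ToyMWM:
    """Stand-in for the multi-modal world model: given the current
    visual state and a candidate action, imagine the next state."""
    def predict(self, visual, instruction, action):
        return visual + action  # toy dynamics

class ToyHPFN:
    """Stand-in for the hierarchical predictive feedback navigator."""
    def propose(self, visual, instruction):
        # Stage 1 (coarse): candidate actions from the current observation.
        return [-1, 0, 1]

    def refine_score(self, action, predicted, instruction):
        # Stage 2 (fine): prefer imagined states close to the (toy) goal,
        # here encoded directly in `instruction` as a target scalar.
        return -abs(instruction - predicted)

def select_action(mwm, hpfn, visual_feat, instruction):
    """Coarse proposal, then refinement against world-model predictions."""
    best_action, best_score = None, float("-inf")
    for action in hpfn.propose(visual_feat, instruction):
        predicted_state = mwm.predict(visual_feat, instruction, action)
        score = hpfn.refine_score(action, predicted_state, instruction)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

The point of the structure is that the navigator never commits to a coarse candidate directly; each candidate is first "played forward" through the world model and re-scored against the instruction.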
New SOTA! JanusVLN: dual implicit memory decouples semantics and space, significantly cutting computation and inference overhead
具身智能之心 · 2025-10-07 03:03
Core Insights
- The article introduces JanusVLN, an innovative Vision-Language Navigation (VLN) framework that addresses the limitations of existing methods with a Dual Implicit Memory paradigm, which decouples visual semantics from spatial geometry [2][19].

Background on Current VLN Memory Mechanisms
- Current VLN methods face three main challenges: spatial information is distorted or lost when relying on textual cognitive maps; storing historical image frames makes computation and reasoning inefficient; and ever-growing memory leads to "memory explosion" [3][5].

Key Innovations of JanusVLN
- JanusVLN introduces a Dual Implicit Memory framework inspired by human cognitive science, separating semantic memory from spatial-geometric memory [7][19].
- The framework uses a pre-trained 3D visual geometry model (VGGT) to recover spatial-geometric information from a single RGB video stream, strengthening the model's spatial perception [8][19].
- A hybrid incremental update strategy keeps the memory at a fixed size, substantially improving reasoning efficiency by avoiding redundant computation [8][11].

Methodology Overview
- JanusVLN consists of three main components: a dual-encoder architecture for visual perception, a dual implicit neural memory, and a hybrid incremental update strategy [10][11].
- The dual-encoder architecture pairs a 2D visual-semantic encoder with a 3D spatial-geometric encoder to provide comprehensive scene understanding [11].

Experimental Results
- JanusVLN has been evaluated on two authoritative VLN benchmarks, VLN-CE and RxR-CE, achieving state-of-the-art (SOTA) performance [15].
- The framework excels at spatial-reasoning tasks, completing complex navigation challenges [18][21].

Quantitative Analysis
- JanusVLN improves success rate (SR) substantially, outperforming advanced methods that rely on expensive inputs by 10.5 to 35.5 percentage points [21].
- Against other SOTA methods that use RGB input with explicit memory, JanusVLN gains 3.6 to 10.8 percentage points in SR, validating the effectiveness of the Dual Implicit Memory paradigm [21].
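The article says JanusVLN keeps its memory at a fixed size via a hybrid incremental update, but does not spell out the eviction policy. One common way to realize a fixed-size memory, shown purely as an illustrative assumption, is to pin a few initial frames as long-term anchors and keep a sliding window over the most recent frames:

```python
from collections import deque

class FixedSizeMemory:
    """Toy fixed-size memory: pin the first `n_init` frame features as
    long-term anchors, and keep a sliding window of the `n_recent` most
    recent ones. The pin+window split is an illustrative assumption,
    not JanusVLN's published update rule."""

    def __init__(self, n_init=2, n_recent=4):
        self.n_init = n_init
        self.initial = []
        self.recent = deque(maxlen=n_recent)  # oldest entry evicted automatically

    def add(self, frame_feat):
        if len(self.initial) < self.n_init:
            self.initial.append(frame_feat)
        else:
            self.recent.append(frame_feat)

    def read(self):
        # Total size is bounded by n_init + n_recent regardless of
        # how long the navigation episode runs.
        return self.initial + list(self.recent)
```

Whatever the exact policy, the efficiency claim rests on this bound: reasoning cost per step stays constant instead of growing with the number of frames seen, which is what avoids the "memory explosion" described above.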
AnywhereVLA: Running VLA in real time on consumer-grade hardware
具身智能之心 · 2025-09-29 02:08
Core Background and Objectives
- Mobile manipulation is expanding from closed, structured workcells to open, unstructured large indoor environments, requiring robots to explore unfamiliar, cluttered spaces, interact with diverse objects and people, and respond to natural-language commands for tasks such as home service, retail automation, and warehouse logistics [3].
- AnywhereVLA proposes a modular architecture that combines the robustness of classical navigation with the semantic understanding of VLA models to carry out language-driven pick-and-place tasks in unknown large indoor environments, running in real time on consumer-grade hardware [3].

Review of Existing Solutions: Advantages and Limitations
- VLA models and lightweight optimization strategies are reviewed, highlighting their weak spatial perception and poor adaptability to large environments [4].
- Existing systems such as MoManipVLA and SmolVLA approach the performance of larger models at lower resource cost, but lack the spatial awareness needed for large environments [4].
- The limitations of vision-language navigation (VLN) and classical navigation frameworks are outlined, underscoring the need for better language understanding and semantic reasoning [4].

AnywhereVLA Architecture: Four Core Modules and Workflow
- AnywhereVLA processes natural-language commands through four modules and outputs low-level control commands that drive the base wheels and robotic-arm joints [4].
- The workflow covers language-instruction parsing, guiding VLA operations, constructing a 3D semantic map, and executing manipulation on the identified targets [7].

VLA Model Fine-tuning and Hardware Platform
- The SmolVLA model is fine-tuned to improve its manipulation capability, with the input data and key optimization steps spelled out [13][15].
- The HermesBot mobile manipulation platform is designed specifically for AnywhereVLA, balancing sensing and computational capability [16].

Experimental Results: Performance and Effectiveness Validation
- In an unknown multi-room laboratory setting, 50 pick-and-place tasks were executed with a core success rate of 46%; the fine-tuned SmolVLA manipulation module alone reached 85% [17][22].
- Per-module metrics indicate robust SLAM performance and varying success rates for active environment exploration, navigation, object detection, and VLA manipulation [22].
- On time efficiency, the average task completion time stays under 133 seconds for a 5m exploration radius, meeting real-time requirements [23].
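The modular workflow above (parse the command, navigate via the semantic map, detect the target, hand off to the VLA policy) can be sketched as a simple pipeline. All class and method names here are hypothetical stand-ins for AnywhereVLA's modules; the toy parser just takes the last word of the command as the target object.

```python
class ToyParser:
    def parse(self, command):
        # e.g. "pick up the cup" -> "cup" (toy heuristic, not the real parser)
        return command.rsplit(" ", 1)[-1]

class ToyNavigator:
    def go_to(self, goal):
        # Explore and navigate using the 3D semantic map; return a pose
        # near where the goal object is expected.
        return (1.0, 2.0)

class ToyDetector:
    def find(self, goal, pose):
        return {"object": goal, "pose": pose}

class ToyVLA:
    def manipulate(self, detection):
        # Low-level pick-and-place motion from the fine-tuned VLA policy.
        return f"grasped {detection['object']}"

def run_task(command, parser, navigator, detector, vla):
    goal = parser.parse(command)           # 1. language-instruction parsing
    pose = navigator.go_to(goal)           # 2. exploration / navigation
    detection = detector.find(goal, pose)  # 3. object detection
    return vla.manipulate(detection)       # 4. VLA manipulation
```

A design consequence worth noting: because the stages are sequential, the end-to-end success rate is roughly the product of the per-module rates, which is consistent with the 46% overall versus 85% manipulation-only figures reported above.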
VLN-PE: A physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots (ICCV'25)
具身智能之心 · 2025-07-21 08:42
Core Insights
- The article introduces VLN-PE, a physically realistic platform for Vision-Language Navigation (VLN) that addresses the gap between simulated models and real-world deployment [3][10][15].
- The study reports a significant performance drop (34%) when transferring existing VLN models from simulation to physical environments, underscoring the need for better adaptability [15][30].
- The research identifies how robot type, environmental conditions, and the use of physical controllers affect model performance [15][32][38].

Background
- VLN has become a core task in embodied AI, requiring agents to navigate complex environments from natural-language instructions [6][8].
- Previous models relied on idealized simulations that ignore the physical constraints and challenges real robots face [9][10].

VLN-PE Platform
- VLN-PE is built on GRUTopia, supports multiple robot types, and integrates high-quality synthetic and 3D-rendered environments for comprehensive evaluation [10][13].
- The platform allows seamless integration of new scenes, broadening the scope of VLN research and assessment [10][14].

Experimental Findings
- Existing models show a 34% drop in success rate when moving from simulated to physical environments, revealing a substantial performance gap [15][30].
- Multi-modal robustness matters: RGB-D models outperform RGB-only models under low-light conditions [15][38].
- Training on diverse datasets improves the generalization of VLN models across environments [29][39].

Methodologies
- The article evaluates single-step discrete action-classification models and multi-step continuous prediction methods, highlighting the potential of diffusion strategies in VLN [20][21].
- It also explores the effectiveness of map-based zero-shot large language models (LLMs) for navigation, demonstrating their potential in VLN applications [24][25].

Performance Metrics
- The study uses standard VLN evaluation metrics, including trajectory length, navigation error, and success rate [18][19].
- Additional metrics for physical realism, such as fall rate and stuck rate, are introduced to evaluate robot performance in real-world scenarios [18][19].

Cross-Embodiment Training
- Cross-embodiment training enhances model performance, allowing one unified model to generalize across different robot types [36][39].
- Using data from multiple robot types during training improves adaptability and performance across environments [36][39].
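The physical-realism metrics named above (fall rate, stuck rate) are straightforward to tally alongside success rate once each episode is labeled with its outcome. The label set and log shape here are illustrative assumptions, not VLN-PE's actual format:

```python
def physical_metrics(outcomes):
    """outcomes: one label per episode, assumed to come from
    {'success', 'fall', 'stuck', 'other_failure'}."""
    n = len(outcomes)
    return {
        "success_rate": outcomes.count("success") / n,
        "fall_rate": outcomes.count("fall") / n,      # robot fell over
        "stuck_rate": outcomes.count("stuck") / n,    # robot wedged / immobile
    }
```

Separating falls and stuck episodes from generic failures matters because, as the findings above note, these failure modes barely exist in idealized simulators yet dominate the sim-to-real performance gap on physical robots.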