Vision-Language Navigation (VLN)
Latest SOTA! JanusVLN: Dual Implicit Memory Decouples Semantics and Space, Significantly Reducing Computation and Inference Overhead
具身智能之心· 2025-10-07 03:03
Core Insights
- The article introduces JanusVLN, a Vision-Language Navigation (VLN) framework that addresses the limitations of existing methods through a Dual Implicit Memory paradigm, decoupling visual semantics from spatial geometry [2][19]

Background on Current VLN Memory Mechanisms
- Current VLN methods face three main challenges: spatial information is distorted or lost when relying on textual cognitive maps, storing historical image frames makes computation and reasoning inefficient, and unbounded memory growth leads to "memory explosion" [3][5]

Key Innovations of JanusVLN
- JanusVLN introduces a Dual Implicit Memory framework inspired by human cognitive science, cleanly separating semantic memory from spatial-geometric memory [7][19]
- The framework uses a pre-trained 3D visual geometry model (VGGT) to derive spatial-geometric information from a single RGB video stream, strengthening the model's spatial perception [8][19]
- A hybrid incremental update strategy keeps the memory at a fixed size, markedly improving inference efficiency by avoiding redundant computation (a minimal cache sketch follows this summary) [8][11]

Methodology Overview
- JanusVLN consists of three main components: a dual-encoder architecture for visual perception, a dual implicit neural memory, and the hybrid incremental update strategy [10][11]
- The dual-encoder architecture pairs a 2D visual-semantic encoder with a 3D spatial-geometric encoder to provide comprehensive scene understanding [11]

Experimental Results
- JanusVLN is evaluated on two authoritative VLN benchmarks, VLN-CE and RxR-CE, achieving state-of-the-art (SOTA) performance [15]
- The framework shows strong spatial reasoning, successfully completing complex navigation tasks [18][21]

Quantitative Analysis
- JanusVLN improves success rate (SR) by 10.5 to 35.5 percentage points over advanced methods that rely on expensive extra inputs [21]
- Compared with other SOTA methods using RGB input with explicit memory, JanusVLN improves SR by 3.6 to 10.8 percentage points, validating the effectiveness of the Dual Implicit Memory paradigm [21]
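To make the fixed-size, hybrid incremental memory update concrete, here is a minimal Python sketch assuming a "keep the earliest frames plus a sliding window of recent frames" rule; the class name, buffer sizes, and update rule are illustrative assumptions, not JanusVLN's actual implementation.

```python
from collections import deque

class DualImplicitMemoryCache:
    """Toy fixed-size cache for one implicit memory stream (semantic or spatial).

    Assumed hybrid rule: always keep key/value features from the first
    `n_init` frames plus a sliding window of the `n_recent` most recent ones,
    so memory size stays constant regardless of trajectory length.
    """

    def __init__(self, n_init=4, n_recent=12):
        self.n_init = n_init
        self.init_kv = []                        # features from the earliest frames
        self.recent_kv = deque(maxlen=n_recent)  # sliding window of latest frames

    def update(self, frame_kv):
        # Cache the frame's encoded key/value features rather than raw pixels,
        # so past frames never need to be re-encoded at later decision steps.
        if len(self.init_kv) < self.n_init:
            self.init_kv.append(frame_kv)
        else:
            self.recent_kv.append(frame_kv)

    def as_context(self):
        # Fixed-size context the policy attends to at every step.
        return self.init_kv + list(self.recent_kv)


# One cache per decoupled stream: 2D visual semantics and 3D spatial geometry.
semantic_memory = DualImplicitMemoryCache()
spatial_memory = DualImplicitMemoryCache()
```

The point of the sketch is only that the context handed to the policy never grows with episode length, which is where the claimed efficiency gain over storing full frame histories comes from.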
AnywhereVLA: Running VLA in Real Time on Consumer-Grade Hardware
具身智能之心· 2025-09-29 02:08
Core Background and Objectives
- Mobile manipulation is expanding from closed, structured workcells to open, unstructured large indoor environments, requiring robots to explore unfamiliar and cluttered spaces, interact with diverse objects and humans, and respond to natural-language commands for tasks such as home service, retail automation, and warehouse logistics [3]
- AnywhereVLA proposes a modular architecture that combines the robustness of classical navigation with the semantic understanding of VLA models to perform language-driven pick-and-place in unknown large indoor environments, running in real time on consumer-grade hardware [3]

Review of Existing Solutions: Advantages and Limitations
- VLA models and lightweight optimization strategies are reviewed, highlighting their limited spatial perception and poor adaptability to large environments [4]
- Existing solutions such as MoManipVLA and SmolVLA approach the performance of larger models while reducing resource requirements, but lack the spatial awareness needed for large environments [4]
- The limitations of vision-language navigation (VLN) and classical navigation frameworks are outlined, underscoring the need for stronger language understanding and semantic reasoning [4]

AnywhereVLA Architecture: Four Core Modules and Workflow
- AnywhereVLA processes natural-language commands through four modules and outputs low-level control commands that drive the base wheels and robotic-arm joints [4]
- The workflow covers parsing the language instruction, guiding VLA operation, constructing a 3D semantic map, and executing manipulation on the identified targets (a toy orchestration sketch follows this summary) [7]

VLA Model Fine-tuning and Hardware Platform
- The SmolVLA model is fine-tuned to strengthen its manipulation capabilities, with the input data and key steps for optimizing performance spelled out [13][15]
- The HermesBot mobile manipulation platform is designed specifically for AnywhereVLA, balancing sensing and compute capabilities [16]

Experimental Results: Performance and Effectiveness Validation
- In an unknown multi-room laboratory setting, 50 pick-and-place tasks were executed with an overall success rate of 46%, while the fine-tuned SmolVLA manipulation module alone reached an 85% success rate [17][22]
- Per-module metrics indicate robust SLAM performance and varying success rates for active environment exploration, navigation, object detection, and VLA manipulation [22]
- Time-efficiency measurements show an average task completion time under 133 seconds for a 5 m exploration radius, meeting real-time requirements [23]
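As a rough illustration of the four-module flow, below is a toy orchestration skeleton in Python. The module names, callables, and the PickPlaceCommand schema are hypothetical stand-ins; the summary does not expose AnywhereVLA's real interfaces, so this is a sketch of the described decomposition, not the system's code.

```python
from dataclasses import dataclass

@dataclass
class PickPlaceCommand:
    target_object: str
    destination: str

def run_anywherevla_style_task(command_text, modules):
    """Toy orchestration of the four-module flow described above.

    `modules` is a hypothetical dict of callables (parser, explore_and_map,
    navigate, vla_policy); names and signatures are illustrative only.
    """
    # 1. Parse the natural-language instruction into a structured task.
    cmd: PickPlaceCommand = modules["parser"](command_text)

    # 2. Actively explore and build a 3D semantic map until the target is found.
    target_pose = modules["explore_and_map"](cmd.target_object)

    # 3. Classical navigation drives the base to a pose near the detected object.
    modules["navigate"](target_pose)

    # 4. The fine-tuned VLA policy (e.g., SmolVLA) handles low-level arm control
    #    for the pick-and-place itself.
    return modules["vla_policy"](cmd)
```

The design choice the sketch reflects is the one the summary emphasizes: classical mapping and navigation handle spatial scale, while the VLA model is invoked only for the local manipulation step.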
VLN-PE: A Physically Realistic VLN Platform Supporting Humanoid, Quadruped, and Wheeled Robots (ICCV'25)
具身智能之心· 2025-07-21 08:42
Core Insights
- The article introduces VLN-PE, a physically realistic platform for Vision-Language Navigation (VLN) that addresses the gap between simulated models and real-world deployment [3][10][15]
- The study reports a significant performance drop (34%) when transferring existing VLN models from simulation to physical settings, underscoring the need for better adaptability [15][30]
- The research identifies how robot type, environmental conditions, and the use of physical controllers affect model performance [15][32][38]

Background
- VLN has become a critical task in embodied AI, requiring agents to navigate complex environments from natural-language instructions [6][8]
- Previous models relied on idealized simulations that ignore the physical constraints and challenges faced by real robots [9][10]

VLN-PE Platform
- VLN-PE is built on GRUTopia, supports multiple robot types, and integrates high-quality synthetic and 3D-rendered environments for comprehensive evaluation [10][13]
- The platform allows new scenes to be integrated seamlessly, broadening the scope of VLN research and assessment [10][14]

Experimental Findings
- Existing models show a 34% drop in success rate when moving from simulated to physical environments, revealing a substantial performance gap [15][30]
- Multi-modal robustness matters: RGB-D models hold up better under low-light conditions than RGB-only models [15][38]
- Training on diverse datasets improves the generalization of VLN models across environments [29][39]

Methodologies
- The article evaluates single-step discrete action classification models and multi-step continuous prediction methods, highlighting the potential of diffusion strategies for VLN [20][21]
- It also explores map-based zero-shot large language models (LLMs) for navigation, demonstrating their potential in VLN applications [24][25]

Performance Metrics
- Standard VLN evaluation metrics are used, including trajectory length, navigation error, success rate, and others [18][19]
- Additional metrics such as fall rate and stuck rate are introduced to capture physical realism, which is critical for evaluating robots in real-world scenarios (a small aggregation sketch follows this summary) [18][19]

Cross-Embodiment Training
- Cross-embodiment training improves performance, allowing a single unified model to generalize across robot types [36][39]
- Using data from multiple robot types during training yields better adaptability and performance across environments [36][39]
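To show how the standard metrics and the added physical-realism metrics might be aggregated over evaluated episodes, here is a small Python sketch. The per-episode schema (final_dist, path_len, fell, stuck) and the 3 m success radius are assumptions for illustration, not VLN-PE's actual logging format or thresholds.

```python
import numpy as np

def vln_pe_style_metrics(episodes, success_radius=3.0):
    """Aggregate a few of the metrics named above over a list of episodes.

    Each episode is assumed to be a dict with keys:
      'final_dist' -- metres from the stop position to the goal
      'path_len'   -- total trajectory length in metres
      'fell'       -- bool, robot fell during the episode (physical realism)
      'stuck'      -- bool, robot got stuck during the episode (physical realism)
    """
    nav_error = np.mean([e["final_dist"] for e in episodes])                  # NE
    success = np.mean([e["final_dist"] <= success_radius for e in episodes])  # SR
    traj_len = np.mean([e["path_len"] for e in episodes])                     # TL
    fall_rate = np.mean([e["fell"] for e in episodes])                        # FR
    stuck_rate = np.mean([e["stuck"] for e in episodes])                      # StR
    return {"NE": nav_error, "SR": success, "TL": traj_len,
            "FR": fall_rate, "StR": stuck_rate}
```

The reported 34% sim-to-real drop would then correspond to comparing the SR value computed over simulated rollouts against the SR computed over physically simulated or real rollouts of the same model.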