Vision-Language Navigation (VLN)
Latest SOTA! JanusVLN: Dual Implicit Memory Decouples Semantics and Space, Significantly Reducing Computation and Inference Overhead
具身智能之心· 2025-10-07 03:03
Core Insights
- The article introduces JanusVLN, a Vision-Language Navigation (VLN) framework that addresses the limitations of existing methods through a Dual Implicit Memory paradigm, decoupling visual semantics from spatial geometry [2][19]

Background on Current VLN Memory Mechanisms
- Current VLN methods face three main challenges: spatial information is distorted or lost when relying on textual cognitive maps, storing historical image frames makes computation and reasoning inefficient, and unbounded memory growth leads to "memory explosion" [3][5]

Key Innovations of JanusVLN
- JanusVLN introduces a Dual Implicit Memory framework inspired by human cognitive science, cleanly separating semantic memory from spatial-geometric memory [7][19]
- The framework uses a pre-trained 3D visual geometry model (VGGT) to derive spatial-geometric information from a single RGB video stream, strengthening the model's spatial perception [8][19]
- A hybrid incremental update strategy keeps the memory at a fixed size, markedly improving inference efficiency by avoiding redundant computation (a minimal cache sketch follows this summary) [8][11]

Methodology Overview
- JanusVLN consists of three main components: a dual-encoder architecture for visual perception, a dual implicit neural memory, and the hybrid incremental update strategy [10][11]
- The dual-encoder architecture pairs a 2D visual-semantic encoder with a 3D spatial-geometric encoder to provide comprehensive scene understanding [11]

Experimental Results
- JanusVLN is evaluated on two authoritative VLN benchmarks, VLN-CE and RxR-CE, achieving state-of-the-art (SOTA) performance [15]
- The framework shows strong spatial reasoning, successfully completing complex navigation tasks [18][21]

Quantitative Analysis
- JanusVLN improves success rate (SR) by 10.5 to 35.5 percentage points over advanced methods that rely on expensive extra inputs [21]
- Compared with other SOTA methods using RGB input with explicit memory, JanusVLN improves SR by 3.6 to 10.8 percentage points, validating the effectiveness of the Dual Implicit Memory paradigm [21]
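To make the fixed-size, hybrid incremental memory update concrete, here is a minimal Python sketch assuming a "keep the earliest frames plus a sliding window of recent frames" rule; the class name, buffer sizes, and update rule are illustrative assumptions, not JanusVLN's actual implementation.

```python
from collections import deque

class DualImplicitMemoryCache:
    """Toy fixed-size cache for one implicit memory stream (semantic or spatial).

    Assumed hybrid rule: always keep key/value features from the first
    `n_init` frames plus a sliding window of the `n_recent` most recent ones,
    so memory size stays constant regardless of trajectory length.
    """

    def __init__(self, n_init=4, n_recent=12):
        self.n_init = n_init
        self.init_kv = []                        # features from the earliest frames
        self.recent_kv = deque(maxlen=n_recent)  # sliding window of latest frames

    def update(self, frame_kv):
        # Cache the frame's encoded key/value features rather than raw pixels,
        # so past frames never need to be re-encoded at later decision steps.
        if len(self.init_kv) < self.n_init:
            self.init_kv.append(frame_kv)
        else:
            self.recent_kv.append(frame_kv)

    def as_context(self):
        # Fixed-size context the policy attends to at every step.
        return self.init_kv + list(self.recent_kv)


# One cache per decoupled stream: 2D visual semantics and 3D spatial geometry.
semantic_memory = DualImplicitMemoryCache()
spatial_memory = DualImplicitMemoryCache()
```

The point of the sketch is only that the context handed to the policy never grows with episode length, which is where the claimed efficiency gain over storing full frame histories comes from.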
AnywhereVLA: Running VLA in Real Time on Consumer-Grade Hardware
具身智能之心· 2025-09-29 02:08
Core Background and Objectives
- Mobile manipulation is expanding from closed, structured workcells to open, unstructured large indoor environments, requiring robots to explore unfamiliar and cluttered spaces, interact with diverse objects and humans, and respond to natural-language commands for tasks such as home service, retail automation, and warehouse logistics [3]
- AnywhereVLA proposes a modular architecture that combines the robustness of classical navigation with the semantic understanding of VLA models to perform language-driven pick-and-place in unknown large indoor environments, running in real time on consumer-grade hardware [3]

Review of Existing Solutions: Advantages and Limitations
- VLA models and lightweight optimization strategies are reviewed, highlighting their limited spatial perception and poor adaptability to large environments [4]
- Existing solutions such as MoManipVLA and SmolVLA approach the performance of larger models while reducing resource requirements, but lack the spatial awareness needed for large environments [4]
- The limitations of vision-language navigation (VLN) and classical navigation frameworks are outlined, underscoring the need for stronger language understanding and semantic reasoning [4]

AnywhereVLA Architecture: Four Core Modules and Workflow
- AnywhereVLA processes natural-language commands through four modules and outputs low-level control commands that drive the base wheels and robotic-arm joints [4]
- The workflow covers parsing the language instruction, guiding VLA operation, constructing a 3D semantic map, and executing manipulation on the identified targets (a toy orchestration sketch follows this summary) [7]

VLA Model Fine-tuning and Hardware Platform
- The SmolVLA model is fine-tuned to strengthen its manipulation capabilities, with the input data and key steps for optimizing performance spelled out [13][15]
- The HermesBot mobile manipulation platform is designed specifically for AnywhereVLA, balancing sensing and compute capabilities [16]

Experimental Results: Performance and Effectiveness Validation
- In an unknown multi-room laboratory setting, 50 pick-and-place tasks were executed with an overall success rate of 46%, while the fine-tuned SmolVLA manipulation module alone reached an 85% success rate [17][22]
- Per-module metrics indicate robust SLAM performance and varying success rates for active environment exploration, navigation, object detection, and VLA manipulation [22]
- Time-efficiency measurements show an average task completion time under 133 seconds for a 5 m exploration radius, meeting real-time requirements [23]
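As a rough illustration of the four-module flow, below is a toy orchestration skeleton in Python. The module names, callables, and the PickPlaceCommand schema are hypothetical stand-ins; the summary does not expose AnywhereVLA's real interfaces, so this is a sketch of the described decomposition, not the system's code.

```python
from dataclasses import dataclass

@dataclass
class PickPlaceCommand:
    target_object: str
    destination: str

def run_anywherevla_style_task(command_text, modules):
    """Toy orchestration of the four-module flow described above.

    `modules` is a hypothetical dict of callables (parser, explore_and_map,
    navigate, vla_policy); names and signatures are illustrative only.
    """
    # 1. Parse the natural-language instruction into a structured task.
    cmd: PickPlaceCommand = modules["parser"](command_text)

    # 2. Actively explore and build a 3D semantic map until the target is found.
    target_pose = modules["explore_and_map"](cmd.target_object)

    # 3. Classical navigation drives the base to a pose near the detected object.
    modules["navigate"](target_pose)

    # 4. The fine-tuned VLA policy (e.g., SmolVLA) handles low-level arm control
    #    for the pick-and-place itself.
    return modules["vla_policy"](cmd)
```

The design choice the sketch reflects is the one the summary emphasizes: classical mapping and navigation handle spatial scale, while the VLA model is invoked only for the local manipulation step.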
VLN-PE: A Physically Realistic VLN Platform Supporting Humanoid, Quadruped, and Wheeled Robots (ICCV'25)
具身智能之心· 2025-07-21 08:42
Core Insights
- The article introduces VLN-PE, a physically realistic platform for Vision-Language Navigation (VLN) that addresses the gap between simulated models and real-world deployment [3][10][15]
- The study reports a significant performance drop (34%) when transferring existing VLN models from simulation to physical settings, underscoring the need for better adaptability [15][30]
- The research identifies how robot type, environmental conditions, and the use of physical controllers affect model performance [15][32][38]

Background
- VLN has become a critical task in embodied AI, requiring agents to navigate complex environments from natural-language instructions [6][8]
- Previous models relied on idealized simulations that ignore the physical constraints and challenges faced by real robots [9][10]

VLN-PE Platform
- VLN-PE is built on GRUTopia, supports multiple robot types, and integrates high-quality synthetic and 3D-rendered environments for comprehensive evaluation [10][13]
- The platform allows new scenes to be integrated seamlessly, broadening the scope of VLN research and assessment [10][14]

Experimental Findings
- Existing models show a 34% drop in success rate when moving from simulated to physical environments, revealing a substantial performance gap [15][30]
- Multi-modal robustness matters: RGB-D models hold up better under low-light conditions than RGB-only models [15][38]
- Training on diverse datasets improves the generalization of VLN models across environments [29][39]

Methodologies
- The article evaluates single-step discrete action classification models and multi-step continuous prediction methods, highlighting the potential of diffusion strategies for VLN [20][21]
- It also explores map-based zero-shot large language models (LLMs) for navigation, demonstrating their potential in VLN applications [24][25]

Performance Metrics
- Standard VLN evaluation metrics are used, including trajectory length, navigation error, success rate, and others [18][19]
- Additional metrics such as fall rate and stuck rate are introduced to capture physical realism, which is critical for evaluating robots in real-world scenarios (a small aggregation sketch follows this summary) [18][19]

Cross-Embodiment Training
- Cross-embodiment training improves performance, allowing a single unified model to generalize across robot types [36][39]
- Using data from multiple robot types during training yields better adaptability and performance across environments [36][39]
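To show how the standard metrics and the added physical-realism metrics might be aggregated over evaluated episodes, here is a small Python sketch. The per-episode schema (final_dist, path_len, fell, stuck) and the 3 m success radius are assumptions for illustration, not VLN-PE's actual logging format or thresholds.

```python
import numpy as np

def vln_pe_style_metrics(episodes, success_radius=3.0):
    """Aggregate a few of the metrics named above over a list of episodes.

    Each episode is assumed to be a dict with keys:
      'final_dist' -- metres from the stop position to the goal
      'path_len'   -- total trajectory length in metres
      'fell'       -- bool, robot fell during the episode (physical realism)
      'stuck'      -- bool, robot got stuck during the episode (physical realism)
    """
    nav_error = np.mean([e["final_dist"] for e in episodes])                  # NE
    success = np.mean([e["final_dist"] <= success_radius for e in episodes])  # SR
    traj_len = np.mean([e["path_len"] for e in episodes])                     # TL
    fall_rate = np.mean([e["fell"] for e in episodes])                        # FR
    stuck_rate = np.mean([e["stuck"] for e in episodes])                      # StR
    return {"NE": nav_error, "SR": success, "TL": traj_len,
            "FR": fall_rate, "StR": stuck_rate}
```

The reported 34% sim-to-real drop would then correspond to comparing the SR value computed over simulated rollouts against the SR computed over physically simulated or real rollouts of the same model.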