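SIASUN-Shenzhen University Team Helps Robots Understand Instructions and Navigate Precisely: Success Rate Reaches 72.5%, Inference Efficiency Up 40% | AAAI 2026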

Core Insights
- The UNeMo framework, developed by a team led by Professor Li Jianqiang of Shenzhen University, targets vision-language navigation (VLN): it lets robots understand natural-language instructions and navigate accurately in unfamiliar environments [1][18].

Group 1: Framework and Mechanism
- UNeMo uses a dual-module architecture that combines a Multi-modal World Model (MWM) with a Hierarchical Predictive Feedback Navigator (HPFN) to close the gap between visual reasoning and navigation decision-making [5][20].
- The MWM predicts future visual states from the current visual features, the language instruction, and the candidate navigation actions, which addresses a limitation of existing methods that reason only about the present observation [21][22].
- The HPFN works in two stages: it first generates coarse-grained candidate actions, then refines them using the MWM's predictions, which keeps navigation robust in complex environments [24][26].

Group 2: Performance and Efficiency
- UNeMo is markedly more resource-efficient than mainstream methods: GPU memory usage drops by 56% (from 27 GB to 12 GB), with a reported 40% gain in inference efficiency (inference time falls from 1.1 seconds to 0.7 seconds) [27][28].
- In unseen test environments, UNeMo reaches a navigation success rate (SR) of 72.5%, which is 1.5 percentage points above NavGPT2's 71%, and raises path efficiency (SPL) from 60% to 61.3% [28][30].

Group 3: Robustness and Scalability
- The advantage is largest on long paths: SR increases by 5.6% for paths longer than 7 units, compared with a 1.2% increase for shorter paths [30][31].
- The framework has been validated across several navigation baselines and datasets, which indicates that it scales beyond LLM-based systems [32][33].
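
The two-stage HPFN idea described under Group 1 (shortlist coarse candidates, then refine them with world-model predictions) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in introduced for illustration: the names `select_action`, `coarse_scores`, `predict_future_score`, `top_k`, and the blending weight `alpha` do not come from the paper, and the toy "world model" is a lookup table rather than the learned MWM.

```python
from typing import Callable, Dict, List


def select_action(
    coarse_scores: Dict[str, float],
    predict_future_score: Callable[[str], float],
    top_k: int = 2,
    alpha: float = 0.5,
) -> str:
    """Pick an action in two stages (illustrative only, not the paper's method).

    coarse_scores: hypothetical per-action scores from a coarse policy.
    predict_future_score: hypothetical stand-in for a world model; it rates how
        well the predicted future state after an action matches the instruction.
    """
    # Stage 1: keep only the top_k candidates by coarse score.
    shortlist: List[str] = sorted(
        coarse_scores, key=lambda a: coarse_scores[a], reverse=True
    )[:top_k]

    # Stage 2: re-rank the shortlist by blending the coarse score with the
    # predicted-future score.
    def refined(action: str) -> float:
        return (1 - alpha) * coarse_scores[action] + alpha * predict_future_score(action)

    return max(shortlist, key=refined)


# Toy numbers, made up for the example.
coarse = {"forward": 0.6, "left": 0.5, "right": 0.1}
future = {"forward": 0.2, "left": 0.9, "right": 1.0}

chosen = select_action(coarse, future.__getitem__)
print(chosen)
```

With these toy numbers, "right" is dropped in stage 1 despite its high predicted-future score, and stage 2 prefers "left" over "forward" because the predicted outcome outweighs the small gap in coarse score. Setting `alpha=0.0` disables the feedback and falls back to the coarse choice.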
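
For readers unfamiliar with the two metrics quoted under Group 2, the sketch below computes them on toy data. SR is the fraction of episodes that end in success; SPL (Success weighted by Path Length) additionally scales each success by the ratio of the shortest-path length to the longer of the shortest and actual path lengths, following the standard definition used in embodied-navigation benchmarks. The `Episode` class and the four episodes are made up for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    success: bool          # did the agent stop at the goal?
    shortest_path: float   # shortest-path distance from start to goal
    agent_path: float      # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that succeeded."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            # A detour lowers the credit; failures contribute zero.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Toy episodes, invented for this example.
episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),  # optimal path
    Episode(success=True, shortest_path=10.0, agent_path=20.0),  # 2x detour
    Episode(success=False, shortest_path=8.0, agent_path=5.0),   # failure
    Episode(success=True, shortest_path=6.0, agent_path=8.0),    # mild detour
]

print(f"SR  = {success_rate(episodes):.4f}")
print(f"SPL = {spl(episodes):.4f}")
```

Because SPL can never exceed SR, a method that raises both (as reported above) is succeeding more often without taking longer routes to do so.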