UNeMo
Shenzhen University team enables robots to navigate with precision! Success rate reaches 72.5%, inference efficiency +40%
具身智能之心· 2025-12-11 02:01
Core Insights
- The article discusses the introduction of the UNeMo framework for visual-language navigation (VLN), which significantly improves navigation success rates and reduces resource consumption compared to mainstream methods [4][10][33].

Group 1: UNeMo Framework Overview
- UNeMo integrates a multi-modal world model (MWM) and a hierarchical predictive feedback navigator (HPFN) to address the disconnection between reasoning and decision-making in existing VLN methods [10][33].
- The framework allows navigation agents to predict future visual states based on current visual features and language instructions, enhancing decision-making capabilities (see the interface sketch after this summary) [11][12].

Group 2: Performance Metrics
- In experiments on the R2R dataset, UNeMo achieved a navigation success rate (SR) of 72.5% in unseen environments, surpassing NavGPT2's 71% by 1.5 percentage points [25].
- UNeMo's model parameters are only 30% of those used by NavGPT2, leading to a 56% reduction in GPU memory usage during training and a 40% increase in inference speed [23][24].

Group 3: Robustness in Complex Scenarios
- UNeMo demonstrated a 5.6% increase in SR for long-path navigation (path length ≥ 7), compared to only a 1.2% increase for short-path navigation (path length < 7), indicating its effectiveness in mitigating cumulative errors in long-distance tasks [28][29].

Group 4: Cross-Scenario Adaptability
- The framework was tested across various navigation baselines and datasets, showing improved SR and remote goal success rates (RGS) in unseen scenarios, confirming its adaptability beyond LLM-based systems [31][32].

Group 5: Conclusion
- UNeMo addresses the challenges of high resource consumption and the disconnection between reasoning and decision-making in traditional VLN methods, offering a lightweight yet high-performance solution for practical applications in service robotics and advancing the VLN field [33].
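For a concrete picture of the mechanism described above, the following is a minimal sketch of a world-model interface that conditions on current visual features, a language instruction, and a candidate action to predict the next visual state. It is illustrative only: the class, dimensions, and parameter names are hypothetical and not taken from the UNeMo paper or code.

```python
# Minimal sketch of a multi-modal world model (MWM) interface as described in
# the summary: it conditions on current visual features, the language
# instruction, and a candidate action, and predicts the next visual state.
# All class and parameter names are hypothetical, not from the UNeMo release.
import torch
import torch.nn as nn


class MultiModalWorldModel(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=768, act_dim=32, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vis_dim),  # predicted next visual feature
        )

    def forward(self, vis_feat, instr_feat, action_emb):
        # vis_feat:   (B, vis_dim) current visual features
        # instr_feat: (B, txt_dim) pooled instruction embedding
        # action_emb: (B, act_dim) candidate navigation action embedding
        x = torch.cat([vis_feat, instr_feat, action_emb], dim=-1)
        return self.fuse(x)  # (B, vis_dim) predicted future visual state


# Toy usage: predict the future visual state for a batch of candidate actions.
mwm = MultiModalWorldModel()
vis = torch.randn(4, 768)
instr = torch.randn(4, 768)
act = torch.randn(4, 32)
pred_next = mwm(vis, instr, act)
print(pred_next.shape)  # torch.Size([4, 768])
```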
Shenzhen University team enables robots to understand instructions and navigate with precision, with a success rate of 72.5% and inference efficiency up 40%
36Ke· 2025-12-10 07:00
Core Insights
- The article discusses the introduction of a new framework called UNeMo for visual-language navigation (VLN), developed by a team led by Professor Li Jianqiang from Shenzhen University in collaboration with Beijing Institute of Technology and Moscow University [1][3].

Group 1: Framework Overview
- UNeMo utilizes a multi-modal world model (MWM) and a hierarchical predictive feedback navigator (HPFN) to enhance the decision-making capabilities of navigation agents by allowing them to predict future visual states based on current visual features and language instructions [6][12].
- The framework addresses the disconnection between language reasoning and visual navigation, which has been a significant challenge in embodied AI [6][8].

Group 2: Performance Metrics
- UNeMo achieves a navigation success rate of 72.5% in unseen environments, outperforming the existing method NavGPT2, which has a success rate of 71% [15].
- The framework demonstrates a significant reduction in resource consumption, with GPU memory usage dropping from 27GB to 12GB (a 56% reduction) and inference speed improving by 40% [15].

Group 3: Experimental Validation
- In experiments on the R2R dataset, UNeMo balances a lightweight configuration with high-performance decision-making, improving path efficiency (SPL) from 60% to 61.3% (see the metric sketch after this summary) [15].
- UNeMo exhibits a notable advantage in long-path navigation, with a success rate increase of 5.6% for paths longer than 7 units, compared to only a 1.2% increase for shorter paths [17].

Group 4: Scalability and Adaptability
- The framework has been tested across various navigation baselines and datasets, demonstrating its adaptability and scalability beyond LLM-based systems [20].
- UNeMo's collaborative training architecture allows it to be applied effectively to different types of navigation tasks, enhancing its overall utility in practical applications [20].
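The success rate (SR) and SPL ("path efficiency") figures quoted above are standard VLN metrics. The sketch below computes both under their usual definitions: success means stopping within a distance threshold of the goal, and SPL weights each success by the ratio of shortest-path length to the longer of the shortest and taken paths. The episode values are made up for illustration and are not from the paper.

```python
# Compute SR and SPL for a set of navigation episodes under the standard
# VLN definitions. Episode data below is illustrative only.
from dataclasses import dataclass


@dataclass
class Episode:
    final_dist_to_goal: float  # meters from goal at stop
    shortest_path_len: float   # meters, ground-truth shortest path
    taken_path_len: float      # meters, path actually traversed


def sr_and_spl(episodes, success_threshold=3.0):
    n = len(episodes)
    successes, spl_sum = 0.0, 0.0
    for ep in episodes:
        s = 1.0 if ep.final_dist_to_goal <= success_threshold else 0.0
        successes += s
        # SPL: success weighted by shortest / max(taken, shortest) path length
        spl_sum += s * ep.shortest_path_len / max(ep.taken_path_len, ep.shortest_path_len)
    return successes / n, spl_sum / n


eps = [Episode(1.2, 9.0, 10.5), Episode(4.8, 8.0, 12.0), Episode(0.5, 11.0, 11.0)]
sr, spl = sr_and_spl(eps)
print(f"SR={sr:.3f}, SPL={spl:.3f}")
```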
Shenzhen University team enables robots to understand instructions and navigate with precision! Success rate reaches 72.5%, inference efficiency up 40% | AAAI 2026
Xin Lang Cai Jing· 2025-12-10 06:52
Core Insights
- The UNeMo framework, developed by a team led by Professor Li Jianqiang from Shenzhen University, aims to enhance visual-language navigation (VLN) for robots, allowing them to understand commands and navigate accurately in unknown environments [1][18].

Group 1: Framework and Mechanism
- UNeMo utilizes a dual-module architecture combining a Multi-modal World Model (MWM) and a Hierarchical Predictive Feedback Navigator (HPFN) to address the disconnection between visual reasoning and navigation decision-making [5][20].
- The MWM predicts future visual states based on current visual features, language instructions, and potential navigation actions, overcoming the limitation of existing methods that only reason about the present [21][22].
- The HPFN employs a two-stage hierarchical mechanism to generate coarse-grained candidate actions and refine them based on MWM predictions, ensuring robust navigation in complex environments (see the sketch after this summary) [24][26].

Group 2: Performance and Efficiency
- UNeMo demonstrates significant improvements in resource efficiency, with GPU memory usage reduced by 56% (from 27GB to 12GB) and inference speed increased by 40% (from 1.1 seconds to 0.7 seconds) compared to mainstream methods [27][28].
- In unseen test environments, UNeMo achieves a navigation success rate (SR) of 72.5%, surpassing NavGPT2's 71% by 1.5 percentage points, and improves path efficiency (SPL) from 60% to 61.3% [28][30].

Group 3: Robustness and Scalability
- UNeMo shows a marked advantage in long-path navigation, with SR increasing by 5.6% for paths longer than 7 units, compared to only a 1.2% increase for shorter paths [30][31].
- The framework's adaptability is validated across various navigation baselines and datasets, proving its scalability beyond LLM-based systems [32][33].
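The two-stage hierarchy attributed to HPFN, coarse candidate actions that are then re-scored against world-model predictions, can be illustrated with the PyTorch sketch below. Everything here (class names, the scoring head, the toy world-model stub) is a hypothetical stand-in under that reading of the summary, not the authors' implementation.

```python
# Sketch of a two-stage hierarchical navigator: a coarse policy proposes
# candidate actions, then each candidate is re-scored using the world model's
# predicted future visual state before the final choice. Hypothetical names.
import torch
import torch.nn as nn


class HierarchicalPredictiveNavigator(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=768, act_dim=32, num_actions=6, top_k=3):
        super().__init__()
        self.coarse_policy = nn.Linear(vis_dim + txt_dim, num_actions)
        self.action_embed = nn.Embedding(num_actions, act_dim)
        self.refine_score = nn.Linear(vis_dim, 1)  # scores predicted future states
        self.top_k = top_k

    def forward(self, vis_feat, instr_feat, world_model):
        # Stage 1: coarse candidate actions from current observation + instruction.
        logits = self.coarse_policy(torch.cat([vis_feat, instr_feat], dim=-1))
        cand = logits.topk(self.top_k, dim=-1).indices               # (B, top_k)

        # Stage 2: refine by "imagining" each candidate with the world model.
        scores = []
        for k in range(self.top_k):
            act_emb = self.action_embed(cand[:, k])                  # (B, act_dim)
            pred_state = world_model(vis_feat, instr_feat, act_emb)  # (B, vis_dim)
            scores.append(self.refine_score(pred_state))             # (B, 1)
        scores = torch.cat(scores, dim=-1)                           # (B, top_k)
        best = scores.argmax(dim=-1)                                 # (B,)
        return cand.gather(1, best.unsqueeze(1)).squeeze(1)          # chosen action ids


# Toy world-model stub standing in for an MWM-style predictor.
wm = lambda v, i, a: v + 0.1 * a.mean(dim=-1, keepdim=True)
nav = HierarchicalPredictiveNavigator()
vis, instr = torch.randn(2, 768), torch.randn(2, 768)
print(nav(vis, instr, wm))  # tensor of chosen action indices, shape (2,)
```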