Vision-Language Navigation (VLN)
The first 3DGS-based embodied learning dataset for VLN: Manycore Tech (群核科技) and Zhejiang University open-source SAGE-3D
具身智能之心· 2025-12-25 04:01
Core Insights
- The article covers advances in embodied intelligence, focusing on the SAGE-3D dataset and its implications for vision-language navigation (VLN) tasks. It describes how 3D Gaussian Splatting (3DGS) moves from a pure rendering tool to a functional navigation environment that carries semantic and physical attributes, so that robots can understand and interact with their surroundings [2][3][30].

Group 1: 3DGS Technology and Its Limitations
- Embodied data is regarded as a core asset in robotics, and the ability to generate high-quality data is crucial for competitive advantage [2].
- 3DGS generates realistic 3D point cloud models from real scenes but lacks essential physical information such as area, size, and geometric structure, which limits its use in navigation tasks [2][9].
- SAGE-3D addresses these limitations of traditional 3DGS by providing a navigable environment with physical collision detection, allowing robots to interpret complex instructions and navigate safely [3][10].

Group 2: SAGE-3D Dataset and Its Features
- SAGE-3D consists of two main components: the InteriorGS dataset, with 1,000 finely annotated indoor scenes and over 554,000 object instances, and SAGE-Bench, a VLN benchmark with 2 million trajectory-instruction pairs [13][14].
- The dataset supports a hierarchical instruction generation framework that combines high-level semantic goals with low-level action commands, improving a robot's ability to follow complex instructions [18][22].
- SAGE-3D's hybrid 3DGS representation keeps high-fidelity rendering while embedding physical properties, so robots can interact with the environment without issues such as mesh penetration [22][30].

Group 3: Performance and Evaluation
- Models trained on SAGE-3D, such as NaVILA-SAGE, perform better on VLN tasks, reaching a success rate of 0.46, which the article describes as significantly higher than traditional models [21][23].
- SAGE-Bench introduces new evaluation metrics, such as continuous success rates and collision penalties, that capture finer differences in navigation performance and give a more comprehensive assessment of model capabilities [27][29].
- SAGE-3D shows strong generalization: models trained exclusively on it outperform baseline models in unseen scenarios, which suggests it is effective for real-world applications [26].

Group 4: Future Implications
- SAGE-3D redefines the application boundaries of 3DGS and paves the way for more complex outdoor scenarios and multi-robot collaboration [30][31].
- Integrating semantic and physical capabilities into 3DGS improves robot navigation and also supports the development of more sophisticated embodied intelligence systems [31].
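To make figures such as "a success rate of 0.46" more concrete, the sketch below shows how an episode-level success rate and a simple collision rate are commonly computed in navigation benchmarks. It is a generic illustration, not SAGE-Bench's metric definitions: the `Episode` record, the 3-metre success threshold, and the sample data are all assumptions introduced here, and the benchmark's continuous success rate and collision penalty are not reproduced.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One navigation attempt (hypothetical record layout, for illustration only)."""
    final_distance_to_goal: float  # metres between the stop position and the goal
    collisions: int                # number of collisions logged during the episode


def success_rate(episodes, threshold=3.0):
    """Fraction of episodes that stop within `threshold` metres of the goal.

    The 3.0 m default is an assumed convention, not a value from the article.
    """
    if not episodes:
        return 0.0
    hits = sum(1 for e in episodes if e.final_distance_to_goal <= threshold)
    return hits / len(episodes)


def collision_rate(episodes):
    """Fraction of episodes in which at least one collision occurred."""
    if not episodes:
        return 0.0
    return sum(1 for e in episodes if e.collisions > 0) / len(episodes)


# Made-up sample data, not results from the article.
episodes = [
    Episode(final_distance_to_goal=1.2, collisions=0),
    Episode(final_distance_to_goal=4.8, collisions=2),
    Episode(final_distance_to_goal=2.9, collisions=1),
    Episode(final_distance_to_goal=7.5, collisions=0),
]

print(success_rate(episodes))    # 0.5
print(collision_rate(episodes))  # 0.5
```

A binary success rate like this treats a near miss and a complete failure identically, which is presumably one reason a benchmark would add continuous and collision-aware metrics.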
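Shenzhen University team gets robots to understand instructions and navigate precisely! Success rate reaches 72.5%, inference efficiency up 40% | AAAI 2026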
量子位· 2025-12-10 04:26
Core Insights
- The article introduces UNeMo, a new framework for vision-language navigation (VLN) developed by a team led by Professor Li Jianqiang of Shenzhen University in collaboration with other institutions [1][4].

Group 1: Framework Overview
- UNeMo combines a multi-modal world model (MWM) with a hierarchical predictive feedback navigator (HPFN), letting agents predict future visual states and make informed decisions [3][11].
- The framework addresses the disconnect between language reasoning and visual navigation, a persistent challenge in existing methods [8][9].

Group 2: Performance Metrics
- UNeMo reaches a navigation success rate of 72.5% in unseen environments, ahead of the previous method NavGPT2 at 71% [4][26].
- It is also more resource-efficient: GPU memory usage drops by 56%, from 27 GB to 12 GB, and inference speed improves by 40% [24].

Group 3: Robustness in Complex Scenarios
- UNeMo's advantage is largest in long-path navigation: the success rate rises by 5.6% for paths longer than 7 units, compared with only 1.2% for shorter paths [28][29].
- This gap indicates that UNeMo effectively mitigates cumulative errors in long-distance navigation tasks [30].

Group 4: Scalability and Adaptability
- The framework has been tested across various navigation baselines and datasets, demonstrating adaptability and scalability beyond LLM-based systems [31][33].
- UNeMo's collaborative training architecture lets it perform well across diverse task scenarios, which adds to its overall value [34].
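The headline numbers above can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in the summary; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before, after):
    """Relative decrease from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


# GPU memory: 27 GB -> 12 GB, reported as a 56% reduction.
memory_cut = relative_reduction(27.0, 12.0)

# Success rate in unseen environments: 72.5% (UNeMo) vs 71% (NavGPT2),
# expressed as a difference in percentage points.
sr_gain_points = 72.5 - 71.0

print(f"{memory_cut:.1%}")  # 55.6%
print(sr_gain_points)       # 1.5
```

The memory figure rounds to the reported 56%, and the success-rate gain over NavGPT2 is 1.5 percentage points, so the efficiency improvements are the larger part of the reported advantage.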