Workflow
视觉语言导航(VLN)
icon
Search documents
具身智能之心1v1论文辅导来啦~
具身智能之心· 2025-10-10 03:14
还在为论文选题抓耳挠腮?被数据建模折磨到头秃?面对导师批注手足无措?别慌!具身智能之心,资深导师 团队在线 "救援",一站式解决你的论文烦恼! 论文辅导上线了 【具身智能之心论文辅导重磅上线!多模态大模型/VLA/强化学习/VLN/数据采集/机器人仿真/端到端/diffusion 等顶会方向1V1定制化辅导】 CCF-A到CCF-C SCI一区到四区 EI/中文核心/毕业论文/申博等 具身智能体泛化(跨任务迁移、零样本适应、仿真环境构建) 3D高斯泼溅(3DGS)(实时渲染、动态场景建模、SLAM结合) 端到端具身智能体(决策闭环、多模态传感器融合) 具身合成数据生成(自动标注、域适应、数据增强) 为什么选择我们? ✅ 顶会/顶刊导师团队:来自CMU、Stanford、MIT等国内外名校的PhD及大厂研究员,覆盖ICML、ICLR、 CoRL、ICRA、NeurIPS、CVPR等顶级会议审稿经验。 你是否正在研究以下前沿领域却苦于突破瓶颈? 多模态大模型(视觉-语言预训练、跨模态推理) 视觉语言动作(VLA)(端到端、分层等) 视觉语言导航(VLN)(Embodied QA、指令跟随、场景理解) 机器人抓取与 ...
HA-VLN:具备动态多人互动的视觉语言导航基准与排行榜
具身智能之心· 2025-08-29 16:03
作者丨 Yifei Dong等 点击下方 卡片 ,关注" 具身智能之心 "公众号 动机 :传统VLN系统大多忽视了人类动态和部分可观测性,而现实世界中的导航场景往往涉及动态的人类活动,如人群移动、个人空间需求等。因此,提出了人类 感知的视觉语言导航(HA-VLN)任务,要求智能体在遵循自然语言指令的同时,能够应对动态的人类活动,预测人类运动,尊重个人空间,并调整路径以避免碰 撞。 >> 点击进入→ 具身 智能之心 技术交流群 更多干货,欢迎加入国内首个具身智能全栈学习社区 : 具身智能之心知识星球 (戳我) , 这里包含所有你想要的。 主要贡献 研究背景 人类感知的视觉语言导航任务 任务动机与概述 编辑丨视觉语言导航 HAPS 2.0数据集 动机 :现有的模拟器要么忽视人类行为,要么将人类建模为静态障碍。HA-VLN模拟器通过在离散和连续的3D环境中放置多个动态移动的人类,解决了社会意识导 航中的长期挑战。它具有高保真度的运动、多人互动和现实世界的复杂性,如群体聚会、自发运动和个人空间限制。 概述 :HA-VLN模拟器基于HAPS 2.0数据集,利用486个运动序列,涵盖了室内和室外活动。它提供了两个互补模块 ...
具身智能论文速递 | 强化学习、VLA、VLN、世界模型等~
具身智能之心· 2025-07-08 12:54
Core Insights - The article discusses advancements in Vision-Language-Action (VLA) models through reinforcement learning (RL) techniques, specifically the Proximal Policy Optimization (PPO) algorithm, which significantly enhances the generalization capabilities of these models [2][4]. Group 1: VLA Model Enhancements - The application of PPO has led to a 42.6% increase in task success rates in out-of-distribution (OOD) scenarios [2]. - Semantic understanding success rates improved from 61.5% to 75.0% when encountering unseen objects [2]. - In dynamic interference scenarios, success rates surged from 28.6% to 74.5% [2]. Group 2: Research Contributions - A rigorous benchmark was established to evaluate the impact of VLA fine-tuning methods on generalization across visual, semantic, and execution dimensions [4]. - PPO was identified as superior to other RL algorithms like GRPO and DPO for VLA fine-tuning, with discussions on adapting these algorithms to meet the unique needs of VLA [4]. - An efficient PPO-based fine-tuning scheme was developed, utilizing a shared actor-critic backbone network, VLA model preheating, and minimal PPO training iterations [4]. - The study demonstrated that RL's generalization capabilities in VLA for semantic understanding and entity execution outperformed supervised fine-tuning (SFT), while maintaining comparable visual robustness [4]. Group 3: NavMorph Model - The NavMorph model was introduced as a self-evolving world model for vision-and-language navigation in continuous environments, achieving a success rate of 47.9% in unseen environments [13][15]. - The model incorporates a World-aware Navigator for inferring dynamic representations of the environment and a Foresight Action Planner for optimizing navigation strategies through predictive modeling [15]. - Experiments on mainstream VLN-CE benchmark datasets showed that NavMorph significantly enhanced the performance of leading models, validating its advantages in adaptability and generalization [15].
机器人视觉语言导航进入R1时代!港大联合上海AI Lab提出全新具身智能框架
量子位· 2025-06-25 00:33
Core Insights - The article discusses the advancements in visual language navigation technology, specifically the VLN-R1 model developed by the University of Hong Kong and Shanghai AI Lab, which enables robots to navigate complex environments using natural language instructions without relying on discrete maps [1][3]. Group 1: Performance and Efficiency - VLN-R1 demonstrates strong performance in the VLN-CE benchmark, surpassing the results of larger models with only a 2 billion parameter model after RFT training [2]. - In long-distance navigation tasks, VLN-R1 showcases "cross-domain transfer," achieving superior performance with only 10,000 RxR samples after pre-training on R2R, highlighting its data efficiency [2][15]. Group 2: Innovation in Navigation - The core challenge of visual language navigation (VLN) is to enable agents to autonomously complete navigation tasks based on natural language commands while integrating real-time visual perception [3]. - Traditional navigation systems rely on discrete topological maps, limiting their adaptability to complex environments and dynamic changes [4][5]. Group 3: Training Mechanisms - VLN-R1 employs a two-stage training approach combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) to enhance decision-making capabilities [7]. - The model utilizes a group comparison optimization (GRPO) method to generate multiple action plans for the same instruction, optimizing strategies based on relative performance [7]. - A time decay reward (TDR) mechanism is introduced to prioritize immediate actions, ensuring the model focuses on current obstacles before planning future steps [8][9]. Group 4: Data Set and Memory Management - The VLN-Ego dataset, created using the Habitat simulator, includes 630,000 R2R and 1.2 million RxR training samples, emphasizing first-person perspectives and real-time decision-making [12]. - A long-short term memory sampling strategy is implemented to balance recent experiences with long-term memory, allowing the model to respond effectively to sudden changes in the environment [14]. Group 5: Future Implications - The research indicates that the key to embodied intelligence lies in creating a closed-loop learning system that mimics human perception, decision-making, and action [16]. - The framework's reproducibility and scalability are enhanced with the open availability of the VLN-Ego dataset and training methods, promoting the transition of AI from "digital intelligence" to "embodied cognition" across various applications [16].