Human-Aware Vision-and-Language Navigation (HA-VLN)
HA-VLN: A Vision-and-Language Navigation Benchmark and Leaderboard with Dynamic Multi-Human Interactions
具身智能之心· 2025-08-29 16:03
Core Insights

- The article introduces the Human-Aware Vision-and-Language Navigation (HA-VLN) task, which requires agents to navigate dynamic environments while following natural language instructions, addressing a limitation of traditional Vision-and-Language Navigation (VLN) systems: they often overlook human dynamics and partial observability [6][8][9].

Research Background

- The motivation behind HA-VLN is to make navigation systems account for human dynamics, such as crowd movement and personal-space requirements, which existing systems largely ignore [6][8].
- The HA-VLN benchmark unifies discrete and continuous navigation paradigms under social-awareness constraints, providing standardized task definitions, upgraded datasets, and extensive benchmarking [8][9].

HA-VLN Simulator

- The HA-VLN simulator is built on the HAPS 2.0 dataset, featuring 486 motion sequences, and addresses long-standing challenges in socially aware navigation by simulating multiple dynamic humans in both discrete and continuous 3D environments [12][14].
- The simulator includes two complementary modules, HA-VLN-CE for continuous navigation and HA-VLN-DE for discrete navigation, which share a unified API for consistent human-state queries and dynamic scene updates [12][14] (see the hedged API sketch after this summary).

Human-Awareness Constraints

- The HA-VLN task incorporates dynamic human models that update in real time, requiring agents to respect personal space and adapt to human movements [9][12] (a personal-space check is sketched after this summary).
- The task is framed as a partially observable Markov decision process (POMDP): agents must infer unobserved factors and balance exploration against exploitation to reach their goals efficiently [9][12] (a minimal POMDP formulation also follows this summary).

Real-World Validation and Leaderboard

- The research includes real-world validation with physical robots navigating crowded indoor spaces, demonstrating sim-to-real transferability, and establishes a public leaderboard for comprehensive evaluation [8][34].
- The HA-R2R dataset, an extension of the existing R2R-CE dataset, includes 16,844 carefully curated instructions that emphasize social nuances such as conversations and near-collision events [28][34].

Experimental Results

- The experiments show significant gains when existing models are adapted to the HA-VLN task, with notable improvements in success rate and reductions in collision rate across configurations [40][41] (see the metric sketch after this summary).
- Agents trained on HA-VLN outperform those trained solely on traditional VLN tasks, confirming the robustness of the HA-VLN framework under real-world conditions [51].

Future Work

- Future research will focus on improving agents' ability to predict human behavior and on testing in more complex, dynamic environments, with potential applications in service robotics and autonomous vehicles [51].
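A few illustrative sketches follow for readers who want the technical framing above made concrete. First, the article states that HA-VLN-CE and HA-VLN-DE share a unified API for human-state queries and dynamic scene updates. The Python below is a minimal sketch of what such a shared interface could look like; every name in it (HumanState, HAVLNSimBase, get_human_states, step, run_episode) is hypothetical and not the released simulator's actual API.

```python
from dataclasses import dataclass
from typing import List

# All names below are hypothetical illustrations of a "unified API for
# consistent human-state queries and dynamic scene updates"; they are
# not the actual HA-VLN simulator interface.

@dataclass
class HumanState:
    human_id: int
    position: tuple   # (x, y, z) in world coordinates
    heading: float    # radians
    motion_clip: str  # which HAPS 2.0 motion sequence is playing

class HAVLNSimBase:
    """Shared interface that a continuous (CE) backend and a discrete (DE)
    backend could both implement."""

    def get_human_states(self) -> List[HumanState]:
        """Query the current state of every dynamic human in the scene."""
        raise NotImplementedError

    def step(self, action: str):
        """Advance the agent one action; humans advance along their motion
        sequences at the same time, so the scene stays dynamic."""
        raise NotImplementedError

# Usage: an agent loop that is agnostic to whether the backend is CE or DE.
def run_episode(sim: HAVLNSimBase, policy, max_steps: int = 100):
    for _ in range(max_steps):
        humans = sim.get_human_states()  # same query on either backend
        action = policy(humans)
        sim.step(action)
```

The point of the sketch is the design choice the article describes: one query surface over two navigation paradigms, so agents and metrics can be reused across both.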
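Second, the personal-space requirement reduces, at its simplest, to a per-step distance check against every human. The 1.0 m radius below is an assumed illustrative value, not the benchmark's official threshold, and violates_personal_space is a hypothetical helper.

```python
import math
from typing import Iterable, Tuple

# Assumed personal-space radius in meters; the benchmark's actual
# threshold may differ.
PERSONAL_SPACE_M = 1.0

def violates_personal_space(agent_xy: Tuple[float, float],
                            humans_xy: Iterable[Tuple[float, float]],
                            radius: float = PERSONAL_SPACE_M) -> bool:
    """Return True if the agent is within `radius` meters of any human.

    A per-step check like this is the building block for a collision-rate
    metric: count the steps (or episodes) in which it ever fires.
    """
    ax, ay = agent_xy
    return any(math.hypot(ax - hx, ay - hy) < radius for hx, hy in humans_xy)

# Example: agent at the origin, one human 0.8 m away -> violation.
assert violates_personal_space((0.0, 0.0), [(0.8, 0.0), (5.0, 5.0)])
```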
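Third, the POMDP framing can be written out in standard notation. The tuple and belief update below are the textbook formulation with conventional symbols, not the paper's exact notation.

```latex
% Standard POMDP tuple; symbols are conventional, not the paper's own.
% S: states (agent pose plus dynamic human states), A: actions,
% T: transition kernel, R: reward, Omega: observations (egocentric views),
% O: observation model, gamma: discount factor.
\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma)
\]
% The agent maintains a belief b_t over unobserved factors (e.g., human
% trajectories outside its field of view) and updates it after each step:
\[
b_{t+1}(s') \propto O(o_{t+1} \mid s', a_t)
\sum_{s \in \mathcal{S}} T(s' \mid s, a_t)\, b_t(s)
\]
% The policy maximizes expected discounted return, reaching the goal while
% the reward penalizes personal-space violations:
\[
\pi^* = \arg\max_{\pi}\; \mathbb{E}\!\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)\right]
\]
```

The "infer unobserved factors" phrasing in the summary corresponds to maintaining and updating this belief; "balancing exploration and exploitation" corresponds to choosing actions that both gather information about hidden human motion and make progress toward the goal.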
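Finally, the reported success rates and collision rates can be aggregated per episode as sketched below. Field and function names are hypothetical, and the leaderboard's exact metric definitions (for example, the success radius around the goal) may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EpisodeResult:
    # Hypothetical per-episode log; the leaderboard's actual fields may differ.
    reached_goal: bool         # stopped within the success radius of the goal
    collided_with_human: bool  # personal space violated at any step

def success_rate(results: List[EpisodeResult]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    return sum(r.reached_goal for r in results) / len(results)

def collision_rate(results: List[EpisodeResult]) -> float:
    """Fraction of episodes with at least one human collision."""
    return sum(r.collided_with_human for r in results) / len(results)

# Example: 3 episodes, 2 successes, 1 collision.
results = [EpisodeResult(True, False), EpisodeResult(True, True),
           EpisodeResult(False, False)]
print(success_rate(results))    # 0.666...
print(collision_rate(results))  # 0.333...
```

Under definitions like these, "improvements in success rate and reductions in collision rate" means the two numbers move in opposite directions as agents learn socially aware behavior.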