Small Models Strike Back! Qiu Xipeng's Team at Fudan & 创智 Builds a "World-Aware" Embodied Agent, with Code and Data Fully Open-Sourced!
具身智能之心· 2025-07-16 09:12
Core Viewpoint
- The article introduces the World-Aware Planning Narrative Enhancement (WAP) framework, which significantly improves the performance of large vision-language models (LVLMs) on embodied planning tasks by integrating world knowledge into both the data and the reasoning chain [2][17].

Group 1: Introduction
- LVLMs are becoming central to embodied planning, but existing methods often rely on environment-agnostic imitation learning, which leads to poor performance in unfamiliar scenarios [2].
- The WAP framework raised the success rate on the EB-ALFRED benchmark from 2% to 62.7%, surpassing models such as GPT-4o and Claude-3.5-Sonnet and highlighting the importance of world perception in high-level planning [2][17].

Group 2: Related Work
- WAP differs from existing approaches by explicitly binding instruction-environment context at the data level and relying solely on visual feedback, without privileged information [4].

Group 3: Technical Method
- The framework injects four-dimensional cognitive narratives (visual, spatial, functional, syntactic) into the data layer, allowing the model to understand the environment before reasoning deeply [6].
- It employs closed-loop observation (only RGB images plus instructions) and a three-stage curriculum learning approach to build environmental understanding and long-horizon reasoning capabilities [6][12]; a hedged code sketch of this idea follows the summary below.

Group 4: Experiments
- On EmbodiedBench (EB-ALFRED), the WAP approach significantly improves success rates across task categories, with Qwen2.5-VL gaining 60.7 percentage points in average success rate [14].
- WAP also shows a notable improvement on long-term tasks, reaching a 70% success rate, higher than previous models [14][16].

Group 5: Conclusion and Future Work
- WAP effectively incorporates world knowledge into the data and reasoning processes, allowing smaller open-source LVLMs to outperform commercial models in a purely visual closed-loop setting [17].
- Future work includes extending to dynamic industrial/outdoor scenes and exploring self-supervised narrative evolution for iterative data-model improvement [21].
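To make the data-layer idea in Group 3 more concrete, the following is a minimal, hypothetical Python sketch of how instruction-environment binding along the four narrative dimensions and a three-stage curriculum could be organized. All names here (Sample, enrich, narrator.describe, difficulty_fn, the equal three-way split) are illustrative assumptions drawn only from this summary, not from the team's released code, which may differ substantially.

```python
"""Hypothetical sketch of WAP-style data enrichment and curriculum staging.
Names and interfaces are assumptions based on the summary above, not the
team's open-source implementation."""

from dataclasses import dataclass, field


@dataclass
class Sample:
    instruction: str                 # natural-language task instruction
    rgb_frames: list                 # raw RGB observations only (no privileged state)
    narratives: dict = field(default_factory=dict)
    stage: int = 0                   # curriculum stage assigned below


# The four cognitive-narrative dimensions named in the summary.
DIMENSIONS = ("visual", "spatial", "functional", "syntactic")


def enrich(sample: Sample, narrator) -> Sample:
    """Attach one narrative per dimension, binding the instruction to its
    environment at the data level. `narrator` stands in for whatever model
    or annotation pipeline writes the descriptions; its interface is invented."""
    for dim in DIMENSIONS:
        sample.narratives[dim] = narrator.describe(
            sample.rgb_frames, sample.instruction, dimension=dim
        )
    return sample


def build_curriculum(samples, difficulty_fn):
    """Order enriched samples into three stages, from perception grounding
    to long-horizon reasoning. The equal split by a difficulty score is an
    assumption about how 'three-stage curriculum learning' might be realized."""
    ranked = sorted(samples, key=difficulty_fn)
    third = max(1, len(ranked) // 3)
    stages = [ranked[:third], ranked[third:2 * third], ranked[2 * third:]]
    for i, stage in enumerate(stages):
        for s in stage:
            s.stage = i
    return stages
```

The point of the sketch is the ordering of concerns described in the article: enrich every sample with world-aware narratives first, then schedule training from environment understanding toward long-horizon reasoning, all while the observation space stays limited to RGB frames and instructions.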