Small models strike back! Qiu Xipeng's team at Fudan and the Shanghai Innovation Institute builds a "world-aware" embodied agent
自动驾驶之心 · 2025-07-17 02:19
Core Viewpoint
- The article introduces the World-Aware Planning Narrative Enhancement (WAP) framework, which substantially improves the performance of large vision-language models (LVLMs) on embodied planning tasks by combining four-dimensional cognitive narratives with closed-loop visual observation [3][16].

Group 1: Introduction
- LVLMs are becoming central to embodied planning, but existing methods often rely on environment-agnostic imitation learning, which generalizes poorly to unfamiliar scenarios [3].
- WAP enhances model capability by injecting four-dimensional cognitive narratives (visual, spatial, functional, syntactic) at the data layer, so the model understands its environment before it reasons [3][4]; a sketch of this enrichment step follows the summary below.

Group 2: Technical Methodology
- WAP's main distinction is that it explicitly binds instructions to environmental context, relying solely on visual closed-loop feedback without privileged information [6].
- The framework trains the model through a three-stage curriculum, using only RGB observations and no privileged feedback [12]; illustrative sketches of the curriculum schedule and the closed-loop planning step also follow below.

Group 3: Experimental Results
- On the EB-ALFRED benchmark, the Qwen2.5-VL model's success rate rose from 2% to 62.7% (+60.7 percentage points), surpassing models such as GPT-4o and Claude-3.5 [4][14].
- Success on long-horizon tasks improved from 0% to 70%, indicating the framework's effectiveness in complex planning scenarios [14].
- A case study showed WAP decomposing a complex instruction into manageable steps, whereas baseline models failed to account for implicit conditions [15].

Group 4: Conclusion and Future Work
- WAP integrates "world knowledge" into both the data and the reasoning chain, allowing small-scale open-source LVLMs to outperform commercial models in a purely visual closed-loop setting [16].
- Future work includes enhancing continuous control, expanding to dynamic industrial and outdoor environments, and exploring self-supervised narrative evolution for iterative data-model improvement [17].
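To make the data-layer idea from Group 1 concrete, here is a minimal sketch of attaching one narrative per cognitive dimension to a raw episode. The `Episode` fields, the `annotator` callable, and the prompt wording are all assumptions for illustration; the article does not specify WAP's actual annotation pipeline or prompts.

```python
# Minimal sketch of WAP-style narrative enrichment at the data layer.
# All names here are hypothetical illustrations, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Episode:
    instruction: str              # raw task instruction, e.g. "chill the apple"
    rgb_frames: list              # RGB observations only -- no privileged state
    narratives: dict = field(default_factory=dict)

DIMENSIONS = ("visual", "spatial", "functional", "syntactic")

def enrich(episode: Episode, annotator) -> Episode:
    """Attach one cognitive narrative per dimension, binding the
    instruction to the observed environment context."""
    for dim in DIMENSIONS:
        prompt = (
            f"Describe the {dim} context relevant to the instruction "
            f"'{episode.instruction}' given the current scene."
        )
        # `annotator` stands in for any captioning LVLM; its interface is assumed.
        episode.narratives[dim] = annotator(prompt, episode.rgb_frames[0])
    return episode
```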
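The three-stage curriculum from Group 2 could be scheduled as below. The stage names, the easy/medium/hard split, the `difficulty` tag on each episode, and `model.finetune` are hypothetical; the summary only states that training proceeds in three stages on RGB observations alone.

```python
# Hypothetical three-stage curriculum schedule over the enriched data.
CURRICULUM = [
    ("stage1_grounding",    "easy"),    # e.g. single-step, visible targets
    ("stage2_reasoning",    "medium"),  # e.g. multi-step, implicit conditions
    ("stage3_long_horizon", "hard"),    # e.g. long-horizon compositions
]

def train_with_curriculum(model, episodes):
    """`episodes` are dicts carrying an assumed 'difficulty' tag;
    `model.finetune` stands in for whatever fine-tuning entry point is used."""
    for stage_name, difficulty in CURRICULUM:
        subset = [ep for ep in episodes if ep["difficulty"] == difficulty]
        model.finetune(subset)
        print(f"{stage_name}: trained on {len(subset)} episodes")
    return model
```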
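Finally, the purely visual closed-loop setting can be pictured as the loop below: the planner sees only the instruction, the latest RGB frame, and its own action history, with no privileged simulator state. `model.plan` and `env.step` are assumed interfaces for this sketch, not the paper's API.

```python
# Sketch of a purely visual closed-loop episode: the only feedback the
# planner ever receives from the environment is the next RGB frame.
def run_episode(model, env, instruction, max_steps=50):
    obs = env.reset()                       # obs: an RGB image
    history = []                            # actions taken so far
    for _ in range(max_steps):
        action = model.plan(instruction, obs, history)
        if action == "DONE":                # planner signals completion
            break
        obs, done = env.step(action)        # feedback is the next RGB frame
        history.append(action)
        if done:
            break
    return history
```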