Small models can surpass GPT-4o too: Qiu Xipeng's team's WAP framework builds "world-aware" agents
具身智能之心 · 2025-08-22 00:04

Core Insights

- The article discusses the potential of large vision-language models (LVLMs) in embodied planning tasks, highlighting the challenges they face in unfamiliar environments and with complex multi-step goals [2][6]
- A new framework called World-aware Planning (WAP) is introduced, which enhances LVLMs by instilling four cognitive abilities: visual appearance modeling, spatial reasoning, functional abstraction, and syntax grounding [2][6]
- The enhanced model, built on Qwen2.5-VL, achieved a 60.7% absolute improvement in task success rate on the EB-ALFRED benchmark, with particularly strong gains in common-sense reasoning (+60.0%) and long-term planning (+70.0%) [2][6]

Summary by Sections

Introduction

- The article emphasizes recent breakthroughs in multimodal models but notes the significant challenges they still face in embodied planning tasks [6]

Framework Innovation

- The WAP framework is presented as a novel approach that integrates four key cognitive abilities to improve AI's understanding of the physical world [7]; a sketch of how these abilities might be operationalized follows this summary

Performance Metrics

- The open-source Qwen2.5-VL model significantly outperformed proprietary systems such as GPT-4o and Claude-3.5-Sonnet, marking a substantial leap in performance [2][6][7]

Future Implications

- The advances in embodied planning enabled by the WAP framework open new possibilities for AI applications in real-world scenarios [6][7]
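To make the Framework Innovation point concrete, below is a minimal, hypothetical sketch of how WAP's four cognitive abilities could be operationalized as instruction-augmentation prompts during training-data curation. All names here (`DIMENSIONS`, `augment_instruction`, `query_lvlm`) and the prompt wording are illustrative assumptions, not the paper's actual pipeline or API.

```python
# Hypothetical sketch of WAP-style instruction augmentation.
# Names and prompt templates are illustrative, not the authors' actual API.

from typing import Callable

# The four cognitive dimensions the article attributes to WAP,
# expressed as rewrite prompts (wording invented for illustration).
DIMENSIONS = {
    "visual_appearance": (
        "Rewrite the instruction so it refers to objects only by their "
        "visual appearance (color, shape, material), not by name."
    ),
    "spatial_reasoning": (
        "Rewrite the instruction so the target is specified by its spatial "
        "relation to other objects (left of, behind, on top of)."
    ),
    "functional_abstraction": (
        "Rewrite the instruction so the target is specified by its function "
        "(e.g. 'something to cut with' instead of 'knife')."
    ),
    "syntax_grounding": (
        "Rewrite the instruction with indirect or elliptical phrasing that "
        "the agent must resolve from context."
    ),
}

def augment_instruction(
    base_instruction: str,
    query_lvlm: Callable[[str], str],
) -> dict[str, str]:
    """Produce one world-aware variant of the instruction per dimension.

    `query_lvlm` is a placeholder for whatever LVLM call generates the
    rewrites (e.g. a Qwen2.5-VL completion endpoint).
    """
    variants = {}
    for name, template in DIMENSIONS.items():
        prompt = f"{template}\n\nInstruction: {base_instruction}"
        variants[name] = query_lvlm(prompt)
    return variants

if __name__ == "__main__":
    # Toy stand-in for a real LVLM call, so the sketch runs end to end.
    fake_lvlm = lambda prompt: f"[rewrite of: {prompt.splitlines()[-1]}]"
    for dim, text in augment_instruction(
        "Put the knife in the sink.", fake_lvlm
    ).items():
        print(f"{dim}: {text}")
```

Treating each ability as a separate prompt template keeps such a pipeline model-agnostic: any LVLM endpoint can be dropped in as `query_lvlm`, and the resulting variants can be used to fine-tune an open-source planner along all four dimensions.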