Escape Rooms Become a New Testing Ground for AI: Pass Rates Below 50% Expose Spatial Reasoning Weaknesses | Tsinghua, ICCV25
量子位 (QbitAI) · 2025-07-12 04:57
Core Insights
- The article discusses the rapid development of multimodal large language models (MLLMs) and their capabilities in complex visual reasoning tasks, particularly through a new evaluation platform called EscapeCraft [1][2].

EscapeCraft Environment
- EscapeCraft is a 3D escape room environment designed to assess the reasoning abilities of MLLMs by requiring them to explore, find items, and unlock exits by integrating visual, spatial, and logical information [4][5].
- The platform allows for customizable difficulty levels and supports various tasks such as question answering, logical reasoning, and narrative reconstruction [6][5].

Model Performance Evaluation
- The evaluation focuses on the entire task completion process rather than just the final outcome, assessing whether models can explore autonomously, avoid repeating mistakes, and use tools effectively [16].
- Metrics such as Intent-Outcome Consistency and various interaction ratios are introduced to measure the quality of model interactions and reasoning efficiency [17].

Model Comparison Results
- The study compares several models, including GPT-4o, Gemini-1.5 Pro, and Claude 3.5, revealing that while GPT-4o has the highest escape success rate, it still makes frequent errors as task complexity increases [21][20].
- The results indicate that models often struggle with spatial awareness and decision-making, leading to distinctive failure patterns, such as misjudging which objects are interactive or failing to act on visible clues [22][18].

Conclusion
- EscapeCraft serves as a versatile evaluation platform for future research in intelligent agents, multimodal reasoning, and reinforcement learning, providing a foundation for further advancements in the field [5][4].
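
To make the process-level metrics mentioned under Model Performance Evaluation more concrete, here is a minimal Python sketch of how such scores could be computed from logged episodes. Everything in it is an assumption for illustration: the `Step` and `Episode` schema, the function names, and the exact formulas are hypothetical and are not taken from EscapeCraft or the paper. In particular, Intent-Outcome Consistency is approximated here as the share of successful interactions whose result matched what the model said it intended.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One logged agent step (hypothetical schema, not EscapeCraft's own)."""
    is_interaction: bool         # the agent tried to interact with something
    interaction_succeeded: bool  # the environment reported that it worked
    intent_matched: bool         # the result matched the agent's stated intent


@dataclass
class Episode:
    """One escape attempt: the logged steps plus whether the agent got out."""
    steps: List[Step]
    escaped: bool


def escape_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that ended with a successful escape."""
    if not episodes:
        return 0.0
    return sum(e.escaped for e in episodes) / len(episodes)


def interaction_success_ratio(episodes: List[Episode]) -> float:
    """Successful interactions divided by all interaction attempts."""
    attempts = [s for e in episodes for s in e.steps if s.is_interaction]
    if not attempts:
        return 0.0
    return sum(s.interaction_succeeded for s in attempts) / len(attempts)


def intent_outcome_consistency(episodes: List[Episode]) -> float:
    """Assumed definition: share of successful interactions that matched intent."""
    successes = [
        s for e in episodes for s in e.steps
        if s.is_interaction and s.interaction_succeeded
    ]
    if not successes:
        return 0.0
    return sum(s.intent_matched for s in successes) / len(successes)


# Toy data, invented purely to exercise the functions above.
episodes = [
    Episode(
        steps=[
            Step(True, True, True),
            Step(True, False, False),
            Step(False, False, False),
        ],
        escaped=True,
    ),
    Episode(
        steps=[Step(True, True, False), Step(True, True, True)],
        escaped=False,
    ),
]

print("escape rate:", escape_rate(episodes))
print("interaction success ratio:", interaction_success_ratio(episodes))
print("intent-outcome consistency:", intent_outcome_consistency(episodes))
```

Scoring per step rather than per episode mirrors the article's point that the evaluation looks at the whole completion process: two models with the same escape rate can still differ in how many interactions they waste or how often a success was accidental.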