World-in-World: Johns Hopkins × Peking University Jointly Propose a Closed-Loop Evaluation Framework for Embodied World Models!

Core Insights
- The article emphasizes the need to redefine the evaluation of world models in embodied intelligence, focusing on their practical utility rather than just visual quality [2][23]
- The introduction of the "World-in-World" platform aims to test world models in real embodied tasks through a closed-loop interaction system, addressing the gap between visual quality and task effectiveness [3][23]

Evaluation Redefinition
- Current evaluation systems prioritize visual clarity and scene rationality, often rewarding models that produce high-quality visuals without assessing their decision-making capabilities in real tasks [2][23]
- The article highlights the importance of aligning actions and predictions in embodied tasks, where the model must accurately predict scene changes based on the agent's movements [2][3]

World-in-World Platform Design
- The platform creates a closed-loop system where the agent, world model, and environment interact in a cycle of observation, decision-making, execution, and re-observation [3][6]
- A unified action API is established to standardize input across different world models, ensuring consistent interpretation of action intentions [6][12]

Task Evaluation
- Four types of real-world embodied tasks are selected for comprehensive testing, each with defined scenarios, objectives, and scoring criteria [10][14]
- The platform incorporates post-training techniques to fine-tune models using task-specific data, enhancing their adaptability to real-world tasks [12][23]

Experimental Findings
- Experiments with 12 mainstream world models reveal that task data fine-tuning is more effective than simply using larger pre-trained models, demonstrating significant improvements in success rates [17][20]
- The article notes that models with high visual quality do not necessarily perform better in practical tasks, emphasizing the importance of controllability over visual appeal [18][23]

Recommendations for Future Development
- The article suggests focusing on improving controllability, utilizing task data for low-cost enhancements, and addressing the shortcomings in physical modeling for operational tasks [23][22]
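
The closed-loop cycle of observation, decision-making, execution, and re-observation, together with a shared action vocabulary, can be illustrated with a toy sketch. This is not the World-in-World implementation: the names `Action`, `ToyEnv`, `ToyWorldModel`, `choose_action`, and `run_episode` are all hypothetical, and the 1-D "walk to the goal cell" task is an assumption made only to keep the example runnable.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical unified action vocabulary shared by every world model."""
    LEFT = -1
    STAY = 0
    RIGHT = 1


@dataclass
class ToyEnv:
    """Stand-in environment: an agent on a 1-D line trying to reach a goal cell."""
    position: int = 0
    goal: int = 3

    def observe(self) -> int:
        return self.position

    def step(self, action: Action) -> int:
        # Execute the action for real and return the new observation.
        self.position += action.value
        return self.position


class ToyWorldModel:
    """Stand-in world model: predicts the next observation for a given action."""

    def predict(self, observation: int, action: Action) -> int:
        return observation + action.value


def choose_action(model, observation: int, goal: int) -> Action:
    """Pick the action whose predicted outcome lands closest to the goal."""
    return min(Action, key=lambda a: abs(model.predict(observation, a) - goal))


def run_episode(env, model, max_steps: int = 10):
    """Closed loop: observe, decide with the world model, execute, re-observe."""
    observation = env.observe()
    for step in range(max_steps):
        if observation == env.goal:
            return True, step
        action = choose_action(model, observation, env.goal)
        observation = env.step(action)
    return observation == env.goal, max_steps


success, steps = run_episode(ToyEnv(), ToyWorldModel())
print(success, steps)
```

In this sketch the episode is scored by task success rather than by how good the predictions look, which mirrors the article's point: a world model whose predictions do not follow the commanded action would steer the agent wrongly no matter how plausible each predicted frame appears.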