How Should World Models Be Evaluated When They Are More Than Just "Video"? WorldLens Proposes a New Practical Evaluation Framework
机器之心·2025-12-23 09:36

Core Viewpoint
- The article introduces WorldLens, a new evaluation framework that assesses existing open-source world models across five dimensions: Generation, Reconstruction, Action-Following, Downstream Tasks, and Human Preference. It addresses the limitations of traditional video quality metrics by focusing on the attributes needed for practical use in simulation, planning, data synthesis, and closed-loop decision-making [3][10][34].

Group 1: Evaluation Framework
- WorldLens is described as the first systematic framework to evaluate world models along all five of these dimensions [3].
- The framework aims to go beyond video quality alone and to measure the stability, consistency, and usability of what world models generate in real-world applications [10][12].

Group 2: Aspects of Evaluation
- Aspect 1 (Generation): checks whether the generated visuals are credible in terms of objects, time, semantics, geometry, and multi-view consistency, rather than visual fidelity alone [15].
- Aspect 2 (Reconstruction): assesses whether the generated world can be reconstructed into a stable 4D structure that remains consistent and accurate when viewed from new viewpoints [16].
- Aspect 3 (Action-Following): evaluates whether planners can actually use the generated world, particularly under closed-loop conditions, where errors accumulate and can lead to failures [19].
- Aspect 4 (Downstream Tasks): tests how useful the synthetic data is for real-world tasks. Visually appealing models do not necessarily perform well in practice, with reported performance drops of 30-50% [20].
- Aspect 5 (Human Preference): incorporates human judgment into the evaluation through a dataset of subjective assessments of credibility, reasonableness, and safety [22][23].

Group 3: Insights and Implications
- Models show significant capability gaps across aspects: a model that excels in one aspect may perform poorly in others, so world-model capability does not improve uniformly across dimensions [26].
- Geometric and temporal stability are common bottlenecks that affect multiple aspects, which underlines the importance of structural coherence in world models [27][28].
- Human assessments can be structured and learned, which offers a path to improving world models through preference alignment [31].

Group 4: Conclusion
- As world models move from generating visually appealing clips to constructing interactive worlds, evaluation must evolve to cover the attributes of the world itself, and the article presents WorldLens as a crucial tool for that shift [34].
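As a rough illustration of two quantitative points in the summary (the 30-50% downstream performance drop and the capability gaps across aspects), the sketch below shows how such figures could be computed. It is not WorldLens code: the helper names `relative_drop` and `best_per_aspect`, the model names, and every score are hypothetical placeholders, since the summary does not describe the framework's actual metrics.

```python
from typing import Dict


def relative_drop(real_score: float, synthetic_score: float) -> float:
    """Fraction of task performance lost when synthetic data replaces real data.

    Hypothetical helper: assumes higher scores are better and real_score > 0.
    """
    if real_score <= 0:
        raise ValueError("real_score must be positive")
    return (real_score - synthetic_score) / real_score


def best_per_aspect(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return, for each aspect, the name of the model with the highest score."""
    aspects = next(iter(scores.values())).keys()
    return {a: max(scores, key=lambda m: scores[m][a]) for a in aspects}


# Made-up scores in [0, 1] for two imaginary models (not real results).
HYPOTHETICAL_SCORES = {
    "model_a": {
        "generation": 0.90,
        "reconstruction": 0.55,
        "action_following": 0.40,
        "downstream": 0.50,
        "human_preference": 0.70,
    },
    "model_b": {
        "generation": 0.70,
        "reconstruction": 0.75,
        "action_following": 0.60,
        "downstream": 0.65,
        "human_preference": 0.60,
    },
}

if __name__ == "__main__":
    # With these placeholder numbers, no single model leads every aspect.
    print(best_per_aspect(HYPOTHETICAL_SCORES))
    # A made-up downstream score falling from 0.80 (real) to 0.48 (synthetic).
    print(f"relative drop: {relative_drop(0.80, 0.48):.0%}")
```

Reporting one leader per aspect, rather than a single averaged score, keeps the kind of cross-aspect gap described in Group 3 visible instead of averaging it away.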
