A New Survey of World Models in Embodied AI: 250 Papers Mapping the Mainstream Frameworks and Tasks

Core Insights
- The article discusses the concept of world models in embodied AI, emphasizing their role as internal simulators that help agents perceive environments, take actions, and predict future states [1][2].

Group 1: World Models Overview
- The research on world models has seen unprecedented growth due to the explosion of generative models, leading to a complex array of architectures and techniques lacking a unified framework [2].
- A novel three-axis classification method is proposed to categorize existing world models based on their functionality, temporal modeling, and spatial representation [6].

Group 2: Mathematical Principles
- World models are typically modeled as partially observable Markov decision processes (POMDPs), focusing on learning compact latent states from partial observations and the transition dynamics between states [4].
- The training paradigm for world models often employs a "reconstruction-regularization" approach, which encourages the model to reconstruct observations from latent states while aligning posterior inference with prior predictions [9].

Group 3: Functional Positioning
- World models can be categorized into decision-coupled and general-purpose types, with the former optimized for specific decision tasks and the latter serving as task-agnostic simulators [6][15][16].
- Decision-coupled models, like the Dreamer series, excel in task performance but may struggle with generalization due to their task-specific representations [15].
- General-purpose models aim for broader predictive capabilities and transferability across tasks, though they face challenges in computational complexity and real-time inference [16].

Group 4: Temporal Modeling
- Temporal modeling can be divided into sequential reasoning and global prediction, with the former focusing on step-by-step simulation and the latter predicting entire future sequences in parallel [20][23].
- Sequential reasoning is beneficial for closed-loop control but may suffer from error accumulation over long predictions [20].
- Global prediction enhances computational efficiency and reduces error accumulation but may lack detailed local dynamics [23].

Group 5: Spatial Representation
- Various strategies for spatial representation include global latent vectors, token feature sequences, spatial latent grids, and decomposed rendering representations [25][28][34][35].
- Global latent vectors compress scene states into low-dimensional variables, facilitating real-time control but potentially losing fine-grained spatial information [28].
- Token feature sequences allow for detailed representation of complex scenes but require extensive data and computational resources [29].
- Spatial latent grids maintain local topology and are prevalent in autonomous driving, while decomposed rendering supports high-fidelity image generation but struggles with dynamic scenes [34][35].

Group 6: Data Resources and Evaluation Metrics
- Data resources for embodied AI can be categorized into simulation platforms, interactive benchmarks, offline datasets, and real robot platforms, each serving distinct purposes in training and evaluating world models [37].
- Evaluation metrics focus on pixel-level generation quality, state/semantic consistency, and task performance, with recent trends emphasizing physical compliance and causal consistency [40].
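
The error-accumulation point made under Temporal Modeling can be illustrated with a toy example. The sketch below is not taken from the survey: the scalar dynamics, the coefficients 0.9 and 0.92, and the function names `true_step`, `learned_step`, `rollout_errors`, and `one_step_errors` are all assumptions chosen for illustration. It compares a sequential rollout, in which the model consumes its own predictions, against one-step prediction from ground-truth states, and measures the state error at each step.

```python
def true_step(x: float) -> float:
    """Ground-truth toy dynamics (assumed for illustration)."""
    return 0.9 * x + 1.0


def learned_step(x: float) -> float:
    """Hypothetical learned model with a slightly wrong coefficient."""
    return 0.92 * x + 1.0


def rollout_errors(x0: float, horizon: int) -> list[float]:
    """Sequential rollout: the model is fed its own previous prediction.

    Returns the absolute state error at each step of the horizon.
    """
    errors = []
    x_true, x_pred = x0, x0
    for _ in range(horizon):
        x_true = true_step(x_true)
        x_pred = learned_step(x_pred)  # prediction builds on earlier predictions
        errors.append(abs(x_pred - x_true))
    return errors


def one_step_errors(x0: float, horizon: int) -> list[float]:
    """One-step prediction: the model is fed the true state at every step.

    Returns the absolute one-step error at each step of the horizon.
    """
    errors = []
    x_true = x0
    for _ in range(horizon):
        errors.append(abs(learned_step(x_true) - true_step(x_true)))
        x_true = true_step(x_true)  # always continue from the true state
    return errors


if __name__ == "__main__":
    seq = rollout_errors(0.0, 20)
    one = one_step_errors(0.0, 20)
    for t, (a, b) in enumerate(zip(seq, one), start=1):
        print(f"step {t:2d}  rollout error {a:.4f}  one-step error {b:.4f}")
```

In this toy setup the rollout error grows with the horizon because each step inherits the error of the previous one, while the one-step error stays limited to the per-step model mismatch. This is only the sequential side of the comparison; it does not implement a global (parallel) predictor.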