RoboScape: A Physics-Informed Embodied World Model with a 68.3% Improvement in Action Controllability
具身智能之心·2025-07-02 10:18

Core Viewpoint
- The article presents RoboScape, a physics-informed embodied world model that improves video generation quality by integrating physical knowledge into the modeling process, addressing the weak physical perception and object-manipulation limitations of existing models [4][23].

Research Background and Core Issues
- Existing embodied world models show significant limitations in physical perception, particularly in contact-rich robot scenarios, which leads to unrealistic object deformation and motion discontinuities [4].
- Current attempts to integrate physical knowledge fall into three categories: physical prior regularization, knowledge distillation from physics simulators, and material field modeling, each with its own limitations [4].

Core Method
- The core idea is to learn an embodied world model as a dynamics function that predicts the next visual observation from past observations and robot actions (a minimal interface sketch is given after this summary) [5].

Robot Data Processing Pipeline
- A four-step processing pipeline builds a multimodal dataset with physical priors on top of the AGIBOT-World dataset [6].

RoboScape Model Architecture
- The model uses an autoregressive Transformer framework for controllable robot video generation and injects physical knowledge through two auxiliary tasks, temporal depth prediction and keypoint dynamics learning, described in the following sections [8].

Temporal Depth Prediction
- A temporal depth prediction branch strengthens 3D geometric consistency through a dual-branch cooperative autoregressive Transformer (see the dual-branch sketch after this summary) [10].

Adaptive Keypoint Dynamics Learning
- The model tracks contact-driven keypoints in a self-supervised manner to implicitly encode material properties, adaptively focusing on the most active keypoints according to motion amplitude (a hypothetical selection step is sketched after this summary) [11].

Joint Training Objective
- The overall training objective combines the individual loss terms so as to balance the contributions of the different components (illustrated after this summary) [13].

Experimental Validation
- Performance is evaluated along three dimensions, appearance fidelity, geometric consistency, and action controllability, with results surpassing the baseline models [15].

Dataset and Implementation Details
- The dataset comprises 50,000 video segments covering 147 tasks and 72 skills; training runs on 32 NVIDIA A800 GPUs for five epochs [16].

Downstream Application Validation
- In robot policy training, policies trained on the model's synthetic data perform close to policies trained on real data, indicating that the synthetic data is effective for complex tasks [19].

Conclusion and Future Plans
- RoboScape integrates physical knowledge into video generation without relying on external physics engines; future work plans to couple the generative world model with real robots for further validation in practical scenarios [23][24].
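To make the "world model as a dynamics function" formulation above concrete, the following is a minimal sketch of an autoregressive interface that predicts the next visual observation from past observations and robot actions. The class name, tensor shapes, tokenization, and backbone configuration are all assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class WorldModelSketch(nn.Module):
    """Minimal sketch of f: (o_1..o_t, a_1..a_t) -> o_{t+1}.

    Observations are assumed to already be per-frame embeddings (e.g. from a
    video tokenizer) and actions are low-dimensional robot commands; all
    dimensions here are illustrative.
    """

    def __init__(self, obs_dim=256, act_dim=7, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.obs_in = nn.Linear(obs_dim, d_model)
        self.act_in = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.obs_out = nn.Linear(d_model, obs_dim)

    def forward(self, obs_seq, act_seq):
        # obs_seq: (B, T, obs_dim), act_seq: (B, T, act_dim)
        x = self.obs_in(obs_seq) + self.act_in(act_seq)  # fuse observation and action per step
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.backbone(x, mask=causal)                # each step attends only to the past
        return self.obs_out(h)                           # per-step prediction of the next observation


# Illustrative usage: 2 clips, 16 frames, 7-DoF actions.
model = WorldModelSketch()
pred = model(torch.randn(2, 16, 256), torch.randn(2, 16, 7))  # -> (2, 16, 256)
```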
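The temporal depth prediction branch is described only at a high level. As a rough, hypothetical reading, a dual-branch design can be seen as one shared autoregressive backbone feeding separate RGB and depth heads, with the depth head supervised by precomputed depth targets. The layer choices and the L1 loss below are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchHeads(nn.Module):
    """Hypothetical dual-branch output: RGB tokens and depth tokens are
    decoded from the same shared backbone states (backbone omitted here)."""

    def __init__(self, d_model=512, rgb_dim=256, depth_dim=64):
        super().__init__()
        self.rgb_head = nn.Linear(d_model, rgb_dim)
        self.depth_head = nn.Linear(d_model, depth_dim)

    def forward(self, hidden):
        # hidden: (B, T, d_model) states from the shared autoregressive backbone
        return self.rgb_head(hidden), self.depth_head(hidden)

def depth_aux_loss(pred_depth, target_depth):
    """Stand-in auxiliary loss that ties predicted depth tokens to
    (assumed) precomputed depth targets to encourage 3D consistency."""
    return F.l1_loss(pred_depth, target_depth)
```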
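For adaptive keypoint dynamics learning, the summary states only that the most active, contact-driven keypoints are selected by motion amplitude and tracked in a self-supervised way. A hypothetical selection and loss step is sketched below; the point tracker, the number of keypoints, and the MSE loss are assumptions.

```python
import torch
import torch.nn.functional as F

def select_active_keypoints(tracks: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Pick the k keypoints with the largest motion amplitude.

    tracks: (N, T, 2) pixel trajectories of N candidate keypoints over T frames
    (e.g. from an off-the-shelf point tracker). Returns the indices of the k
    most active keypoints, which would then receive the dynamics loss.
    """
    step = tracks[:, 1:] - tracks[:, :-1]         # per-frame displacement, (N, T-1, 2)
    amplitude = step.norm(dim=-1).sum(dim=-1)     # total path length per keypoint, (N,)
    return torch.topk(amplitude, k=min(k, tracks.size(0))).indices

def keypoint_dynamics_loss(pred_tracks, gt_tracks, active_idx):
    """Supervise only the selected contact-driven keypoints."""
    return F.mse_loss(pred_tracks[active_idx], gt_tracks[active_idx])
```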
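Finally, the joint training objective is described as balancing the contributions of the different components. A generic weighted-sum form such as the one below captures that idea; the loss terms and weights are placeholders, not the paper's actual values.

```python
def joint_loss(l_video, l_depth, l_keypoint, w_depth=0.5, w_keypoint=0.5):
    """Illustrative joint objective: video-generation loss plus weighted
    physics-informed auxiliary losses. Weights are placeholders."""
    return l_video + w_depth * l_depth + w_keypoint * l_keypoint
```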