RoboScape Model
RoboScape: A Physics-Informed Embodied World Model with a 68.3% Gain in Action Controllability
具身智能之心· 2025-07-02 10:18
Author: Yu Shang et al. | Editor: 具身智能之心

Research Background and Core Issues

In embodied intelligence, world models act as powerful simulators: they can generate realistic robot videos and alleviate data scarcity. Existing models, however, show significant limitations in physical awareness. In contact-rich robot scenarios in particular, the lack of modeling of 3D geometry and motion dynamics means the generated videos often exhibit unrealistic object deformation or motion discontinuities.

The root cause is that current models over-rely on fitting visual tokens and lack awareness of physical knowledge. Previous attempts to inject physical knowledge fall into three categories: physics-prior regularization (confined to narrow domains such as human motion or rigid-body dynamics), knowledge distillation from physics simulators (cascaded pipelines that are computationally heavy), and material-field modeling (restricted to object-level modeling and hard to extend to scene-level generation). How to integrate physical knowledge within a single, efficient framework is therefore the core open problem.

Core Methodology

Problem definition: focusing on robotic manipulation scenarios, the goal is to learn an embodied world model $\mathcal{W}$ as a dynamics function that predicts the next visual observation $\hat{o}_{t+1}$ from the past observations $o_{1:t}$ and robot actions $a_{1:t}$:

$$\hat{o}_{t+1} = \mathcal{W}(o_{1:t},\, a_{1:t})$$
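To make the problem definition concrete, below is a minimal PyTorch-style sketch of a world model used as such a dynamics function. All class, argument, and dimension choices here (including the GRU backbone standing in for the paper's autoregressive Transformer) are illustrative assumptions, not RoboScape's actual architecture.

```python
import torch
import torch.nn as nn


class EmbodiedWorldModel(nn.Module):
    """Hypothetical interface: predict the next visual observation o_{t+1}
    from past observations o_{1:t} and robot actions a_{1:t}."""

    def __init__(self, obs_dim: int = 256, act_dim: int = 7, hidden: int = 512):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, hidden)
        self.act_proj = nn.Linear(act_dim, hidden)
        # A small causal sequence model stands in for the paper's autoregressive Transformer.
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, obs_dim)

    def forward(self, obs_seq: torch.Tensor, act_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (B, T, obs_dim) latent observation tokens; act_seq: (B, T, act_dim) actions
        x = self.obs_proj(obs_seq) + self.act_proj(act_seq)
        h, _ = self.backbone(x)
        return self.head(h[:, -1])     # predicted latent for o_{t+1}


# One-step rollout on dummy data.
model = EmbodiedWorldModel()
obs = torch.randn(2, 8, 256)   # two trajectories, eight past frames encoded as latents
act = torch.randn(2, 8, 7)     # the matching 7-DoF action sequence
next_obs = model(obs, act)     # shape (2, 256)
```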
Latest from Tsinghua University! RoboScape: A Physics-Informed Embodied World Model with a 68.3% Gain in Action Controllability
具身智能之心· 2025-07-02 07:44
Core Insights
- The article discusses the limitations of existing embodied intelligence models in physical perception, particularly in robot scenarios involving contact, highlighting the need for better integration of physical knowledge into these models [3][20].

Research Background and Core Issues
- Current models rely heavily on visual token fitting and lack physical knowledge awareness, leading to unrealistic object deformation and motion discontinuities in generated videos [3].
- Previous attempts to integrate physical knowledge have been limited to narrow domains or complex pipelines, indicating the need for a unified and efficient framework [3].

Core Methodology
- The focus is on learning an embodied world model as a dynamics function that predicts the next visual observation from past observations and robot actions [4].
- A four-step processing pipeline is designed to create a multimodal dataset with physical priors, utilizing the AGIBOT-World dataset [5].

Data Processing Pipeline
- The pipeline includes physical attribute annotation, video slicing, segment filtering, and segment classification to ensure effective training data [5][8].

Temporal Depth Prediction
- A dual-branch collaborative autoregressive Transformer (DCT) is introduced to enhance 3D geometric consistency, ensuring causal generation through temporal and spatial attention layers [7]; a schematic sketch of such a block appears after this summary.

Adaptive Keypoint Dynamics Learning
- The model employs self-supervised tracking of contact-driven keypoints to implicitly encode material properties, improving the modeling of object deformation and motion patterns [8].

Joint Training Objectives
- The overall training objective integrates several loss terms to balance the contributions of the different components [10]; a hedged sketch of such a weighted combination is given at the end of this summary.

Experimental Validation
- The model's performance is evaluated on appearance fidelity, geometric consistency, and action controllability, demonstrating superior results compared to baseline models [12][18].

Dataset and Implementation Details
- The study uses the AgiBotWorldBeta dataset, comprising 50,000 video segments across 147 tasks, and employs advanced models for comparison [13].

Downstream Application Validation
- The model is effective for training robot policies, achieving performance close to that of training on real data, indicating the utility of generated data for complex tasks [16].

Conclusion and Future Plans
- RoboScape effectively integrates physical knowledge into video generation without relying on external physics engines, with plans to combine generative world models with real robots for further validation [20][21].
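The summary mentions a dual-branch collaborative autoregressive Transformer with temporal and spatial attention but gives no architectural detail. The block below is only a schematic of how an RGB branch and a depth branch could keep separate spatial attention while sharing causal temporal attention; every module name, tensor shape, and the fusion scheme are assumptions made for illustration, not RoboScape's actual design.

```python
import torch
import torch.nn as nn


class DualBranchBlock(nn.Module):
    """Schematic only: per-branch spatial attention over patch tokens, then
    shared causal temporal attention so generation stays autoregressive in time."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.spatial_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # rgb, depth: (B, T, N, D) -- frames x patch tokens x channels
        B, T, N, D = rgb.shape

        def spatial(x, attn):
            x = x.reshape(B * T, N, D)          # attend within each frame
            out, _ = attn(x, x, x)
            return out.reshape(B, T, N, D)

        rgb = spatial(rgb, self.spatial_rgb)
        depth = spatial(depth, self.spatial_depth)

        # Causal mask: frame t may only attend to frames <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        fused = (rgb + depth).mean(dim=2)       # (B, T, D) per-frame summary token
        ctx, _ = self.temporal(fused, fused, fused, attn_mask=mask)
        return rgb + ctx.unsqueeze(2), depth + ctx.unsqueeze(2)


# Toy shapes: batch 1, 4 frames, 16 patch tokens of width 256 per frame.
blk = DualBranchBlock()
rgb_out, depth_out = blk(torch.randn(1, 4, 16, 256), torch.randn(1, 4, 16, 256))
```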
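Similarly, the "Joint Training Objectives" item says several loss terms are balanced without giving the exact formulation. The sketch below shows one plausible weighted combination of appearance, depth, and keypoint terms; the weights and the simple L1/L2 forms are assumptions, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F


def joint_training_loss(pred_rgb, gt_rgb, pred_depth, gt_depth, pred_kpts, tracked_kpts,
                        w_rgb: float = 1.0, w_depth: float = 0.5, w_kpt: float = 0.1):
    """Hypothetical weighted sum over the three branches discussed above:
    appearance (RGB), temporal depth (geometry), and contact keypoint dynamics."""
    loss_rgb = F.mse_loss(pred_rgb, gt_rgb)        # appearance fidelity
    loss_depth = F.l1_loss(pred_depth, gt_depth)   # 3D geometric consistency
    loss_kpt = F.l1_loss(pred_kpts, tracked_kpts)  # keypoints from self-supervised tracking
    total = w_rgb * loss_rgb + w_depth * loss_depth + w_kpt * loss_kpt
    return total, {"rgb": loss_rgb.item(), "depth": loss_depth.item(), "kpt": loss_kpt.item()}


# Dummy call with made-up shapes.
total, parts = joint_training_loss(
    torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64),   # predicted / ground-truth RGB
    torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64),   # predicted / ground-truth depth
    torch.randn(2, 10, 2), torch.randn(2, 10, 2),           # predicted / tracked keypoints
)
```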