Training data slashed to 1/1200! Tsinghua & Shengshu release a domestic video-based embodied foundation model, reaching SOTA in efficient generalization to complex physical manipulation
量子位 (QbitAI) · 2025-07-25 05:38

Core Viewpoint

- The article discusses the breakthrough of the Vidar model developed by Tsinghua University and Shengshu Technology, which enables robots to learn physical operations from ordinary video, achieving a significant leap from virtual training to real-world execution [3][27].

Group 1: Model Development and Capabilities

- Vidar builds on a base model called Vidu, which is pre-trained on internet-scale video data and further trained with millions of heterogeneous robot videos, allowing it to generalize quickly to new robot types with only 20 minutes of real robot data [4][10].
- The model addresses the data scarcity and extensive multimodal data requirements of current vision-language-action (VLA) models, significantly reducing the data needed for large-scale generalization [5][6].
- Vidar's architecture pairs a video diffusion model, which predicts task-specific videos, with an inverse dynamics model that decodes those videos into robotic arm actions [7][11].

Group 2: Training Methodology

- The embodied pre-training method proposed by the research team integrates a unified observation space, extensive embodied data pre-training, and minimal target-robot fine-tuning to achieve precise control in video tasks [10].
- The model's performance was validated on the VBench video generation benchmark, showing significant improvements in subject consistency, background consistency, and imaging quality after embodied data pre-training [11][12].

Group 3: Action Execution and Generalization

- The introduction of task-agnostic actions allows for easier data collection and generalization across tasks, eliminating the need for human supervision and annotation [13][15].
- The automated task-agnostic random actions (ATARA) method enables collecting training data for previously unseen robots in just 10 hours, enabling generalization across the full action space [15][18].
- Vidar demonstrated superior success rates on 16 common robotic tasks, particularly excelling in generalization to unseen tasks and backgrounds [25][27].

Group 4: Future Implications

- The advancements made by Vidar lay a solid technical foundation for future service robots to operate effectively in complex real-world environments such as homes, hospitals, and factories [27].
- The model represents a critical bridge between virtual algorithm training and real-world autonomous action, deepening the integration of AI into physical tasks [27][28].
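The two-stage pipeline summarized above (a video diffusion model predicts a task video, and an inverse dynamics model decodes it into arm actions) can be sketched as follows. This is a minimal illustrative stub under toy assumptions: "frames" are plain numbers, "actions" are frame differences, and none of the function names correspond to Vidar's actual code or Vidu's API.

```python
# Hypothetical sketch of a predict-then-decode control loop:
# stage 1 rolls out future observation frames for a task, and
# stage 2 (an inverse dynamics model, IDM) recovers the action
# linking each pair of consecutive frames.

def predict_task_video(current_frame, task_prompt, horizon=4):
    """Stand-in for the video diffusion model: returns a list of
    predicted future frames. Here each 'frame' is just a number."""
    return [current_frame + t + 1 for t in range(horizon)]  # dummy rollout

def inverse_dynamics(frame_t, frame_t1):
    """Stand-in for the IDM: infers the action that transforms
    frame_t into frame_t1. Here the 'action' is the difference."""
    return frame_t1 - frame_t

def plan_actions(current_frame, task_prompt):
    frames = [current_frame] + predict_task_video(current_frame, task_prompt)
    # Decode each adjacent frame pair into one low-level action.
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]

actions = plan_actions(0, "pick up the cup")
print(actions)  # one action per predicted frame transition
```

The design point the article highlights is that only the small IDM needs target-robot data (the reported 20 minutes of fine-tuning), while the video model is shared and pre-trained on internet-scale and heterogeneous robot video.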