Vidar Model
Major release! Tsinghua × Shengshu unveil Vidar, a general-purpose robot foundation model that efficiently generalizes complex physical operations at SOTA level
具身智能之心 · 2025-07-27 09:37
Core Insights
- The collaboration between Tsinghua University and Shengshu Technology has produced the Vidar model, a breakthrough in embodied intelligence that carries few-shot generalization from virtual video understanding through to real-world physical execution [2][4].

Group 1: Vidar Model Overview
- Vidar is the world's first multi-view embodied base model to systematically transfer video understanding capabilities into physical decision-making, sharply reducing the data needed for robot generalization [4][8].
- The model generalizes to a new robot body with only 20 minutes of real-robot data, roughly 1/80 of the data required by the leading industry baseline RDT and 1/1200 of that required by π0.5, lowering the data threshold for large-scale generalization [4][8].

Group 2: Data Pyramid and Training Methodology
- Vidar's training rests on a three-tier data pyramid: vast general video data at the base, medium-scale embodied video data in the middle, and a small amount of robot-specific data at the top, enabling effective training and generalization [8][12].
- A unified observation space, built by stitching multi-view videos together, lets massive internet data and specific robot tasks share a single representation (a minimal sketch of the idea follows this summary) [14].

Group 3: Performance Metrics and Results
- After embodied pre-training, the Vidu model showed significant improvements in subject consistency, background consistency, and imaging quality, which underpin its few-shot generalization [13].
- Vidar achieved superior success rates across 16 common robotic tasks, excelling at generalizing to unseen tasks and backgrounds while adhering closely to task instructions [27][29].

Group 4: Automation and Efficiency
- The Automated Task-Agnostic Random Actions (ATARA) method collects task-agnostic action data without human involvement; about 10 hours of automated collection suffice for full action-space generalization on a new robot [16].
- The AnyPos model applies high-precision action prediction to raise execution accuracy, reaching a success rate close to 100% in real-world trajectory-replay tests and surpassing baselines by 33-44% [18][22].
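The unified observation space is described only at a high level in the summary above. As a rough illustration, the sketch below shows one plausible way multi-view stitching could map robots with different camera counts into a fixed-shape observation; the function name, slot count, and tile size are assumptions for illustration, not details from the paper.

```python
import numpy as np

def stitch_views(frames, num_slots=3, tile_hw=(240, 320)):
    """Tile up to `num_slots` camera views side by side on a fixed-size canvas.
    Robots with fewer cameras leave unused slots black, so every embodiment
    yields an observation of identical shape (hypothetical layout)."""
    th, tw = tile_hw
    canvas = np.zeros((th, tw * num_slots, 3), dtype=np.uint8)
    for i, frame in enumerate(frames[:num_slots]):
        # Nearest-neighbour resize via index sampling, to keep the sketch
        # dependency-free; a real pipeline would use a proper image resize.
        h, w = frame.shape[:2]
        rows = np.arange(th) * h // th
        cols = np.arange(tw) * w // tw
        canvas[:, i * tw:(i + 1) * tw] = frame[rows][:, cols]
    return canvas

# A two-camera robot and a three-camera robot both map to (240, 960, 3).
two_cam = [np.zeros((480, 640, 3), dtype=np.uint8)] * 2
print(stitch_views(two_cam).shape)  # (240, 960, 3)
```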
Training data slashed to 1/1200! Tsinghua & Shengshu release a homegrown video-based embodied foundation model that efficiently generalizes complex physical operations at SOTA level
量子位 · 2025-07-25 05:38
Core Viewpoint
- The article covers the breakthrough Vidar model developed by Tsinghua University and Shengshu Technology, which lets robots learn physical operations from ordinary video, a significant leap from virtual training to real-world execution [3][27].

Group 1: Model Development and Capabilities
- Vidar builds on the Vidu base model, which is pre-trained on internet-scale video data and further trained on millions of heterogeneous robot videos, allowing it to generalize to new robot types with only 20 minutes of real-robot data [4][10].
- The model tackles the data scarcity and heavy multimodal-data demands of current vision-language-action (VLA) models, sharply reducing the data needed for large-scale generalization [5][6].
- Architecturally, a video diffusion model predicts a task-specific video, which an inverse dynamics model then decodes into robotic-arm actions (see the first sketch after this summary) [7][11].

Group 2: Training Methodology
- The team's embodied pre-training method combines a unified observation space, large-scale embodied-data pre-training, and minimal fine-tuning on the target robot to achieve precise control in video-defined tasks [10].
- On the VBench video-generation benchmark, embodied pre-training yielded significant improvements in subject consistency, background consistency, and imaging quality [11][12].

Group 3: Action Execution and Generalization
- Task-agnostic actions make data easier to collect and generalize across tasks, removing the need for human supervision and annotation [13][15].
- The automated task-agnostic random actions (ATARA) method gathers training data for a previously unseen robot in roughly 10 hours, enabling full action-space generalization (see the second sketch after this summary) [15][18].
- Vidar achieved superior success rates on 16 common robotic tasks, excelling at generalization to unseen tasks and backgrounds [25][27].

Group 4: Future Implications
- These advances lay a solid technical foundation for future service robots operating in complex real-world environments such as homes, hospitals, and factories [27].
- Vidar acts as a critical bridge between virtual algorithm training and real-world autonomous action, deepening the integration of AI into physical tasks [27][28].
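The two-stage control loop summarized in Group 1 (predict a task video, then decode it into actions) can be sketched as follows. The model interfaces, stub classes, and dimensions are assumptions for illustration; the article does not specify any API.

```python
import numpy as np

class StubVideoDiffusion:
    """Stand-in for the Vidu-based video diffusion model (interface assumed)."""
    def generate(self, obs, instruction, num_frames):
        # Pretend to "imagine" the task: return noisy copies of the observation.
        return [obs + np.random.randn(*obs.shape).astype(np.float32)
                for _ in range(num_frames)]

class StubInverseDynamics:
    """Stand-in for the inverse dynamics model mapping frame pairs to actions."""
    def predict(self, prev_frame, next_frame, action_dim=7):
        return np.zeros(action_dim)  # a real IDM would regress joint deltas

def run_episode(diffusion, idm, obs, instruction, horizon=8):
    """The two-stage loop the article describes: predict a task video,
    then decode each consecutive frame pair into an arm action."""
    frames = diffusion.generate(obs, instruction, num_frames=horizon)
    actions, prev = [], obs
    for frame in frames:
        actions.append(idm.predict(prev, frame))
        prev = frame
    return actions

obs = np.zeros((240, 320, 3), dtype=np.float32)
actions = run_episode(StubVideoDiffusion(), StubInverseDynamics(),
                      obs, "place the cup on the plate")
print(len(actions))  # 8
```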
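ATARA is described only as automated, task-agnostic random-action collection. One plausible reading, sketched below under that assumption, samples random joint-space waypoints within safe limits and interpolates between them so (state, action) pairs accumulate with no task labels or human annotation; the waypoint scheme and all parameters are hypothetical.

```python
import numpy as np

def atara_like_rollout(joint_low, joint_high, num_waypoints=5,
                       steps_per_segment=20, seed=0):
    """Illustrative task-agnostic random-action collection: sample random
    joint-space waypoints within limits and linearly interpolate between
    them. Everything here is an assumption about ATARA, not its spec."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(joint_low), np.asarray(joint_high)
    waypoints = rng.uniform(low, high, size=(num_waypoints, low.size))
    trajectory = []
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        for t in np.linspace(0.0, 1.0, steps_per_segment, endpoint=False):
            trajectory.append((1 - t) * a + t * b)  # linear joint interpolation
    return np.stack(trajectory)

# Example: a 6-DoF arm with symmetric joint limits.
traj = atara_like_rollout([-1.0] * 6, [1.0] * 6)
print(traj.shape)  # (80, 6): 4 segments × 20 steps
```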