Major Release: Tsinghua × Shengshu Unveil Vidar, a General-Purpose Robot Foundation Model with Efficient, SOTA-Level Generalization to Complex Physical Manipulation
具身智能之心· 2025-07-27 09:37
Core Insights
- A breakthrough in embodied intelligence comes from a collaboration between Tsinghua University and Shengshu Technology: the Vidar model, which uses few-shot generalization to carry capabilities learned from virtual video over to real-world physical execution [2][4].

Group 1: Vidar Model Overview
- Vidar is described as the world's first multi-view embodied foundation model, systematically transferring video-understanding capabilities to physical decision-making and significantly reducing the data needed for robot generalization [4][8].
- The model can generalize to a new robot body with only 20 minutes of real-robot data, roughly 1/80 of the data required by the leading baseline RDT and 1/1200 of that required by π0.5, greatly lowering the data threshold for large-scale generalization [4][8].

Group 2: Data Pyramid and Training Methodology
- Vidar is trained on a three-tier data pyramid: vast amounts of general video data, a medium amount of embodied video data, and a small amount of robot-specific data, which together enable effective training and generalization [8][12].
- A unified observation space, built by stitching multiple camera views into a single frame, lets massive internet video and robot-specific task data share one input format, achieving genuine multi-view, cross-source integration (a minimal sketch is given after this summary) [14].

Group 3: Performance Metrics and Results
- After embodied pre-training, the underlying Vidu video model showed significant improvements in subject consistency, background consistency, and imaging quality, which underpins its few-shot generalization [13].
- Vidar achieved superior success rates across 16 common robotic manipulation tasks, excelling in particular at generalizing to unseen tasks and backgrounds while adhering closely to task instructions [27][29].

Group 4: Automation and Efficiency
- The Automated Task-Agnostic Random Actions (ATARA) method automates the collection of task-agnostic action data; only 10 hours of automatically collected data are needed to achieve full action-space generalization for a new robot (see the second sketch below) [16].
- The AnyPos model applies high-precision action prediction to improve execution accuracy, reaching a success rate close to 100% in real-world task trajectory replay tests and surpassing baselines by 33-44% [18][22].
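To make the "unified observation space" idea concrete, the following is a minimal sketch, not the authors' released code, of how several camera views might be stitched into one fixed-size observation frame before being fed to a video model; the view count, tile resolution, and helper name `unify_views` are illustrative assumptions.

```python
import numpy as np
import cv2  # assumption: OpenCV is used for resizing; any image library would do


def unify_views(frames, tile_hw=(224, 224)):
    """Stitch per-camera frames into one canvas (illustrative sketch).

    frames : list of HxWx3 uint8 arrays, one per camera view.
    tile_hw: (height, width) each view is resized to before stitching.

    Each view is resized to a common tile size and the tiles are concatenated
    side by side, so videos recorded with different camera rigs share one
    observation format. Missing views are padded with black tiles so the
    layout stays fixed.
    """
    h, w = tile_hw
    max_views = 3  # assumed fixed number of view slots in the unified layout
    tiles = []
    for i in range(max_views):
        if i < len(frames):
            tiles.append(cv2.resize(frames[i], (w, h)))
        else:
            tiles.append(np.zeros((h, w, 3), dtype=np.uint8))  # empty slot
    return np.concatenate(tiles, axis=1)  # one (h, max_views * w, 3) frame


# Example: two wrist-camera frames plus one missing view -> single 224x672 frame
views = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(2)]
unified = unify_views(views)
print(unified.shape)  # (224, 672, 3)
```

Likewise, the ATARA notion of collecting task-agnostic action data automatically can be pictured as sampling random joint-space targets within safe limits and logging the resulting observations. The sampling routine and joint limits below are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np


def sample_task_agnostic_actions(joint_low, joint_high, n_steps=1000, seed=0):
    """Sample random joint-space targets within limits (illustrative sketch).

    joint_low, joint_high : per-joint position limits (radians).
    Returns an (n_steps, n_joints) array of target positions that cover the
    action space without reference to any particular task.
    """
    rng = np.random.default_rng(seed)
    low = np.asarray(joint_low, dtype=float)
    high = np.asarray(joint_high, dtype=float)
    return rng.uniform(low, high, size=(n_steps, len(low)))


# Example: a 6-DoF arm with symmetric +/- 1.5 rad limits (hypothetical values)
targets = sample_task_agnostic_actions([-1.5] * 6, [1.5] * 6, n_steps=100)
print(targets.shape)  # (100, 6)
```

Executing such randomly sampled targets on the real arm and recording the camera streams would yield task-agnostic trajectories of the kind ATARA collects; the actual method's safety constraints and scheduling are not reproduced here.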