如何做到的？20分钟机器人真机数据，即可跨本体泛化双臂任务

Core Insights - Vidar represents a significant breakthrough in the field of embodied intelligence, being the first global model to transfer video understanding capabilities to physical decision-making systems [2] - The model innovatively constructs a multi-view video prediction framework that supports collaborative tasks for dual-arm robots, demonstrating state-of-the-art performance while exhibiting significant few-shot learning advantages [2] - The model requires only 20 minutes of real robot data to generalize quickly to new robot bodies, significantly reducing the data requirements compared to industry-leading models [2][6] Group 1 - Vidar is based on a general video model and achieves systematic migration of video understanding capabilities [2] - The model's data requirement is approximately one-eighth of the leading RDT model and one-thousand-two-hundredth of π0.5, greatly lowering the barrier for large-scale generalization in robotics [2] - After fine-tuning, the model can perform multi-view dual-arm tasks effectively, executing commands as instructed [2] Group 2 - The Tsinghua University team proposed a new paradigm to address challenges in embodied intelligence, breaking down tasks into "prediction + execution" [6] - This approach utilizes visual generative models like Vidar to learn target predictions from vast amounts of internet video, while employing task-agnostic inverse dynamics models like Anypos for action execution [6] - The method significantly reduces the dependency on large-scale paired action-instruction data, requiring only 20 minutes of task data to achieve high generalization [6] Group 3 - The presentation includes an overview and demonstration video, discussing the rationale for utilizing video modalities and considering embodied video base models [8] - It covers the training of Vidar and the concept of task-agnostic actions with AnyPos [8] - The speaker, Hengkai Tan, is a PhD student at Tsinghua University, focusing on the integration of embodied large models and multi-modal large models [11]