DiT4DiT
"Invisible" for 14 months after leaving Tesla, Yang Shuo's startup finally shows its hand: redefining the robot training paradigm
量子位 · 2026-03-24 23:52
Core Viewpoint
- Yang Shuo, co-founder and CTO of Mondo Robotics, had stayed out of public view since leaving Tesla's Optimus team more than a year ago, but has now unveiled the company's work on a new model called DiT4DiT, which trains robots on video to strengthen their action capabilities and adaptability across scenarios [1][2].

Group 1: DiT4DiT Model Overview
- DiT4DiT is an end-to-end model that integrates video diffusion and action diffusion into a cascading framework for robot learning (a minimal code sketch of this cascade follows the summary below) [9].
- The model employs a design called "intermediate denoising," which extracts key features partway through the video generation process to guide the robot's action decisions without waiting for a complete video output [11][12].
- The model's performance has been validated on the LIBERO benchmark, where it achieves a 98.6% average success rate, a state-of-the-art result [30].

Group 2: Key Design Features
- The model's two critical designs are intermediate denoising and a three-timestep scheme, which together allow the video-generation and action-prediction tasks to be trained efficiently [10][25].
- Intermediate denoising extracts features from a specific layer during the denoising steps, optimizing the robot's understanding of physical interactions rather than the visual clarity of the generated video (see the second sketch below) [19][22].
- The three-timestep scheme lets the video model and the action model operate independently yet cohesively, improving convergence speed by 7x and data efficiency by more than 10x (one possible reading is sketched below) [29].

Group 3: Practical Applications and Performance
- DiT4DiT has been deployed on the Yuzhu G1 humanoid robot, completing tasks such as flower arrangement and drawer interactions, outperforming pre-trained baselines and showing strong deployment potential on robot edge chips [41][42][43].
- The design allows the model to adapt quickly to new objects and scenarios, addressing a limitation of traditional vision-language-action models, which struggle with dynamic physical understanding [36][40].
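The article describes DiT4DiT as a cascade in which a video diffusion transformer (DiT) feeds an action diffusion transformer. Below is a minimal PyTorch sketch of that cascading idea only; every class and parameter name (VideoDiT, ActionDiT, tap_layer, the dimensions) is a hypothetical stand-in, since Mondo Robotics has not published the architecture, and timestep conditioning is omitted for brevity.

```python
# Toy sketch of a cascaded video-diffusion -> action-diffusion model.
# All names and sizes are illustrative assumptions, not DiT4DiT's actual design.
import torch
import torch.nn as nn


class VideoDiT(nn.Module):
    """Toy stand-in for a video diffusion transformer."""

    def __init__(self, patch_dim: int = 3 * 16 * 16, hidden_dim: int = 256,
                 num_layers: int = 4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, hidden_dim)    # flattened video patches in
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.unembed = nn.Linear(hidden_dim, patch_dim)  # back to patch space

    def forward(self, noisy_patches, tap_layer: int):
        """One denoising pass; also returns features tapped at `tap_layer`."""
        h, tapped = self.embed(noisy_patches), None
        for i, layer in enumerate(self.layers):
            h = layer(h)
            if i == tap_layer:
                tapped = h  # intermediate features handed to the action model
        return self.unembed(h), tapped


class ActionDiT(nn.Module):
    """Toy action diffusion head, cross-attending to video features."""

    def __init__(self, hidden_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.embed = nn.Linear(action_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                                batch_first=True)
        self.out = nn.Linear(hidden_dim, action_dim)

    def forward(self, noisy_actions, video_feats):
        q = self.embed(noisy_actions)
        attended, _ = self.cross_attn(q, video_feats, video_feats)
        return self.out(attended)  # predicted action noise


video_model, action_model = VideoDiT(), ActionDiT()
patches = torch.randn(2, 32, 3 * 16 * 16)  # (batch, video tokens, patch dim)
actions = torch.randn(2, 8, 7)             # (batch, horizon, action dim)
_, feats = video_model(patches, tap_layer=2)
print(action_model(actions, feats).shape)  # torch.Size([2, 8, 7])
```

The cascading point is that the action head never sees raw pixels: it conditions only on features produced inside the video denoiser.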
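To illustrate "intermediate denoising," the sketch below (continuing the toy classes and instances above) runs only a short prefix of the reverse-diffusion schedule and hands the partially denoised features to the action model, rather than generating a complete video first. The step counts and the fixed tap layer are arbitrary assumptions.

```python
import torch


@torch.no_grad()
def act_early(video_model, action_model, noisy_patches, noisy_actions,
              tap_step: int = 10):
    """Run only `tap_step` reverse-diffusion steps, then act."""
    x, feats = noisy_patches, None  # start from pure-noise video latents
    for _ in range(tap_step):       # a short prefix of the full schedule
        x, feats = video_model(x, tap_layer=2)
        # (a real sampler would also apply a noise-schedule update here)
    # act from partially denoised features; the rest of the video schedule
    # (and any pixel-space decoding) is never run
    return action_model(noisy_actions, feats)


out = act_early(video_model, action_model,
                torch.randn(2, 32, 3 * 16 * 16), torch.randn(2, 8, 7))
print(out.shape)  # torch.Size([2, 8, 7])
```

This matches the article's claim that the robot optimizes for physical understanding rather than for rendering a fully sharp video: the tapped features only need to be informative enough for action decisions.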
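The article does not spell out what the "three-timestep scheme" consists of. One plausible reading is that training samples three independent diffusion timesteps per example: one for noising the video, one for the partially denoised state whose features condition the action head, and one for noising the action chunk. The training step below is a sketch under that assumption only; the linear noising rule, loss weighting, and all hyperparameters are illustrative, and it reuses the toy models from the first sketch.

```python
import torch
import torch.nn.functional as F

T = 1000  # diffusion horizon (illustrative)


def train_step(video_model, action_model, clean_patches, clean_actions, opt):
    b = clean_patches.size(0)
    # three independent timesteps per sample -- the assumed "three-timestep scheme"
    t_video = torch.randint(1, T, (b,))
    t_tap = torch.randint(1, T, (b,))
    t_action = torch.randint(1, T, (b,))

    def noise(x0, t):
        # toy linear noising: x_t = (1 - t/T) * x_0 + (t/T) * eps
        eps = torch.randn_like(x0)
        a = (1 - t.float() / T).view(-1, 1, 1)
        return a * x0 + (1 - a) * eps, eps

    noisy_video, video_eps = noise(clean_patches, t_video)
    tap_video, _ = noise(clean_patches, t_tap)
    noisy_actions, action_eps = noise(clean_actions, t_action)

    # video branch: plain denoising objective at its own timestep
    video_pred, _ = video_model(noisy_video, tap_layer=2)
    video_loss = F.mse_loss(video_pred, video_eps)

    # action branch: conditioned on features tapped at an independent timestep
    _, feats = video_model(tap_video, tap_layer=2)
    action_pred = action_model(noisy_actions, feats)
    action_loss = F.mse_loss(action_pred, action_eps)

    loss = video_loss + action_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


params = list(video_model.parameters()) + list(action_model.parameters())
opt = torch.optim.Adam(params, lr=1e-4)
print(train_step(video_model, action_model,
                 torch.randn(2, 32, 3 * 16 * 16), torch.randn(2, 8, 7), opt))
```

Decoupling the three timesteps would let each branch cover its own noise schedule on every batch while gradients from the action loss still flow into the video model, which is consistent with the "independent yet cohesive" description, though the actual mechanism behind the reported 7x convergence and 10x data-efficiency gains is not disclosed.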