LeCun Steps In: A Video World Model Built to Challenge NVIDIA's COSMOS

Core Viewpoint
- The article presents DINO-world, a new video world model designed to predict future frames efficiently and effectively across diverse environments, in the context of artificial intelligence and machine learning [9][10].

Data Challenges
- Acquiring large-scale, high-quality video datasets is costly, especially when action annotations are required; successful applications of world models so far remain confined to specific domains such as autonomous driving and video games [5].
- Accurately modeling physical laws and behaviors in unconstrained, partially observable environments remains a significant challenge, even over short time scales, and advanced pixel-based generative models consume enormous compute, with training budgets reaching up to 22 million GPU hours for models like COSMOS [6].

Model Development
- DINO-world pre-trains a video world model in the latent space of a frozen visual encoder (DINOv2), then fine-tunes it with action data for planning and control [9]; a latent-space training sketch is given after this summary.
- Compared with current state-of-the-art models, this architecture significantly reduces resource consumption during both training and inference [10].

Training and Evaluation
- DINO-world was trained on roughly 60 million uncurated web videos, enabling it to learn features that transfer across domains [11].
- On the VSPW segmentation prediction task, DINO-world improved mean Intersection over Union (mIoU) on future-frame prediction by 6.3% over the second-best model [13] (the metric is recalled in a short snippet below).

Methodology
- The frame encoder does not model pixels directly; it operates on latent representations of video patches, which substantially lowers the computational cost of training the predictor [19].
- The training objective is next-frame prediction, which allows efficient parallelization and focuses the loss on the most relevant tokens [27]; see the training sketch below.

Action-Conditioned Fine-Tuning
- DINO-world can be adapted to action-conditioned tasks by adding an action module that updates the query vectors with the corresponding actions; this module can be trained on a small dataset of action-conditioned trajectories [30][33] (see the action-module sketch below).

Experimental Results
- DINO-world delivered superior performance on dense prediction tasks across datasets including Cityscapes, VSPW, and KITTI, validating the effectiveness of the proposed paradigm [37][38].
- In intuitive-physics tests, the model showed a strong grasp of physical behavior, comparable to larger models such as V-JEPA [40][41].

Planning Evaluation
- The action-conditioned model, trained on offline trajectories, showed significant improvements over models trained from scratch, particularly in more complex environments [44] (see the planning sketch below).
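Below is a minimal sketch of the latent-space recipe summarized above: a frozen DINOv2 encoder turns each frame into patch tokens, and a causal transformer predictor is trained with a next-frame prediction objective on those tokens, so no pixels are ever generated. The predictor class, its hyperparameters, the masking scheme, and the smooth-L1 loss are illustrative assumptions, not the paper's exact implementation; the `forward_features` / `x_norm_patchtokens` call follows the public torch.hub DINOv2 models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPredictor(nn.Module):
    """Predicts the patch tokens of frame t+1 from all frames up to t (hypothetical)."""
    def __init__(self, dim=768, depth=6, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, dim)

    def forward(self, tokens, causal_mask):
        # tokens: (B, T*N, dim), frames flattened in time order; the boolean mask
        # prevents any token from attending to tokens of later frames.
        return self.head(self.backbone(tokens, mask=causal_mask))

def frame_causal_mask(num_frames, tokens_per_frame, device):
    # True entries are masked: a token may only attend to the same or earlier frames.
    frame_id = torch.arange(num_frames, device=device).repeat_interleave(tokens_per_frame)
    return frame_id[None, :] > frame_id[:, None]

def next_frame_loss(encoder, predictor, frames):
    # frames: (B, T, 3, H, W). The encoder (e.g. a torch.hub DINOv2 ViT) is frozen,
    # so gradients only reach the predictor.
    B, T = frames.shape[:2]
    with torch.no_grad():
        feats = encoder.forward_features(frames.flatten(0, 1))["x_norm_patchtokens"]
    N, D = feats.shape[1], feats.shape[2]
    feats = feats.view(B, T, N, D)
    inputs = feats[:, :-1].reshape(B, (T - 1) * N, D)   # frames 0 .. T-2
    targets = feats[:, 1:].reshape(B, (T - 1) * N, D)   # frames 1 .. T-1
    mask = frame_causal_mask(T - 1, N, frames.device)
    preds = predictor(inputs, mask)
    # All frames are supervised in parallel (teacher forcing); the loss choice is illustrative.
    return F.smooth_l1_loss(preds, targets)
```

Because the targets are frozen encoder features rather than pixels, every frame in a clip can be supervised in one parallel pass through the predictor, which is consistent with the parallelization and training-cost points above.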
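For reference, the mIoU score behind the VSPW result is the standard per-class intersection-over-union averaged over classes; a minimal version (not code from the paper) looks like this:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    # pred, target: integer label maps of identical shape.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                     # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```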
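The action-conditioned fine-tuning step can be pictured as a small module that embeds each action and folds it into the per-frame query tokens consumed by the pre-trained predictor. The additive conditioning and the name `ActionModule` are assumptions for illustration; the paper's exact mechanism may differ.

```python
import torch
import torch.nn as nn

class ActionModule(nn.Module):
    def __init__(self, action_dim, dim=768):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(action_dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, query_tokens, actions):
        # query_tokens: (B, T, N, dim) latent queries, one set per frame.
        # actions:      (B, T, action_dim) action taken at each timestep.
        a = self.embed(actions)                 # (B, T, dim)
        return query_tokens + a[:, :, None, :]  # broadcast over the N patch tokens
```

During fine-tuning, only this module (and, optionally, the predictor) would be optimized on a small set of action-annotated trajectories, with the DINOv2 encoder kept frozen throughout.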
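Finally, a generic illustration of how such an action-conditioned latent model can be used for planning: sample candidate action sequences, roll each out in latent space, and execute the first action of the sequence whose predicted outcome lands closest to a goal latent. The random-shooting planner and the `rollout` helper are hypothetical; the paper's planning evaluation protocol is not reproduced here.

```python
import torch

def plan_random_shooting(rollout, current_latent, goal_latent,
                         horizon=8, num_candidates=256, action_dim=4):
    # current_latent, goal_latent: 1-D latent vectors of the same dimensionality.
    # rollout(latents, actions) is a hypothetical helper that steps the
    # action-conditioned predictor and returns the final predicted latents.
    candidates = torch.randn(num_candidates, horizon, action_dim)
    start = current_latent.expand(num_candidates, -1)        # tile the current state
    final_latents = rollout(start, candidates)                # (num_candidates, D)
    costs = (final_latents - goal_latent).pow(2).sum(dim=-1)  # distance to the goal
    best = costs.argmin()
    return candidates[best, 0]  # execute only the first action (MPC-style)
```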