Visual World Model
CVPR 2026 | An AI Cambrian Moment? ByteDance's New World Model Learns Real-World Knowledge from Vision Alone
机器之心 · 2026-03-07 11:20
Core Viewpoint
- The article introduces VideoWorld 2, a visual world model developed by ByteDance's Doubao model team in collaboration with Beijing Jiaotong University, which enables AI to learn complex real-world tasks directly from video data, without relying on language models [2][4].

Group 1: Model Overview
- VideoWorld 2 is designed to learn complex, long-horizon real-world knowledge solely through video observation, distinguishing it from existing models that depend on language or labeled data [4][5].
- The model can perform intricate tasks such as origami and LEGO assembly, which demand fine-grained manipulation and long-term planning, achieving a task success rate more than 70% higher than current leading video models such as Sora 2, Veo 3, and Wan 2.2 [4][21].

Group 2: Learning Mechanism
- The key to VideoWorld 2's learning ability lies in decoupling task-critical actions from irrelevant visual detail, using a dynamics-enhanced latent dynamics model (dLDM) to improve learning efficiency and effectiveness [4][16].
- The model pairs a MAGVITv2-style encoder-decoder with a pre-trained video diffusion model (VDM): the encoder compresses inter-frame changes into compact latent codes that capture the core dynamics, while the VDM renders them back into video, avoiding overfitting to irrelevant visual detail [16][18].

Group 3: Experimental Setup
- The team constructed two experimental environments, video handcrafting and video robot manipulation, to evaluate the model's ability to understand control rules and plan tasks [8][9].
- The handcrafting videos cover varied scenes with intricate actions and environmental changes, making them an ideal testbed for assessing complex knowledge learning [8].

Group 4: Results and Visualization
- The dLDM was shown to extract similar motion patterns from a large number of real-world videos, improving the model's ability to learn generalizable strategies [22][25].
- A UMAP visualization shows that VideoWorld 2 clusters similar actions across different environments more tightly than its predecessor, indicating better extraction of commonalities and more generalized knowledge [25].

Group 5: Future Directions
- The team argues that visual learning is crucial for advancing AI toward higher intelligence, and aims to build models that can autonomously perceive, reason, and act on complex real-world knowledge structures [26].
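The decoupling idea described above — encode only what changes between frames and discard static appearance, so the same motion yields the same latent code in different environments — can be illustrated with a toy sketch. Everything here (the frame-difference encoding, the random-projection encoder, the array shapes) is illustrative only, not the paper's actual dLDM architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_dynamics(frames, proj):
    """Encode only the change between consecutive frames into a
    low-dimensional latent, discarding static appearance (a toy
    stand-in for the dLDM's action/appearance decoupling)."""
    diffs = np.diff(frames, axis=0)  # (T-1, H*W): frame-to-frame change
    return diffs @ proj              # (T-1, d): compact "action" codes

H = W = 8   # toy frame size
T = 5       # frames per clip
d = 4       # latent dimension
proj = rng.normal(size=(H * W, d)) / np.sqrt(H * W)  # fixed toy encoder

# The same motion (a bright pixel drifting along the diagonal) ...
motion = np.zeros((T, H, W))
for t in range(T):
    motion[t, t, t] = 1.0

# ... placed over two different static backgrounds ("environments").
bg_a = rng.normal(size=(H, W)) * 0.5
bg_b = rng.normal(size=(H, W)) * 0.5
video_a = (motion + bg_a).reshape(T, -1)
video_b = (motion + bg_b).reshape(T, -1)

za = latent_dynamics(video_a, proj)
zb = latent_dynamics(video_b, proj)

# The static background cancels in the frame difference, so both
# environments produce identical latent action codes.
print(np.allclose(za, zb))  # True
```

This is the property the article's UMAP result points at: when the representation depends only on dynamics, clips showing the same action in different scenes land close together in latent space.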