Training Still Has Huge Room for Scaling! Zhiyuan Institute's Wang Zhongyuan: Video Data Has Not Yet Been Fully Utilized

Core Viewpoint
- The article describes AI's transition from the digital world to the physical world as a critical turning point in the third wave of AI development, and covers the Zhiyuan Institute's introduction of the "Wujie" series of large models [12][13][14].

Group 1: AI Development and Trends
- AI is at a pivotal moment: large models are driving the shift from weak AI to general AI, and from specialized robots (1.0) to general embodied intelligence (2.0) [3][13].
- The "Wujie" series of large models aims to bridge the gap between the digital and physical worlds, representing a significant advance in AI capabilities [4][14].
- The Emu3.5 model, part of the Wujie series, uses a unified autoregressive architecture to move from Next-Token Prediction to Next-State Prediction, indicating a new phase in multimodal learning [17][22].

Group 2: Emu3.5 Model Features
- Emu3.5 distinguishes itself by learning from long videos, which contain the rich temporal, spatial, and causal information essential for understanding the physical world [18][20].
- The training data for Emu3.5 has expanded from 15 years to 790 years of video, and the model has grown from 8 billion to 34 billion parameters [23].
- Emu3.5's autoregressive architecture supports fast image generation, reaching speeds comparable to top models through proprietary DiDA technology [23].

Group 3: Multimodal Learning and Applications
- Emu3.5 is expected to lead AI into a new stage of multimodal world learning, with substantial scaling potential because vast amounts of multimodal data remain underutilized [24].
- The model demonstrates strong multimodal reasoning and visual understanding, as shown by its performance on image generation and editing tasks [25][27].
- Emu3.5 excels at tasks involving temporal and spatial state prediction, reflecting its strong understanding of the physical world [29][31].

Group 4: Embodied Intelligence and Technological Advancements
- The Zhiyuan Institute is addressing the challenges of embodied intelligence, which currently suffers from usability and generality issues [34].
- The institute has developed a comprehensive technology stack centered on RoboBrain, enabling cross-robot data collection and standardization [35].
- Recent advances include RoboBrain2.0, which can decompose complex human instructions for execution by various robots, broadening the practical applications of embodied intelligence [36].

Group 5: Open Source Contributions
- The Zhiyuan Institute has committed to open-source practices, releasing over 200 models and 100 datasets, with global downloads exceeding 690 million for models and 4 million for datasets [38].
- The institute collaborates with over 30 leading robotics companies to promote the development of embodied intelligence world models [38].