Workflow
Scaling范式
icon
Search documents
智源悟界·Emu3.5发布,开启“下一个状态预测”!王仲远:或开启第三个 Scaling 范式
AI前线· 2025-11-01 05:33
Core Insights - The article discusses the launch of the world's first native multimodal world model, Emu3, by Zhiyuan Research Institute, which predicts the next token without diffusion models or combination methods, achieving a unified approach to images, text, and video [2] - Emu3.5, released a year later, enhances the model's capabilities by simulating human natural learning and achieving generalized world modeling ability through Next-State Prediction (NSP) [2][3] - The core of the world model is the prediction of the next spatiotemporal state, which is crucial for embodied intelligence [2] Model Features - Emu3.5 has three main characteristics: understanding high-level human intentions and generating detailed action paths, seamless integration of world understanding, planning, and simulation, and providing a cognitive foundation for generalized interaction between AI and humans or physical environments [3] - The model's architecture allows for the integration of visual and textual tokens, enhancing its scalability and performance [8] Technological Innovations - Emu3.5 underwent two phases of pre-training on approximately 13 trillion tokens, focusing on visual resolution diversity and data quality, followed by supervised fine-tuning on 150 billion samples [12][13] - A large-scale native multimodal reinforcement learning system was developed, featuring a comprehensive reward system that balances multiple quality standards and avoids overfitting [14] - The introduction of DiDA technology significantly accelerated inference speed by 20 times, allowing the autoregressive model to compete with diffusion models in performance [17][19] Industry Impact - The evolution from Emu3 to Emu3.5 demonstrates the potential for scaling in the multimodal field, similar to advancements seen in language models [6] - Emu3.5 represents a significant original innovation in the AI large model field, combining algorithmic, engineering, and data training innovations [9] - The model's ability to understand causal relationships and spatiotemporal dynamics positions it uniquely in the landscape of AI models, potentially opening a new avenue for large models [20]