Just Released: BAAI's Wujie·Emu3.5 (悟界·Emu3.5) Debuts with Native World-Modeling Capabilities
机器之心 (Jiqizhixin) · 2025-10-30 08:52
Core Insights
- The article discusses the release of the latest multimodal model, Emu3.5, by the Beijing Academy of Artificial Intelligence (BAAI), highlighting its capabilities and innovations in the field of AI [3][4][6].

Model Overview
- Emu3.5 is defined as a "Multimodal World Foundation Model," which distinguishes itself from other generative models through its inherent world modeling capabilities [4][5].
- The model has been trained on over 10 trillion multimodal tokens, primarily sourced from internet videos totaling approximately 790 years in duration, allowing it to internalize the dynamic laws of the physical world [5][16].

Technological Innovations
- Emu3.5 introduces the "Discrete Diffusion Adaptation" (DiDA) technology, which enhances image inference speed by nearly 20 times with minimal performance loss, making it competitive with top closed-source diffusion models [6][24].
- The model's architecture is based on a 34 billion parameter dense transformer, focusing on "Next-State Prediction" to unify its objectives [11][17].

Performance and Capabilities
- Emu3.5 demonstrates state-of-the-art performance in various tasks, including image editing and generation, visual narrative creation, and visual guidance, outperforming competitors like Google's Gemini-2.5-Flash-Image [28][35].
- The model can generate coherent visual narratives and step-by-step visual tutorials, marking a significant advancement from traditional multimodal models [13][14].

Training Process
- The training process consists of four core stages: large-scale pre-training, fine-tuning on high-quality datasets, large-scale multimodal reinforcement learning, and efficient autoregressive inference acceleration [17][21][22][24].
- The model's training data includes a vast array of visual-language interleaved data, allowing it to learn about physical dynamics and causality [16][41].
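
To make the "Next-State Prediction" objective and the visual-language interleaved data above more concrete, here is a minimal sketch of how such training could look at the token level. It is not BAAI's implementation. It assumes images are represented as discrete tokens that share one vocabulary with text, and that the objective reduces to ordinary next-token cross-entropy over the interleaved stream. The vocabulary sizes, the BOI/EOI markers, and all function names are hypothetical, and a uniform distribution stands in for the transformer.

```python
import math

# Hypothetical vocabulary layout (an assumption for illustration only):
# ids 0..TEXT_VOCAB-1 are text tokens, the next VISION_VOCAB ids are
# discrete vision tokens, followed by two image boundary markers.
TEXT_VOCAB = 1000
VISION_VOCAB = 500
BOI = TEXT_VOCAB + VISION_VOCAB      # begin-of-image marker
EOI = BOI + 1                        # end-of-image marker
VOCAB_SIZE = EOI + 1


def interleave(segments):
    """Flatten ("text", ids) / ("image", ids) segments into one token stream."""
    stream = []
    for kind, ids in segments:
        if kind == "text":
            stream.extend(ids)
        elif kind == "image":
            # Shift vision codes into their own id range and wrap them in markers.
            stream.append(BOI)
            stream.extend(TEXT_VOCAB + i for i in ids)
            stream.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return stream


def next_token_pairs(stream):
    """Inputs are tokens 0..n-2; targets are the same stream shifted by one."""
    return stream[:-1], stream[1:]


def cross_entropy(prob_fn, inputs, targets):
    """Mean negative log-likelihood of each target given its prefix."""
    total = 0.0
    for t, target in enumerate(targets):
        probs = prob_fn(inputs[: t + 1])   # distribution over VOCAB_SIZE ids
        total -= math.log(probs[target])
    return total / len(targets)


def uniform_model(prefix):
    """Stand-in for a trained transformer: ignores the prefix entirely."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE


segments = [
    ("text", [5, 17, 42]),       # e.g. an instruction step
    ("image", [3, 3, 7, 1]),     # e.g. the frame showing that step
    ("text", [8, 99]),           # e.g. the next instruction
]
stream = interleave(segments)
inputs, targets = next_token_pairs(stream)
loss = cross_entropy(uniform_model, inputs, targets)
```

Because text and image tokens sit in one stream, the same loss covers "predict the next word" and "predict the next visual state", which is one plausible reading of how a single objective can unify the model's tasks. With the uniform stand-in model the loss equals `math.log(VOCAB_SIZE)`; a trained model would be expected to score lower.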

Future Implications
- Emu3.5 is positioned as a foundational model for future developments in embodied intelligence, capable of generating diverse virtual environments and task planning data [39][41].
- The open-sourcing of Emu3.5 is expected to provide a robust new foundation for the global AI research community [7][45].