World models get an open-source foundation: Emu3.5 takes multimodal SOTA, outperforming Nano Banana
量子位·2025-10-30 10:31

Core Insights
- The article covers the launch of Emu3.5, the latest open-source native multimodal world model, developed by the Beijing Academy of Artificial Intelligence (BAAI) [1]
- Emu3.5 is designed to model the dynamic physical world, moving beyond mere visual realism to a deeper comprehension of context and interactions [8][10]

Group 1: Model Capabilities
- Emu3.5 can perform high-precision tasks such as erasing handwritten marks and generating dynamic 3D environments from a first-person perspective [2][3]
- The model excels at generating coherent, logically consistent outputs, simulating dynamic physical worlds, and maintaining spatial consistency during user interactions [11][20]
- It can execute complex multi-step tasks, such as organizing a desktop by following a series of instructions, demonstrating its grasp of long-horizon sequences and spatial relationships [23][24][28]

Group 2: Technical Innovations
- Emu3.5 is a 34-billion-parameter model built on a standard decoder-only Transformer architecture, handling tasks ranging from visual storytelling to image editing [31]
- The model was pre-trained on over 10 trillion tokens of multimodal data, sourced primarily from internet videos, enabling it to learn temporal continuity and causal relationships [32]
- A powerful visual tokenizer with a vocabulary of 130,000 visual tokens enables high-fidelity image reconstruction at resolutions up to 2K [33]

Group 3: Performance and Comparisons
- In several authoritative benchmarks, Emu3.5 matches or surpasses Gemini-2.5-Flash-Image, particularly in text rendering and multimodal generation tasks [18]
- Its ability to maintain consistency and style across multiple images and instructions is rated at the industry's top level [29]

Group 4: Future Implications
- Emu3.5's open-source release lets developers and researchers worldwide build on its capabilities without starting from scratch, with the potential to transform various industries [36]
- The model's advances in generating realistic videos and intelligent agents open up broad possibilities for practical applications across sectors [37]
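The architecture described in Group 2 can be illustrated with a minimal sketch of how a native multimodal model might share one vocabulary between text and visual tokens, so a single decoder-only Transformer predicts "the next token" whether it is a word piece or an image patch code. This is an illustrative assumption, not Emu3.5's actual implementation: the 130,000-entry visual codebook size comes from the article, while the text vocabulary size, offset scheme, and function names are hypothetical.

```python
# Hypothetical sketch of a unified text + visual token space.
# Only VISUAL_VOCAB_SIZE (130,000) is reported for Emu3.5; everything
# else here (text vocab size, offset layout, names) is an assumption.

TEXT_VOCAB_SIZE = 32_000      # assumed text vocabulary size
VISUAL_VOCAB_SIZE = 130_000   # visual codebook size reported for Emu3.5


def visual_token(code: int) -> int:
    """Map a visual codebook index into the shared vocabulary,
    offset past the text-token range."""
    assert 0 <= code < VISUAL_VOCAB_SIZE
    return TEXT_VOCAB_SIZE + code


def is_visual(token_id: int) -> bool:
    """A token id at or beyond the text range encodes an image patch."""
    return token_id >= TEXT_VOCAB_SIZE


# An interleaved sequence: a short caption (text ids) followed by
# image patch codes, exactly the kind of stream a decoder-only
# Transformer would autoregress over.
sequence = [17, 942, 31_999] + [visual_token(c) for c in (0, 5, 129_999)]
```

Under this layout the combined vocabulary spans 162,000 ids, and interleaving lets one next-token objective cover both captioning-style text and image generation.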