Published in Nature! BAAI Unveils AI All-Rounder Emu3, Unifying Multimodal Learning
BioWorld (生物世界) · 2026-01-31 03:05

Core Viewpoint - The article introduces Emu3, a multimodal large model developed by the Beijing Academy of Artificial Intelligence (BAAI), which unifies the learning of text, images, and videos through next-token prediction, potentially reshaping the AI landscape [2][3].

Multimodal Learning - Multimodal learning refers to an AI system's ability to process several types of information at once, much as humans integrate their senses. A single, unified algorithm that can both learn from and generate multiple modalities has been a long-standing challenge in the field [6].

Emu3's Mechanism - Emu3 takes a simple yet effective approach: it converts data from every modality into discrete tokens and trains a single Transformer to predict the next token, the same recipe behind the success of the GPT series of language models [6][7].

Training Process - Emu3 is built in three stages:
1. Pre-training on large-scale multimodal data, with the loss weights of text and visual tokens balanced so that visual tokens do not dominate [10].
2. Post-training, which fine-tunes generation quality and incorporates human preference optimization [10].
3. Inference, which supports classifier-free guidance for low-latency, high-throughput generation [11].

Performance Comparison - Emu3 matches or exceeds specialized models across a range of tasks:
- Image generation: a human preference score of 70.0, surpassing Stable Diffusion v1.5 (59.3) and SDXL (66.9) [13].
- Video generation: 81.0 on the VBench benchmark, comparable to mainstream diffusion models [13].
- Vision-language understanding: an average of 62.1 across 12 benchmarks, rivaling models such as LLaVA-1.6 [13].
- Robotic manipulation: an 87.0% success rate in a simulated environment [13].

Significance of the Research - Emu3's significance lies not only in its performance but also in its simplification of the paradigm: it demonstrates that next-token prediction can serve as a core paradigm for multimodal models, paving the way for more powerful "world models" that integrate perception, language, and action [15][17].

Future Developments - Building on Emu3, the research team has introduced Emu3.5, which uses large-scale training on long video sequences to better model the dynamics of the physical world; the team also observes that multimodal capability continues to improve as model and data scale grow [15].
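The core recipe described above, a single Transformer predicting the next discrete token with per-modality loss weights, can be sketched in a few lines. The toy token-id split, the specific weight values, and the function name below are illustrative assumptions, not Emu3's actual configuration:

```python
import math

# Toy vocabulary layout (an assumption for illustration): ids below
# TEXT_VOCAB are text tokens; the rest are discrete visual tokens
# produced by a vision tokenizer.
TEXT_VOCAB = 4
VOCAB = 8
TEXT_WEIGHT = 1.0
VISUAL_WEIGHT = 0.5  # assumed down-weighting so visual tokens do not dominate

def weighted_next_token_loss(logits, targets):
    """Weighted cross-entropy over a mixed text/visual token stream.

    logits:  one list of VOCAB scores per position
    targets: the ground-truth next token id at each position
    """
    total, weight_sum = 0.0, 0.0
    for step_logits, target in zip(logits, targets):
        # log-softmax probability of the target token (numerically stable)
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        log_p = step_logits[target] - log_z
        w = TEXT_WEIGHT if target < TEXT_VOCAB else VISUAL_WEIGHT
        total += -w * log_p
        weight_sum += w
    return total / weight_sum
```

With uniform logits every token costs log(VOCAB) nats regardless of modality; the weights only change how much each modality contributes to the averaged loss, which is the balancing effect the article describes.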
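The article also notes that inference supports classifier-free guidance. Applied in token space, this amounts to combining logits from a conditional and an unconditional forward pass; a minimal sketch, with function names chosen here for illustration:

```python
def cfg_logits(cond, uncond, scale):
    """Classifier-free guidance: push the conditional logits further away
    from the unconditional ones by a guidance scale.

    scale = 1.0 recovers the plain conditional distribution; larger
    scales trade diversity for stronger adherence to the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def greedy_next_token(cond, uncond, scale):
    """Pick the highest-scoring token after guidance (greedy decoding)."""
    guided = cfg_logits(cond, uncond, scale)
    return max(range(len(guided)), key=guided.__getitem__)
```

In practice the guided scores would feed a sampler rather than a greedy argmax, but the arithmetic of the guidance step is the same.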
