Emu3
A large-model achievement led by a Chinese research institution appears in Nature for the first time
Guan Cha Zhe Wang· 2026-02-07 01:15
Core Insights
- The article discusses the groundbreaking AI research paper published in *Nature* by the Beijing Academy of Artificial Intelligence, introducing a multimodal model named "Emu3" that aims to unify various AI capabilities such as vision, language, and action through a single task of "next token prediction" [1][4][21].

Group 1: Emu3's Technical Innovations
- Emu3 uses a "Vision Tokenizer" that compresses a 512x512 image into just 4,096 discrete tokens, a 64:1 compression ratio, and applies additional compression to video along the temporal dimension [8][9] (a tokenizer sketch follows this summary).
- The model architecture of Emu3 is a standard language model whose vocabulary is extended with 32,768 visual tokens, diverging from the complex encoder-decoder architectures used by other models [10][11].
- Emu3 delivers strong results across tasks, scoring 70.0 in human preference evaluations for image generation, 62.1 in visual language understanding, and 81.0 in video generation, surpassing established models [11].

Group 2: Scaling Laws and Multimodal Learning
- Emu3's research confirms that multimodal learning adheres to predictable scaling laws: performance improves uniformly across modalities as training data increases [12][13].
- The findings suggest that future multimodal intelligence may not require separate training strategies for each capability, simplifying development [13].

Group 3: Comparison with Global Peers
- Emu3 is positioned against models like Meta's Chameleon and OpenAI's Sora, showing that a unified architecture can close the performance gap with specialized models [17][18].
- Unlike OpenAI's approach, which requires additional models for understanding, Emu3 integrates generation and comprehension within a single framework [18].

Group 4: Commercialization Potential
- Emu3's architecture allows for efficient deployment on existing large-language-model infrastructure, reducing operational complexity and costs [19].
- Its unified capabilities enable diverse applications, from generating instructional content to real-time video analysis, enhancing user interaction [20].

Group 5: Philosophical Implications
- Emu3 challenges the notion of fragmented intelligence by proposing that intelligence can be unified through a single predictive framework, potentially reshaping the understanding of AI's capabilities [21][22].
- The success of Emu3 suggests a paradigm shift in AI development, emphasizing simplicity and unified approaches over complexity [22].
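The figures in Group 1 pin down a concrete design: a VQ-style tokenizer that turns a 512x512 image into a 64x64 grid of 4,096 ids drawn from a 32,768-entry codebook. Below is a minimal PyTorch sketch consistent with those numbers; the class name, layer sizes, and quantization details are illustrative assumptions, not BAAI's released implementation.

```python
import torch
import torch.nn as nn

class ToyVisionTokenizer(nn.Module):
    """VQ-style tokenizer sketch: a 512x512 image -> 64x64 = 4,096 discrete
    tokens from a 32,768-entry codebook (the figures quoted above).
    Architecture and names are illustrative, not the Emu3 implementation."""

    def __init__(self, codebook_size: int = 32_768, dim: int = 256):
        super().__init__()
        # Three stride-2 convs downsample 512 -> 64 per spatial axis (8x).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        z = self.encoder(images)                        # (B, dim, 64, 64)
        B, D, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, D)     # (B*4096, D)
        # Nearest codebook entry per spatial position.
        dists = torch.cdist(flat, self.codebook.weight) # (B*4096, K)
        ids = dists.argmin(dim=-1).view(B, H * W)       # (B, 4096)
        return ids  # discrete token ids, ready for a language model

tok = ToyVisionTokenizer()
ids = tok(torch.randn(1, 3, 512, 512))
print(ids.shape)  # torch.Size([1, 4096]) -- 262,144 pixels / 4,096 = 64:1
```

Once images are just token ids, the "standard language model" framing in the second bullet follows directly: the visual ids are appended to the text vocabulary and trained with the usual cross-entropy objective.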
Exclusive interview with Wang Zhongyuan: BAAI's multimodal large model makes Nature, with a group of young researchers behind it
Xin Jing Bao· 2026-02-03 14:17
Core Insights
- The Emu3 multimodal model developed by the Beijing Academy of Artificial Intelligence has been published in the prestigious journal Nature, marking a significant achievement for China's research institutions in the field of AI [1][2].

Group 1: Emu3 Model Overview
- Emu3 is a unified architecture that simplifies both understanding and generation of text, images, and videos by using a single model built on the principle of "predicting the next token" [3][4] (a sketch of this single-sequence setup follows this summary).
- The design scales well and lowers research and development barriers, enabling more researchers and institutions to engage in cutting-edge exploration [3][4].

Group 2: Technological Advancements
- Emu3.5, the subsequent version, has been trained on over 10 trillion tokens; the total duration of its video training data grew from 15 years' worth to 790 years' worth, and the parameter count rose from 8 billion to 34 billion [6].
- This version can simulate physical-world dynamics, marking a transition from "predicting the next word or frame" to "predicting the next state," which is crucial for achieving more general intelligence [6].

Group 3: Team and Innovation
- The Emu3 development team is notably young, with the lead developer being only 29 years old, reflecting the institute's philosophy of empowering youth in AI innovation [7][8].
- The team faced significant technical challenges and skepticism from the industry but ultimately proved the viability of their approach to multimodal AI [8].

Group 4: Future Applications
- Emu3 is positioned as a foundational model for advancing AI from the digital realm to the physical world, enabling applications in robotics and autonomous driving through a robust understanding of complex environments [5][10].
- The model is expected to give rise to a new generation of native multimodal assistants capable of creating images and videos from contextual prompts, enhancing human-computer interaction [5].

Group 5: Talent Development and Institutional Support
- The Beijing Academy of Artificial Intelligence evaluates talent on impactful work rather than credentials, fostering a dynamic environment for young researchers [9][10].
- The institute operates under a flexible funding model that lets researchers focus on valuable scientific work without the pressures of traditional corporate structures [9].
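To make the "one model, one objective" idea in Group 1 concrete, here is a minimal sketch of how text and visual tokens can share a single sequence. The boundary tokens and vocabulary offsets are illustrative assumptions, not Emu3's actual ids.

```python
# Text tokens and visual tokens share one vocabulary, so a single
# next-token-prediction loss covers both modalities. BOI/EOI markers and
# the vocabulary layout below are assumptions for illustration only.
TEXT_VOCAB = 100_000                     # assumed text vocabulary size
BOI, EOI = TEXT_VOCAB, TEXT_VOCAB + 1    # assumed image-boundary tokens
VISION_OFFSET = TEXT_VOCAB + 2           # visual ids live after text ids

def interleave(text_ids: list, image_ids: list) -> list:
    """Concatenate a caption and one tokenized image into one sequence."""
    shifted = [VISION_OFFSET + i for i in image_ids]
    return text_ids + [BOI] + shifted + [EOI]

seq = interleave([17, 942, 8], [5, 31_002, 77])
# A standard decoder-only LM is then trained with ordinary cross-entropy
# on seq[1:] given seq[:-1] -- no task-specific heads or encoders needed.
```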
BAAI's multimodal large model Emu3 debuts in Nature
Ke Ji Ri Bao· 2026-02-02 05:23
Core Insights
- The launch of the multimodal large model "Emu3" by the Beijing Zhiyuan Research Institute marks a significant breakthrough in China's original innovation in artificial intelligence, as it is the first model led by a Chinese research institution to be published in the prestigious journal "Nature" [2][6]

Group 1: Model Performance
- Emu3 matches diffusion models in text-to-image tasks and exhibits visual language understanding on par with pipelines that pair CLIP with a large language model [6]
- The model can generate high-fidelity videos in a purely autoregressive manner, supporting diverse tasks such as video extension, text-image interleaved generation, and robotic manipulation modeling [6]

Group 2: Research Significance
- The research team validated the scaling law of multimodal learning through large-scale ablation experiments and confirmed that Direct Preference Optimization (DPO) carries over seamlessly to autoregressive visual generation [6] (a DPO sketch follows this summary)
- The upcoming iteration, Emu3.5, showcases a leap in capability with the ability to "predict the next state," highlighting its generalized world-modeling capabilities [6]

Group 3: Strategic Importance
- Emu3 cements the autoregressive approach's position as a unifying paradigm in generative AI, underscoring the international competitiveness of China's foundational AI research [6]
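The DPO claim in Group 2 is easy to see in code: because Emu3 emits images as ordinary token sequences, the standard text-preference loss applies unchanged. Below is the textbook DPO objective over sequence log-probs; the beta value and the single-pair demo numbers are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on sequence log-probs. For autoregressive
    visual generation only the data changes: pairs of preferred (w) and
    rejected (l) image-token sequences. beta = 0.1 is a typical value,
    an assumption here, not a figure from the paper."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# logp_* are summed token log-probs under the policy being tuned;
# ref_logp_* are the same sequences scored by a frozen reference model.
loss = dpo_loss(torch.tensor([-120.0]), torch.tensor([-140.0]),
                torch.tensor([-125.0]), torch.tensor([-139.0]))
```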
Published in Nature! BAAI launches an AI all-rounder, Emu3, unifying multimodal learning
Sheng Wu Shi Jie· 2026-01-31 03:05
Core Viewpoint
- The article discusses the introduction of Emu3, a multimodal large model developed by Beijing Academy of Artificial Intelligence, which aims to unify the learning of text, images, and videos through next-token prediction, potentially transforming the AI landscape [2][3].

Multimodal Learning
- Multimodal learning refers to the ability of AI to process various types of information simultaneously, akin to human sensory perception. Achieving a unified algorithm for learning and generating content from multiple modalities has been a long-standing challenge in the AI field [6].

Emu3's Mechanism
- Emu3 employs a simple yet effective approach: it converts all modal data into discrete tokens and uses a Transformer model to predict the next token, the same recipe behind the success of GPT-series language models [6][7].

Training Process
- The training of Emu3 consists of three stages:
  1. Pre-training on large-scale multimodal data, balancing the loss weights of text and visual tokens so that visual tokens do not dominate [10].
  2. Post-training for quality fine-tuning on generation tasks, incorporating human preference optimization [10].
  3. Inference supporting classifier-free guidance for low-latency, high-throughput generation [11] (a guidance sketch follows this summary).

Performance Comparison
- Emu3 matches or exceeds specialized models across various tasks:
  - In image generation, it achieved a human preference score of 70.0, surpassing Stable Diffusion v1.5 (59.3) and SDXL (66.9) [13].
  - In video generation, it scored 81.0 in the VBench evaluation, comparable to mainstream diffusion models [13].
  - In visual language understanding, it averaged 62.1 across 12 benchmark tests, rivaling models like LLaVA-1.6 [13].
  - In robotic manipulation, it achieved a success rate of 87.0% in a simulated environment [13].

Significance of the Research
- The significance of Emu3 lies not only in its performance but also in its simplification of paradigms: it demonstrates that next-token prediction can serve as a core paradigm for multimodal models, paving the way for more powerful "world models" that integrate perception, language, and action [15][17].

Future Developments
- Following Emu3, the research team introduced Emu3.5, which extends the model through large-scale long-sequence video training, improving its ability to model physical-world dynamics and revealing how multimodal capabilities improve as model and data scale increase [15].
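Stage 3 above mentions classifier-free guidance (CFG) at inference. For token-based generators, CFG is commonly applied by blending conditional and unconditional logits at each decoding step; the blending rule and scale below are standard practice, assumed here rather than taken from the paper.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
               scale: float = 3.0) -> torch.Tensor:
    """One autoregressive decoding step of classifier-free guidance.
    Two forward passes per step: one conditioned on the text prompt,
    one with the prompt dropped or masked. scale = 3.0 is a common
    default, an assumption, not a value reported for Emu3."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

# 32,768 matches the visual vocabulary size quoted earlier in this digest.
step_logits = cfg_logits(torch.randn(1, 32_768), torch.randn(1, 32_768))
next_token = step_logits.softmax(-1).argmax(-1)  # greedy; or sample
```

The two passes per step explain why CFG trades throughput for fidelity, which is consistent with the summary's emphasis on low-latency, high-throughput serving as a distinct engineering goal.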
Is architectural decoupling necessary for unified multimodal models? The new AIA loss says no
Ji Qi Zhi Xin· 2025-12-02 05:07
Core Insights
- The rapid development of unified understanding-and-generation models has been hampered by conflicts between visual understanding and generation tasks [2]
- Researchers from CUHK MMLab and Meituan believe that unified models will eventually match single-task models in performance, but they question whether the prevailing approach of decoupling architectures is truly beneficial [2][3]

Unified Model Intent
- The original intent of unified models is to enhance single-task performance through a transparent, well-grounded process of interleaved text-and-image reasoning [3]
- Examples include generating corresponding images while navigating mazes, or drawing auxiliary lines during mathematical problem solving [3]

Architecture Decoupling Issues
- Models like BAGEL require complex pipelines to achieve interleaved reasoning, leading to significant computational overhead and potential information loss [3]
- Despite current performance gains, the researchers warn that these issues may become more pronounced as research progresses [3]

AIA Introduction
- To explain why architecture decoupling improves performance, and to find ways to match that performance without it, CUHK MMLab and Meituan introduced AIA [5]

Research Findings
- The researchers found that regardless of how models are decoupled, understanding and generation tasks exhibit a negative correlation at the same network layer [8]
- This indicates that decoupling does not fundamentally resolve the conflicts between the tasks [8]

AIA Loss Design
- The AIA loss explicitly constrains the interaction patterns of unified models during training, using the cross-modal interaction patterns of single-task models as the learning target [10] (a rough sketch follows this summary)

AIA Effectiveness
- Experiments on Emu3 and Janus-Pro showed that AIA improves model performance without additional tricks, narrowing the performance gap with more heavily decoupled models [12]

AIA Training Sensitivity
- The AIA loss converged stably across a wide range of weight settings during training, particularly for Emu3, whose pre-training knowledge was weaker [17]
- In contrast, Janus-Pro's strong pre-training knowledge made it more sensitive to the AIA loss weight [17]

AIA Advantages
- The AIA loss also mitigates common data-ratio issues, achieving better results with a 1:1 data ratio between generation and understanding tasks, indicating a collaborative optimization effect [19]

Unified Model Training Path
- Dynamically allocating task weights during unified training may be the correct behavior for unified models, suggesting that task conflict is a natural characteristic rather than a problem to avoid [21]
- An alternative approach removes task-differentiation cues to force the model to learn a truly unified space, though this increases training difficulty [22]

Future Outlook
- AIA is a first step in analyzing the principles of unified-model training, and the authors call on more researchers to explore the field [24]
- The theory and architecture of unified models are still immature, necessitating collaborative exploration [24]
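Based only on the summary above, the AIA loss matches a unified model's cross-modal interaction pattern to that of frozen single-task teachers. The sketch below expresses that idea as a distance between attention maps restricted to cross-modal positions; the exact distance, layer selection, and masking used in the paper may well differ, so treat every detail here as an assumption.

```python
import torch

def aia_loss(unified_attn: torch.Tensor, teacher_attn: torch.Tensor,
             cross_modal_mask: torch.Tensor) -> torch.Tensor:
    """Rough sketch of the AIA idea as summarized above: push the unified
    model's cross-modal attention toward a frozen single-task model's
    interaction pattern. All specifics (MSE distance, which layers, how
    the mask is built) are illustrative assumptions.

    unified_attn, teacher_attn: (B, heads, L, L) attention probabilities.
    cross_modal_mask: (L, L) bool, True where query and key belong to
    different modalities (text <-> image positions)."""
    diff = (unified_attn - teacher_attn) ** 2
    return diff[..., cross_modal_mask].mean()

# Toy shapes: batch 2, 4 heads, sequence of 6 tokens (3 text + 3 image).
attn_u = torch.softmax(torch.randn(2, 4, 6, 6), dim=-1)
attn_t = torch.softmax(torch.randn(2, 4, 6, 6), dim=-1)
is_text = torch.tensor([True, True, True, False, False, False])
mask = is_text[:, None] ^ is_text[None, :]   # cross-modal positions only
loss = aia_loss(attn_u, attn_t, mask)
```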
In conversation with BAAI's Wang Zhongyuan: a robot's "big brain" and "small brain" may merge, but not today
AI Qian Xian· 2025-06-11 08:39
Core Insights
- The article discusses the launch of the "Wujie" series of large models by Zhiyuan Research Institute, focusing on advances in multi-modal AI technology and its applications in physical AGI [1][2][3]

Group 1: New Model Launch
- The "Wujie" series includes several models, such as Emu3, Brainμ, RoboOS2.0, RoboBrain2.0, and OpenComplex2, aimed at enhancing AI's understanding of and interaction with the physical world [1][2]
- Emu3, released in October 2024, is a native multi-modal architecture that enables large models to comprehend and reason about the world [3][4]

Group 2: Technological Advancements
- Brainμ, built on Emu3, integrates various brain signals to perform multiple neuroscience tasks, demonstrating significant performance improvements over existing models [4][5]
- RoboOS2.0 is the first open-source framework for embodied intelligence, allowing seamless integration of skills from various robot models, with a 30% performance gain over its predecessor [6][7]

Group 3: Applications and Collaborations
- Brainμ has potential applications in brain-computer interfaces, having successfully reconstructed sensory signals using portable EEG systems [5]
- The OpenComplex2 model represents a breakthrough in dynamic conformational modeling of biological molecules, enhancing the understanding of molecular interactions at atomic resolution [11][12]

Group 4: Future Directions
- The article emphasizes the ongoing evolution of large-model technology, with a focus on bridging the gap between the digital and physical worlds, which is crucial for achieving physical AGI [2][3]
- RoboBrain2.0 improves task planning and spatial reasoning, achieving a 74% increase in task-planning accuracy over its predecessor [8][9]
Focus on multimodality: the ChatGPT moment hasn't arrived; did large models "slow down" in 2025?
Bei Jing Shang Bao· 2025-06-08 13:27
Core Insights
- The emergence of multi-modal models such as Emu3 signals a shift in content generation, with the potential to understand and generate text, images, and videos through a single model [1][3]
- The rapid development of AI has led to a competitive landscape where new and existing products coexist, but the core capabilities of video generation still lag expectations [1][5]
- The commercial application of large models faces challenges, particularly in integrating visual generation with existing models, which limits scalability and effectiveness [7][8]

Multi-Modal Model Development
- Emu3, released by Zhiyuan Research Institute, is a native multi-modal model that incorporates various data types from the start of training, unlike traditional models that focus on language first [3][4]
- The current learning path for multi-modal models often degrades performance as they transition from strong language capabilities to integrating other modalities [3][4]
- Multi-modal model development is still in its early stages, with significant technical challenges remaining, particularly in filtering effective information from diverse data types [3][4]

Video Generation Challenges
- Video generation technology is at a transitional phase comparable to the evolution from GPT-2 to GPT-3, indicating substantial room for improvement [5][6]
- Key issues in video generation include narrative coherence, stability, and controllability, all essential for producing high-quality content [6]
- The industry is awaiting a breakthrough moment akin to the "ChatGPT moment" to lift video generation capabilities [6]

Commercialization and Market Growth
- The multi-modal AI market is projected to reach $2.4 billion in 2024, with a compound annual growth rate (CAGR) exceeding 28%, and is expected to grow to $128 billion by 2025, reflecting a CAGR of 62.3% from 2023 to 2025 [8]
- Integrating traditional computer-vision models with large models is seen as a potential pathway to commercial application, contingent on a favorable cost-benefit ratio [7][8]
- Companies are evolving their service models from providing platforms (PaaS) to offering tools (SaaS) and, by 2025, delivering results directly to users [8]
In conversation with BAAI president Wang Zhongyuan: AI is accelerating from the digital world toward the physical world
21 Shi Ji Jing Ji Bao Dao· 2025-06-08 11:49
Core Insights
- The rapid advancement of AI technology is shifting from digital to physical applications, with a focus on humanoid robots as practical tools rather than mere mascots [1][2]
- The development trajectory of large models is moving toward multi-modal world models, which aim to enhance AI's understanding of and interaction with the physical world [2][3]

AI Technology Development
- The performance of large language models is reaching a bottleneck, necessitating improvements through reinforcement learning, high-quality synthetic data, and activation of underutilized multi-modal data [1][2]
- The introduction of the "Wujie" series of large models, including the Emu3 multi-modal world model, signals a strategic shift toward understanding physical causal relationships [2][3]

Embodied Intelligence
- Humanoid robots are recognized for their long-term value because their design is compatible with human environments and extensive human behavior data is available for model training [3][4]
- Current limits on data volume hinder the training of models that integrate both "big brain" and "small brain" functions, indicating a need for further development [4][6]

Industry Trends
- Embodied intelligence is expected to prioritize applications in controlled environments, such as logistics and repetitive tasks, where safety and efficiency are paramount [3][4]
- The integration of "big brain" and "small brain" is acknowledged as a likely future trend, but current data limitations prevent immediate implementation [4][5]

AGI Development
- The emergence of Agents marks a new phase in which foundational models can support a range of applications, akin to mobile apps in the internet era [5][6]
- The industry is still in the early stages of embodied-intelligence development, facing challenges similar to those of the early days of AI large models [5][6]
From pre-training to world models: BAAI reshapes AI's evolution path through embodied intelligence
Di Yi Cai Jing· 2025-06-07 12:41
Group 1
- The core viewpoint of the article emphasizes the rapid development of AI and its transition from the digital world to the physical world, highlighting the importance of world models in this evolution [1][3][4]
- The 2023 Zhiyuan Conference marked a shift in focus from large language models to the cultivation of world models, signaling a new phase in AI development [1][3]
- The introduction of the "Wujie" series of large models by Zhiyuan represents a strategic move toward integrating AI with physical reality, showcasing advances in multi-modal capabilities [3][4]

Group 2
- The Emu3 model is a significant upgrade in multi-modal technology, simplifying the handling of various data types and shortening the path toward AGI (Artificial General Intelligence) [4][5]
- Large-model development is still ongoing, with potential breakthroughs expected from reinforcement learning, data synthesis, and the exploitation of multi-modal data [5][6]
- A central challenge in embodied intelligence is a paradox: limited capabilities hinder data collection, which in turn restricts model performance [6][8]

Group 3
- The industry faces poor scene generalization and limited task adaptability in robots, which constrains operational flexibility [9][10]
- Control technologies like Model Predictive Control (MPC) have advantages but also limitations, such as requiring structured environments (a minimal MPC sketch follows this summary) [10]
- The development of embodied large models is still in its early stages, with no consensus on technical routes and a need for collaborative effort on foundational challenges [10]
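The MPC limitation cited in Group 3 is visible in the method's own structure: it re-plans at every step against an explicit dynamics model, which must be known in advance. Below is a minimal random-shooting MPC sketch; the toy dynamics and cost are illustrative assumptions, not anything from the article.

```python
import numpy as np

def mpc_step(state: np.ndarray, dynamics, cost, horizon: int = 10,
             n_samples: int = 256, action_dim: int = 2) -> np.ndarray:
    """Receding-horizon MPC via random shooting. It presumes explicit,
    known `dynamics` and `cost` functions -- exactly why MPC suits
    structured environments and struggles in open-ended ones."""
    best_cost, best_first_action = np.inf, np.zeros(action_dim)
    for _ in range(n_samples):
        actions = np.random.uniform(-1, 1, size=(horizon, action_dim))
        s, total = state.copy(), 0.0
        for a in actions:          # roll the model forward
            s = dynamics(s, a)
            total += cost(s, a)
        if total < best_cost:
            best_cost, best_first_action = total, actions[0]
    return best_first_action       # execute, observe, re-plan next step

# Toy use: double-integrator-style dynamics with a quadratic cost.
dyn = lambda s, a: s + 0.1 * np.concatenate([s[2:], a])
cst = lambda s, a: float(s[:2] @ s[:2] + 0.01 * a @ a)
a0 = mpc_step(np.array([1.0, -1.0, 0.0, 0.0]), dyn, cst)
```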
BAAI releases the "Wujie" series of large models: letting AI see and understand the physical world
Jing Ji Guan Cha Wang· 2025-06-07 02:55
Core Insights
- The Beijing Zhiyuan Conference showcased the latest developments in AI, including the release of the "Wujie" series of models by the Zhiyuan Research Institute, which aims to advance AI's understanding of the physical world [2][4]
- Zhiyuan's director, Wang Zhongyuan, emphasized that the next phase of AI development requires moving beyond language models to multi-modal world models that can perceive and interact with the physical environment [4][5]

Model Releases
- The "Wujie" series spans Emu3, Brainμ, RoboOS 2.0, RoboBrain 2.0, and OpenComplex2, each designed to enhance AI's ability to understand and interact with the physical world [2][3]
- Emu3 uses a new visual-tokenizer technology to unify the representation of text, images, and videos, allowing AI to process them cohesively [3]
- Brainμ aims to serve as a new engine for neuroscience research and clinical applications, integrating over one million units of neural signal data [3]
- RoboOS 2.0 improves performance by 30% over its predecessor, enabling faster integration of developer plugins and better real-time response [3]
- OpenComplex2 targets the life sciences, simulating molecular motion at atomic resolution and potentially accelerating drug development and biological research [3]

Strategic Partnerships and Goals
- Zhiyuan has signed a strategic cooperation agreement with Hong Kong Investment Management Company to foster collaboration in talent, technology, and capital [6]
- The organization is committed to open source and international collaboration, having already open-sourced 200 models with a total of 640 million downloads [7]
- Wang Zhongyuan highlighted the importance of patience and sustained capital investment for long-term goals, despite short-term commercialization challenges [5][6]