Emu3
A Large Model Led by a Chinese Research Institution Appears in Nature for the First Time
Guan Cha Zhe Wang· 2026-02-07 01:15
[By Observer Net (观察者网) columnist 心智观察所] A few days ago, Nature published an artificial intelligence research paper from China. That in itself is nothing new for a top academic journal, but this paper carries unusual weight: it comes from the Beijing Academy of Artificial Intelligence (BAAI, 北京智源人工智能研究院), its core result is a multimodal large model named "Emu3", and the question it tries to answer is the central open problem that has hung over the entire AI field for the past five years: can a single, unified approach teach machines to see, hear, speak, write, and even act? The question sounds simple, but it is complex enough to keep the world's top AI labs arguing. Understanding why this choice matters requires some background.
[Paper header: https://doi.org/10.1038/s41586-025-10041-x | Received: 11 November 2024 | Authors: Xinlong Wang, Yufeng Cui, Jinsheng Wang, Fan Zhang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Zhen Li, Yingli Zhao, Yulong Ao, Xuebin Mi ...]
Exclusive Interview with Wang Zhongyuan: Behind Zhiyuan's Multimodal Large Model in Nature Is a Group of Young People
Xin Jing Bao· 2026-02-03 14:17
Core Insights
- The Emu3 multimodal model developed by the Beijing Academy of Artificial Intelligence has been published in the prestigious journal Nature, marking a significant achievement for China's research institutions in the field of AI [1][2].

Group 1: Emu3 Model Overview
- Emu3 represents a unified architecture that simplifies the understanding and generation of various types of information, including text, images, and videos, by using a single model based on the principle of "predicting the next token" [3][4].
- The model's design allows for significant scalability and lower research and development barriers, enabling more researchers and institutions to engage in cutting-edge exploration [3][4].

Group 2: Technological Advancements
- Emu3.5, the subsequent version, has been trained on over 10 trillion tokens, with video training duration increased from 15 years to 790 years, and the parameter count rising from 8 billion to 34 billion [6].
- This version demonstrates the ability to simulate physical-world dynamics, marking a transition from "predicting the next word or frame" to "predicting the next state", which is crucial for achieving more general intelligence [6].

Group 3: Team and Innovation
- The Emu3 development team is notably young, with the lead developer being only 29 years old, reflecting the institute's philosophy of empowering youth in AI innovation [7][8].
- The team faced significant technical challenges and skepticism from the industry but ultimately succeeded in proving the viability of their innovative approach to multimodal AI [8].

Group 4: Future Applications
- Emu3 is positioned as a foundational model for advancing AI from the digital realm to the physical world, enabling applications in robotics and autonomous driving by providing a robust understanding of complex environments [5][10].
- The model is expected to give rise to a new generation of native multimodal assistants capable of creating images and videos based on contextual prompts, enhancing human-computer interaction [5].

Group 5: Talent Development and Institutional Support
- The Beijing Academy of Artificial Intelligence emphasizes talent based on impactful work rather than credentials, fostering a dynamic environment for young researchers [9][10].
- The institute operates under a flexible funding model that allows researchers to focus on valuable scientific work without the pressures of traditional corporate structures [9].
Zhiyuan's Multimodal Large Model Emu3 Debuts in Nature
Ke Ji Ri Bao· 2026-02-02 05:23
On January 28, "Emu3", the multimodal large model developed under the leadership of the Beijing Academy of Artificial Intelligence (BAAI), was officially published online in the main edition of the top international journal Nature (the print edition is expected on February 12). This is the first time a large model result led by a Chinese research institution has appeared in that journal, marking a major breakthrough for China in original AI innovation.
Previously, large language models achieved major breakthroughs via the autoregressive route of "next-token prediction (NTP)", while multimodal models still relied on specialized routes such as contrastive learning and diffusion models; whether autoregression could become a general-purpose route for multimodality had remained an open question in the industry. The Emu3 model proposed by the BAAI team discretizes text, images, and video into a single shared representation space and jointly trains a single Transformer architecture from scratch, unifying multimodal generation and perception purely through next-token prediction.
Experiments show that Emu3 matches diffusion models on text-to-image generation, rivals CLIP-plus-LLM composite approaches on vision-language understanding, and can generate high-fidelity video in a purely autoregressive manner, supporting diverse tasks such as video extension, interleaved image-text generation, and robot manipulation modeling. Nature's editors commented that the work is significant for building scalable, unified multimodal intelligence systems.
Notably, the team validated scaling laws for multimodal learning through large-scale ablation experiments and confirmed that direct preference optimization (DPO) can be adapted seamlessly to autoregressive visual generation. The follow-up iteration, Emu3.5, further achieves a leap toward "predicting the next state" ...
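To make the "single Transformer, single objective" idea concrete, here is a minimal sketch of next-token prediction over an interleaved multimodal token sequence, assuming text and discretized image/video tokens already share one vocabulary. The model, vocabulary size, and tensor shapes are illustrative placeholders, not the actual Emu3 implementation.

```python
# Minimal sketch of unified next-token prediction over a mixed text/image/video
# token stream. Illustrative only: vocabulary layout and model sizes are
# assumptions, not Emu3's actual code.
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    def __init__(self, vocab_size=65536, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) IDs from one shared vocabulary in which
        # text tokens and discretized image/video tokens are interleaved.
        seq_len = tokens.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.blocks(self.embed(tokens), mask=causal_mask)
        return self.lm_head(h)  # logits over the shared vocabulary

# One training step: the objective is identical for every modality --
# predict token t+1 given tokens 1..t.
model = TinyMultimodalLM()
tokens = torch.randint(0, 65536, (2, 128))   # stand-in interleaved sequence
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)
loss.backward()
```

The point of the sketch is that nothing in the loss distinguishes modalities: generation, perception, and even robot-manipulation sequences reduce to the same cross-entropy over the next token, which is what lets the approach scale along the same axes as language models.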
Published in Nature! Zhiyuan Research Institute Launches an AI All-Rounder, Emu3, Unifying Multimodal Learning
生物世界· 2026-01-31 03:05
Written by Wang Cong; edited by Wang Duoyu; layout by Shui Chengwen. Can an AI model, like a human, understand text, images, video, and even actions at the same time? In the past, the AI field needed different models for different tasks, for example diffusion models for image generation and composite architectures for vision-language understanding. Now, the Beijing Academy of Artificial Intelligence has released a multimodal large model, Emu3, that may change this situation. The study, titled "Multimodal learning with next-token prediction for large multimodal models", was published online in Nature on January 28, 2026, with Huang Tiejun, Wang Zhongyuan, and Wang Xinlong of the Beijing Academy of Artificial Intelligence as co-corresponding authors. Reportedly, this is also the first time a large model result led by a Chinese research institution has been published in the main edition of Nature. Based solely on next-token prediction (NTP), Emu3 unifies large-scale multimodal learning across text, images, and video; it not only matches specialized models on generation and understanding tasks but also demonstrates strong capabilities such as video generation and robot manipulation. This result is significant for building scalable, unified multimodal intelligence sys ...
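As a complement to the training sketch above, the snippet below illustrates one way such an interleaved training sample might be assembled. The "begin/end of image" marker tokens, ID ranges, and helper function are hypothetical, introduced only for illustration, and are not Emu3's actual data format.

```python
# Illustrative sketch of flattening a mixed text/image sample into one token
# sequence for next-token prediction. Special tokens and ID ranges are assumed.
BOI, EOI = 65000, 65001   # assumed "begin/end of image" marker IDs

def build_interleaved_sequence(text_ids, image_ids):
    """Concatenate text tokens and discretized image tokens into one stream.

    text_ids:  list[int] from a text tokenizer (assumed range 0..49999)
    image_ids: list[int] from a visual tokenizer (assumed range 50000..64999)
    """
    return text_ids + [BOI] + image_ids + [EOI]

# Example: a caption followed by its image, trained with the same NTP loss as
# plain text -- no diffusion or contrastive objective involved.
sequence = build_interleaved_sequence([12, 873, 44], [50001, 50733, 51210])
```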
Is Architectural Decoupling Necessary for Unified Multimodal Models? The New AIA Loss Says No
机器之心· 2025-12-02 05:07
Core Insights
- The rapid development of unified understanding and generation models has faced challenges due to conflicts between visual understanding and generation tasks [2]
- Researchers from CUHK MMLab and Meituan believe that the performance of unified models will eventually reach that of single-task models, but they question whether the current approach of decoupling architectures is truly beneficial [2][3]

Unified Model Intent
- The original intent of unified models is to enhance single-task performance through a transparent and rational process of interleaved text and image reasoning [3]
- Examples include generating corresponding images while navigating mazes or drawing auxiliary lines during mathematical problem-solving [3]

Architecture Decoupling Issues
- Models like BAGEL require complex processes to achieve interleaved reasoning, leading to significant computational overhead and potential information loss [3]
- Despite current performance gains, researchers warn that these issues may become more pronounced as research progresses [3]

AIA Introduction
- To explore the reasons behind performance improvements from architecture decoupling and to find ways to enhance model performance without it, CUHK MMLab and Meituan introduced AIA [5]

Research Findings
- Researchers found that regardless of how models are decoupled, understanding and generation tasks exhibit a negative correlation at the same network layer [8]
- This indicates that decoupling does not fundamentally resolve the conflicts between tasks [8]

AIA Loss Design
- AIA loss was designed to explicitly constrain the interaction patterns of unified models during training, using the cross-modal interaction patterns of single-task models as a learning target [10] (a hedged sketch of such a loss follows this summary)

AIA Effectiveness
- Experiments on Emu3 and Janus-Pro showed that AIA can enhance model performance without additional tricks, reducing the performance gap with more decoupled models [12]

AIA Training Sensitivity
- AIA loss demonstrated stable convergence across a wide range of weight adjustments during training, particularly for Emu3, which had weaker pre-training knowledge [17]
- In contrast, Janus-Pro's strong pre-training knowledge made it more sensitive to AIA loss adjustments [17]

AIA Advantages
- The introduction of AIA loss can mitigate common data-ratio issues, achieving better results with a 1:1 data ratio for generation and understanding tasks, indicating a collaborative optimization effect [19]

Unified Model Training Path
- The dynamic allocation of task weights during unified training may represent the correct behavior of unified models, suggesting that task conflicts could be a natural characteristic rather than a problem to avoid [21]
- Another approach involves removing task differentiation cues to force the model to learn a truly unified space, though this increases training difficulty [22]

Future Outlook
- AIA represents an initial step in analyzing the principles of unified model training, with a call for more researchers to explore this field [24]
- The theoretical and architectural aspects of unified models are still immature, necessitating collaborative exploration [24]
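The article does not reproduce the exact AIA formulation, but the description above (constraining the unified model's cross-modal interaction patterns toward those of single-task references) suggests a loss along the following lines. The attention-mass statistic, MSE penalty, and weighting below are assumptions made purely for illustration, not the published method.

```python
# Rough sketch of an attention-interaction alignment term in the spirit of the
# AIA loss described above. The exact published formulation is not given here;
# this is an assumed variant for illustration only.
import torch
import torch.nn.functional as F

def cross_modal_interaction(attn, text_idx, image_idx):
    """Average attention mass flowing from text queries to image keys.

    attn: (batch, heads, seq, seq) attention weights from one layer.
    text_idx / image_idx: index tensors selecting the two modality spans.
    """
    block = attn[:, :, text_idx][:, :, :, image_idx]   # (B, H, |text|, |image|)
    return block.mean(dim=(-1, -2))                     # (B, H) interaction summary

def aia_style_loss(unified_attn, reference_attn, text_idx, image_idx):
    """Pull the unified model's cross-modal interaction pattern toward a
    single-task reference pattern (e.g. recorded from an understanding-only model)."""
    u = cross_modal_interaction(unified_attn, text_idx, image_idx)
    r = cross_modal_interaction(reference_attn, text_idx, image_idx)
    return F.mse_loss(u, r.detach())

# Typical use (assumed): total_loss = ntp_loss + lambda_aia * aia_style_loss(...),
# with lambda_aia swept over a range, echoing the sensitivity findings above.
```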
In Conversation with Zhiyuan's Wang Zhongyuan: A Robot's "Big Brain" and "Small Brain" May Eventually Merge, but Not Today
AI前线· 2025-06-11 08:39
Core Insights
- The article discusses the launch of the "Wujie" series of large models by Zhiyuan Research Institute, focusing on advancements in multi-modal AI technology and its applications in physical AGI [1][2][3]

Group 1: New Model Launch
- The "Wujie" series includes several models such as Emu3, Brainμ, RoboOS2.0, RoboBrain2.0, and OpenComplex2, aimed at enhancing AI's understanding of and interaction with the physical world [1][2]
- Emu3 is designed as a native multi-modal architecture that enables large models to comprehend and reason about the world, set to be released in October 2024 [3][4]

Group 2: Technological Advancements
- Brainμ, based on Emu3, integrates various brain signals to perform multiple neuroscience tasks, demonstrating significant performance improvements over existing models [4][5]
- RoboOS2.0 is the first open-source framework for embodied intelligence, allowing seamless integration of skills from various robot models, with a 30% performance enhancement compared to its predecessor [6][7]

Group 3: Applications and Collaborations
- Brainμ has potential applications in brain-computer interfaces, having successfully reconstructed sensory signals using portable EEG systems [5]
- The OpenComplex2 model represents a breakthrough in dynamic conformational modeling of biological molecules, enhancing the understanding of molecular interactions at atomic resolution [11][12]

Group 4: Future Directions
- The article emphasizes the ongoing evolution of large model technology, with a focus on bridging the gap between digital and physical worlds, which is crucial for achieving physical AGI [2][3]
- RoboBrain2.0 has improved task planning and spatial reasoning capabilities, achieving a 74% increase in task planning accuracy compared to its predecessor [8][9]
Focus on Multimodality: The "ChatGPT Moment" Has Yet to Arrive; Did Large Models "Slow Down" in 2025?
Bei Jing Shang Bao· 2025-06-08 13:27
Core Insights
- The emergence of multi-modal models, such as Emu3, signifies a shift in content generation, with the potential to understand and generate text, images, and videos through a single model [1][3]
- The rapid development of AI has led to a competitive landscape where new and existing products coexist, but the core capabilities of video generation are still lagging behind expectations [1][5]
- The commercial application of large models faces challenges, particularly in integrating visual generation with existing models, which limits scalability and effectiveness [7][8]

Multi-Modal Model Development
- Emu3, released by Zhiyuan Research Institute, is a native multi-modal model that incorporates various data types from the beginning of its training process, unlike traditional models that focus on language first [3][4]
- The current learning path for multi-modal models often leads to a decline in performance as they transition from strong language capabilities to integrating other modalities [3][4]
- The development of multi-modal models is still in its early stages, with significant technical challenges remaining, particularly in filtering effective information from diverse data types [3][4]

Video Generation Challenges
- Video generation technology is currently at a transitional phase, comparable to the evolution from GPT-2 to GPT-3, indicating that there is substantial room for improvement [5][6]
- Key issues in video generation include narrative coherence, stability, and controllability, which are essential for producing high-quality content [6]
- The industry is awaiting a breakthrough moment akin to the "ChatGPT moment" to enhance video generation capabilities [6]

Commercialization and Market Growth
- The multi-modal AI market is projected to reach $2.4 billion in 2024, with a compound annual growth rate (CAGR) exceeding 28%, and is expected to grow to $128 billion by 2025, reflecting a CAGR of 62.3% from 2023 to 2025 [8]
- The integration of traditional computer vision models with large models is seen as a potential pathway for commercial applications, contingent on achieving a favorable cost-benefit ratio [7][8]
- Companies are evolving their service models from providing platforms (PaaS) to offering tools (SaaS) and ultimately delivering direct results to users by 2025 [8]
In Conversation with Zhiyuan Research Institute President Wang Zhongyuan: AI Is Accelerating from the Digital World toward the Physical World
Core Insights
- The rapid advancement of AI technology is shifting from digital to physical applications, with a focus on humanoid robots as practical tools rather than mere mascots [1][2]
- The development trajectory of large models is moving towards multi-modal world models, which aim to enhance AI's understanding of and interaction with the physical world [2][3]

AI Technology Development
- The performance of large language models is reaching a bottleneck, necessitating improvements through reinforcement learning, high-quality synthetic data, and activation of underutilized multi-modal data [1][2]
- The introduction of the "Wujie" series of large models, including the Emu3 multi-modal world model, signifies a strategic shift towards understanding physical causal relationships [2][3]

Embodied Intelligence
- Humanoid robots are recognized for their long-term value due to their design compatibility with human environments and the availability of extensive human behavior data for model training [3][4]
- The current limitations in data volume hinder the training of models that integrate both "big brain" and "small brain" functionalities, indicating a need for further development [4][6]

Industry Trends
- The focus on embodied intelligence is expected to prioritize applications in controlled environments, such as logistics and repetitive tasks, where safety and efficiency are paramount [3][4]
- The concept of "big brain" and "small brain" integration is acknowledged as a potential future trend, but current data limitations prevent immediate implementation [4][5]

AGI Development
- The emergence of Agents in AI signifies a new phase where foundational models can support the development of various applications, akin to mobile apps in the internet era [5][6]
- The industry is still in the early stages of embodied intelligence development, facing challenges similar to those encountered in the early days of AI large models [5][6]
From Pretraining to World Models: Zhiyuan Reshapes AI's Evolutionary Path through Embodied Intelligence
Di Yi Cai Jing· 2025-06-07 12:41
Group 1
- The core viewpoint of the articles emphasizes the rapid development of AI and its transition from the digital world to the physical world, highlighting the importance of world models in this evolution [1][3][4]
- The 2023 Zhiyuan Conference marked a shift in focus from large language models to the cultivation of world models, indicating a new phase in AI development [1][3]
- The introduction of the "Wujie" series of large models by Zhiyuan represents a strategic move towards integrating AI with physical reality, showcasing advancements in multi-modal capabilities [3][4]

Group 2
- The Emu3 model is a significant upgrade in multi-modal technology, simplifying the process of handling various data types and enhancing the path towards AGI (Artificial General Intelligence) [4][5]
- The development of large models is still ongoing, with potential breakthroughs expected from reinforcement learning, data synthesis, and the utilization of multi-modal data [5][6]
- The current challenges in embodied intelligence include a paradox where limited capabilities hinder data collection, which in turn restricts model performance [6][8]

Group 3
- The industry faces issues such as poor scene generalization and task adaptability in robots, which limits their operational flexibility [9][10]
- Control technologies like Model Predictive Control (MPC) have advantages but also limitations, such as being suitable only for structured environments [10]
- The development of embodied large models is still in its early stages, with a lack of consensus on technical routes and the need for collaborative efforts to address foundational challenges [10]
Zhiyuan Research Institute Releases the "Wujie" Series of Large Models: Letting AI See and Understand the Physical World
Jing Ji Guan Cha Wang· 2025-06-07 02:55
Core Insights
- The Beijing Zhiyuan Conference showcased the latest developments in AI, including the release of the "Wujie" series of models by the Zhiyuan Research Institute, which aims to advance AI's understanding of the physical world [2][4]
- The director of Zhiyuan, Wang Zhongyuan, emphasized that the next phase of AI development requires moving beyond language models to multi-modal world models that can perceive and interact with the physical environment [4][5]

Model Releases
- The "Wujie" series includes four models: Emu3, Brainμ, RoboOS 2.0, and RoboBrain 2.0, each designed to enhance AI's capabilities in understanding and interacting with the physical world [2][3]
- Emu3 utilizes a new visual tokenizer technology to unify the representation of text, images, and videos, allowing AI to process them in a cohesive manner [3] (a toy sketch of such a tokenizer follows this summary)
- Brainμ aims to serve as a new engine for neuroscience research and clinical applications, integrating over one million neural signal data units [3]
- RoboOS 2.0 improves performance by 30% compared to its predecessor, enabling faster integration of developer plugins and enhancing real-time response capabilities [3]
- OpenComplex2 targets life sciences by simulating molecular movements at atomic resolution, potentially accelerating drug development and biological research [3]

Strategic Partnerships and Goals
- Zhiyuan has signed a strategic cooperation agreement with Hong Kong Investment Management Company to foster talent, technology, and capital collaboration [6]
- The organization is committed to open source and international collaboration, having already open-sourced 200 models with a total of 640 million downloads [7]
- Wang Zhongyuan highlighted the importance of patience and sustained capital investment for long-term goals, despite short-term commercialization challenges [5][6]
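For readers unfamiliar with visual tokenizers, the toy sketch below shows the general VQ-style mechanism the passage alludes to: an encoder compresses an image into a grid of latents, each latent is snapped to its nearest codebook entry, and the entry indices become discrete "image tokens" that can live in the same sequence as text tokens. The architecture, resolutions, and codebook size are assumptions, not the actual Emu3 tokenizer.

```python
# Toy VQ-style visual tokenizer: pixels -> latent grid -> nearest codebook
# indices. Sizes and layers are illustrative assumptions only.
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    def __init__(self, codebook_size=16384, dim=64):
        super().__init__()
        # Downsample 256x256 RGB to a 16x16 grid of latent vectors.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=4),
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    @torch.no_grad()
    def encode(self, images):
        # images: (batch, 3, 256, 256) -> token IDs: (batch, 256)
        z = self.encoder(images)                 # (B, dim, 16, 16)
        z = z.flatten(2).transpose(1, 2)         # (B, 256, dim)
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        dists = torch.cdist(z, codes)            # distance to every codebook entry
        return dists.argmin(dim=-1)              # nearest-code indices per patch

tokenizer = ToyVisualTokenizer()
image_tokens = tokenizer.encode(torch.randn(1, 3, 256, 256))
# These IDs would then be interleaved with text tokens and trained with the
# same next-token-prediction objective described in the entries above.
```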