Emu3
Dialogue with Zhiyuan's Wang Zhongyuan: a robot's "big brain" and "small brain" may eventually merge, but not today
AI前线· 2025-06-11 08:39
Core Insights
- The article discusses the launch of the "Wujie" series of large models by Zhiyuan Research Institute, focusing on advances in multimodal AI technology and its applications in physical AGI [1][2][3]

Group 1: New Model Launch
- The "Wujie" series includes several models, such as Emu3, Brainμ, RoboOS 2.0, RoboBrain 2.0, and OpenComplex2, aimed at enhancing AI's understanding of and interaction with the physical world [1][2]
- Emu3 is designed as a native multimodal architecture that enables large models to comprehend and reason about the world; it was first released in October 2024 [3][4]

Group 2: Technological Advancements
- Brainμ, built on Emu3, integrates various brain signals to perform multiple neuroscience tasks, demonstrating significant performance gains over existing models [4][5]
- RoboOS 2.0 is the first open-source framework for embodied intelligence, allowing seamless integration of skills from various robot models, with a 30% performance improvement over its predecessor [6][7]

Group 3: Applications and Collaborations
- Brainμ has potential applications in brain-computer interfaces, having successfully reconstructed sensory signals using portable EEG systems [5]
- The OpenComplex2 model represents a breakthrough in dynamic conformational modeling of biological molecules, enhancing the understanding of molecular interactions at atomic resolution [11][12]

Group 4: Future Directions
- The article emphasizes the ongoing evolution of large model technology, with a focus on bridging the gap between the digital and physical worlds, which is crucial for achieving physical AGI [2][3]
- RoboBrain 2.0 has improved task planning and spatial reasoning capabilities, achieving a 74% increase in task planning accuracy compared with its predecessor [8][9]
Focusing on multimodality: with the ChatGPT moment yet to arrive, have large models "slowed down" in 2025?
Bei Jing Shang Bao· 2025-06-08 13:27
Core Insights
- The emergence of multimodal models such as Emu3 signals a shift in content generation, with the potential to understand and generate text, images, and videos within a single model [1][3]
- The rapid development of AI has produced a competitive landscape where new and existing products coexist, but the core capabilities of video generation still lag behind expectations [1][5]
- The commercial application of large models faces challenges, particularly in integrating visual generation with existing models, which limits scalability and effectiveness [7][8]

Multi-Modal Model Development
- Emu3, released by Zhiyuan Research Institute, is a native multimodal model that incorporates various data types from the beginning of its training process, unlike traditional models that focus on language first [3][4]
- The current learning path for multimodal models often leads to a decline in performance as they transition from strong language capabilities to integrating other modalities [3][4]
- The development of multimodal models is still in its early stages, with significant technical challenges remaining, particularly in filtering effective information from diverse data types [3][4]

Video Generation Challenges
- Video generation technology is at a transitional phase comparable to the evolution from GPT-2 to GPT-3, indicating substantial room for improvement [5][6]
- Key issues in video generation include narrative coherence, stability, and controllability, all of which are essential for producing high-quality content [6]
- The industry is awaiting a breakthrough moment akin to the "ChatGPT moment" to lift video generation capabilities [6]

Commercialization and Market Growth
- The multimodal AI market is projected to reach $2.4 billion in 2024, with a compound annual growth rate (CAGR) exceeding 28%, and is expected to grow to $128 billion by 2025, reflecting a CAGR of 62.3% from 2023 to 2025 [8]
- The integration of traditional computer vision models with large models is seen as a potential pathway to commercial applications, contingent on achieving a favorable cost-benefit ratio [7][8]
- Companies are evolving their service models from providing platforms (PaaS) to offering tools (SaaS) and, by 2025, to delivering results directly to users [8]
Dialogue with Zhiyuan Research Institute President Wang Zhongyuan: AI is accelerating from the digital world into the physical world
Core Insights
- The rapid advancement of AI technology is shifting from digital to physical applications, with a focus on humanoid robots as practical tools rather than mere mascots [1][2]
- The development trajectory of large models is moving toward multimodal world models, which aim to enhance AI's understanding of and interaction with the physical world [2][3]

AI Technology Development
- The performance of large language models is reaching a bottleneck, necessitating improvements through reinforcement learning, high-quality synthetic data, and activation of underutilized multimodal data [1][2]
- The introduction of the "Wujie" series of large models, including the Emu3 multimodal world model, signals a strategic shift toward understanding physical causal relationships [2][3]

Embodied Intelligence
- Humanoid robots are valued over the long term because their design is compatible with human environments and extensive human behavior data is available for model training [3][4]
- Current limitations in data volume hinder the training of models that integrate both "big brain" and "small brain" functions, indicating a need for further development [4][6]

Industry Trends
- Work on embodied intelligence is expected to prioritize applications in controlled environments, such as logistics and repetitive tasks, where safety and efficiency are paramount [3][4]
- The integration of the "big brain" and "small brain" is acknowledged as a potential future trend, but current data limitations prevent immediate implementation [4][5]

AGI Development
- The emergence of Agents signals a new phase in which foundational models can support the development of varied applications, akin to mobile apps in the internet era [5][6]
- The industry is still in the early stages of embodied intelligence development, facing challenges similar to those encountered in the early days of AI large models [5][6]
From pre-training to world models: Zhiyuan reshapes AI's evolutionary path through embodied intelligence
Di Yi Cai Jing· 2025-06-07 12:41
Group 1
- The core viewpoint of the articles emphasizes the rapid development of AI and its transition from the digital world to the physical world, highlighting the importance of world models in this evolution [1][3][4]
- The 2023 Zhiyuan Conference marked a shift in focus from large language models to the cultivation of world models, indicating a new phase in AI development [1][3]
- The introduction of the "Wujie" series of large models by Zhiyuan represents a strategic move toward integrating AI with physical reality, showcasing advances in multimodal capabilities [3][4]

Group 2
- The Emu3 model is a significant upgrade in multimodal technology, simplifying the handling of various data types and clarifying the path toward AGI (artificial general intelligence) [4][5]
- The development of large models is still ongoing, with potential breakthroughs expected from reinforcement learning, data synthesis, and the use of multimodal data [5][6]
- Current challenges in embodied intelligence include a paradox in which limited capabilities hinder data collection, which in turn restricts model performance [6][8]

Group 3
- The industry faces issues such as poor scene generalization and task adaptability in robots, which limit their operational flexibility [9][10]
- Control technologies such as Model Predictive Control (MPC) have advantages but also limitations, for example being suitable only for structured environments [10]
- The development of embodied large models is still in its early stages, with a lack of consensus on technical routes and a need for collaborative efforts to address foundational challenges [10]
Zhiyuan Research Institute releases the "Wujie" series of large models: letting AI see and understand the physical world
Jing Ji Guan Cha Wang· 2025-06-07 02:55
Core Insights
- The Beijing Zhiyuan Conference showcased the latest developments in AI, including the release of the "Wujie" series of models by the Zhiyuan Research Institute, which aims to advance AI's understanding of the physical world [2][4]
- Zhiyuan's director, Wang Zhongyuan, emphasized that the next phase of AI development requires moving beyond language models to multimodal world models that can perceive and interact with the physical environment [4][5]

Model Releases
- The "Wujie" series includes Emu3, Brainμ, RoboOS 2.0, RoboBrain 2.0, and OpenComplex2, each designed to enhance AI's capabilities in understanding and interacting with the physical world [2][3]
- Emu3 uses a new visual tokenizer to unify the representation of text, images, and videos, allowing the model to process them in a single, cohesive token space (a code sketch of this idea follows this summary) [3]
- Brainμ aims to serve as a new engine for neuroscience research and clinical applications, integrating more than one million units of neural signal data [3]
- RoboOS 2.0 improves performance by 30% over its predecessor, enabling faster integration of developer plugins and enhancing real-time response [3]
- OpenComplex2 targets the life sciences by simulating molecular motion at atomic resolution, potentially accelerating drug development and biological research [3]

Strategic Partnerships and Goals
- Zhiyuan has signed a strategic cooperation agreement with Hong Kong Investment Management Company to foster collaboration on talent, technology, and capital [6]
- The institute is committed to open source and international collaboration, having already open-sourced 200 models with a total of 640 million downloads [7]
- Wang Zhongyuan highlighted the need for patience and sustained capital investment toward long-term goals, despite short-term commercialization challenges [5][6]
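To make the "unified representation via a visual tokenizer" idea above concrete, here is a minimal, hypothetical sketch of a VQ-style tokenizer that maps image patches to discrete codebook indices, which can then share one token space with ordinary text tokens. The class name, dimensions, patch size, and codebook size are illustrative assumptions, not details of the released Emu3 tokenizer.

```python
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    """Toy VQ-style tokenizer: image patches -> discrete codebook indices."""

    def __init__(self, codebook_size=4096, dim=256, patch=16):
        super().__init__()
        # Stand-in patch encoder; a real tokenizer would use a conv/ViT backbone.
        self.encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, images):                               # (B, 3, H, W)
        feats = self.encoder(images)                         # (B, dim, H/16, W/16)
        feats = feats.flatten(2).transpose(1, 2)             # (B, N, dim), one row per patch
        # Assign each patch the index of its nearest codebook entry.
        dists = torch.cdist(feats, self.codebook.weight[None])  # (B, N, codebook_size)
        return dists.argmin(dim=-1)                          # (B, N) discrete visual tokens

tokenizer = ToyVisualTokenizer()
visual_ids = tokenizer(torch.randn(1, 3, 256, 256))          # 256 tokens for a 256x256 image
# Offsetting these ids past the text vocabulary lets one transformer consume
# text, image, and video-frame tokens as a single sequence.
print(visual_ids.shape)                                      # torch.Size([1, 256])
```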
Zhiyuan releases the "Wujie" series of large models, including Emu3, the world's first native multimodal world model
Feng Huang Wang· 2025-06-06 14:32
Core Insights
- The Zhiyuan Research Institute launched the "Wujie" series of large models, including Emu3, Brainμ, RoboOS 2.0, RoboBrain 2.0, and OpenComplex2, at the 2025 Beijing Zhiyuan Conference [1]

Group 1: Emu3 and Brainμ Models
- Emu3 is a native multimodal world model that uses a next-token prediction paradigm for unified multimodal learning, encoding images and videos into discrete symbol sequences (a training-objective sketch follows this summary) [2]
- Brainμ, built on the Emu3 architecture, integrates brain signals as a new modality, enabling a single model to perform a range of neuroscience tasks and potentially becoming the "AlphaFold" of brain science [2][3]

Group 2: RoboOS 2.0 and RoboBrain 2.0
- RoboOS 2.0 is the world's first open-source embodied-intelligence SaaS platform framework, significantly reducing development barriers and improving performance by 30% compared with its predecessor [4]
- RoboBrain 2.0 enhances multi-agent task planning, achieving a 74% improvement in task planning accuracy over RoboBrain 1.0 [5]

Group 3: OpenComplex2 Model
- OpenComplex2 represents a breakthrough in modeling biological molecules, capturing molecular interactions at atomic resolution and providing insight into the relationship between microscopic fluctuations and macroscopic biological function [6][7]

Group 4: Open Source Initiatives
- Zhiyuan has open-sourced approximately 200 models and 160 datasets, and the FlagOS software stack has been upgraded to support a range of AI hardware, improving performance by up to 23% [8]

Group 5: Applications and Collaborations
- Brainμ has shown potential in consumer-grade brain-computer interface applications and is being developed with leading neuroscience laboratories and companies to expand its industrial applications [3][11]
- The development of a digital twin heart and a drug safety evaluation platform demonstrates the application of advanced modeling techniques in pharmacology and personalized medicine [12]
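The next-token-prediction paradigm mentioned above can be illustrated with a minimal sketch: text tokens and discretized visual tokens are merged into one sequence and trained with a standard autoregressive cross-entropy loss. The vocabulary sizes and the stand-in model below are assumptions for illustration only and do not reflect Emu3's actual configuration.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only; Emu3's real vocabularies are not assumed here.
TEXT_VOCAB, VISUAL_VOCAB = 50_000, 32_768
UNIFIED_VOCAB = TEXT_VOCAB + VISUAL_VOCAB              # one shared token space

def build_sequence(text_ids, visual_ids):
    """Concatenate text tokens with visual tokens shifted past the text vocab."""
    return torch.cat([text_ids, visual_ids + TEXT_VOCAB], dim=-1)

def next_token_loss(model, seq):
    """Standard autoregressive objective over the unified sequence."""
    logits = model(seq[:, :-1])                        # predict token t+1 from tokens <= t
    return F.cross_entropy(logits.reshape(-1, UNIFIED_VOCAB),
                           seq[:, 1:].reshape(-1))

# Stand-in for a decoder-only transformer, just to make the sketch runnable.
model = torch.nn.Sequential(torch.nn.Embedding(UNIFIED_VOCAB, 64),
                            torch.nn.Linear(64, UNIFIED_VOCAB))
seq = build_sequence(torch.randint(0, TEXT_VOCAB, (1, 16)),
                     torch.randint(0, VISUAL_VOCAB, (1, 256)))
print(next_token_loss(model, seq))                     # scalar loss tensor
```

Because every modality reduces to tokens in the same sequence, one objective covers both understanding and generation; sampling visual tokens and decoding them back through the tokenizer yields images or video frames.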
The Beijing Zhiyuan Conference opens in Beijing; Zhiyuan's "Wujie" series of large models released
Group 1
- The Beijing Zhiyuan Conference showcased cutting-edge AI achievements, gathering hundreds of global young scientists, top scholars, and industry experts to outline the future of the AI industry [1]
- AI is rapidly transitioning from the digital world to the physical world, with the release of the native multimodal world model Emu3, which enhances understanding and reasoning in physical contexts [3][4]
- A native multimodal model integrates various data types from the beginning of training, allowing a more comprehensive understanding of the world, unlike traditional models that may lose capabilities when learning additional modalities [4]

Group 2
- Beijing has more than 2,400 core AI enterprises, contributing to a core industry scale of nearly 350 billion yuan, about half of the national total [5][9]
- The conference featured advanced humanoid robots demonstrating their capabilities, with companies such as Galaxy General planning to open 100 unmanned pharmacies in major cities [6][8]
- Discussions at the conference covered topics including multimodal AI, deep reasoning, and future paths for AI, emphasizing the need for global cooperation and safety measures amid rapid AI advances [10][13]
Zhiyuan Research Institute releases the "Wujie" series of large models, pushing AI toward the physical world
Xin Jing Bao· 2025-06-06 10:43
Core Insights
- The Beijing Zhiyuan Conference, held on June 6, showcased the launch of the "Wujie" series of large models by the Zhiyuan Research Institute, marking a significant step in moving artificial intelligence from the digital realm toward the physical world [1][2]

Group 1: Development of Large Models
- Zhiyuan Research Institute director Wang Zhongyuan emphasized that large model technology is far from its peak, with performance and capabilities still advancing [2][3]
- The transition from large language models to native multimodal world models is under way, aiming to enhance AI's perception of and interaction with the physical world [2][3]

Group 2: Multimodal Models and Applications
- The "Wujie" series includes models such as Emu3, Brainμ, RoboOS 2.0, and RoboBrain 2.0, designed to integrate various data modalities and enhance capabilities in fields such as neuroscience and robotics [4][5][6]
- Brainμ has shown stronger predictive capability for conditions such as depression and Alzheimer's disease than specialized models, integrating large-scale multimodal data for a range of applications [5][6]

Group 3: Advancements in Robotics
- RoboBrain 2.0 achieves a 74% improvement in task planning accuracy over its predecessor, with overall performance up 30% and reduced response times [7][8]
- The newly released RoboOS 2.0 framework allows seamless integration of robotic systems, cutting deployment time from days to hours [8]

Group 4: Breakthroughs in Biomedicine
- The OpenComplex2 model represents a breakthrough in dynamic modeling of biological molecules, which could significantly shorten drug development cycles and raise the quality of innovation in biomedicine [9]
- A high-speed, cross-scale cardiac drug safety evaluation platform aims to speed up the assessment of drug toxicity, reducing evaluation time from 90 days to less than one day [9]
[Zhiyuan releases the "Wujie" series of large models] On June 6, the 7th Beijing Zhiyuan Conference opened in Beijing. At the conference, the Zhiyuan Research Institute unveiled the "Wujie" series of large models, comprising the native multimodal world model Emu3; the brain-science multimodal general foundation model Jianwei Brainμ; the cross-embodiment collaboration framework for the embodied "big brain" and "small brain", RoboOS 2.0, together with the embodied brain RoboBrain 2.0; and the full-atom microscopic life model OpenComplex2.
news flash· 2025-06-06 06:00
Core Insights
- The "Wujie" series of large models was launched by the Zhiyuan Research Institute during the 7th Beijing Zhiyuan Conference held on June 6 [1]

Group 1: Model Introductions
- The series includes the native multimodal world model Emu3 [1]
- It features the brain science multimodal general foundation model Jianwei Brainμ [1]
- The cross-embodiment collaboration framework for the embodied "big brain" and "small brain", RoboOS 2.0, and the embodied brain RoboBrain 2.0 are also part of the series [1]
- Additionally, the full-atom microscopic life model OpenComplex2 was introduced [1]
ETT: breaking the visual bottleneck of native multimodal learning and reshaping the visual tokenizer optimization paradigm
机器之心· 2025-05-27 06:38
Core Viewpoint
- The article introduces ETT (End-to-End Vision Tokenizer Tuning), a novel method that jointly optimizes visual tokenization and downstream tasks, addressing the limitations of traditional visual tokenization pipelines [2][4]

Group 1: Limitations of Traditional Methods
- Traditional visual tokenization methods suffer from a critical flaw: the optimization of the visual tokenizer is decoupled from the training of downstream tasks, leading to suboptimal performance on tasks requiring rich semantic representations [1][5]
- Existing multimodal pre-training frameworks, such as Emu3, use frozen visual tokenizers, wasting their rich feature-representation capacity and precluding end-to-end training [6][10]

Group 2: ETT Innovations
- ETT innovatively combines visual tokenization with the target autoregressive task for joint optimization, allowing the visual tokenizer to adapt based on feedback from downstream tasks [4][10]
- The ETT architecture is based on an improved IBQ framework, with a codebook size of 131,072 and a feature dimension of 256, improving the efficiency of the visual tokenizer [10]

Group 3: Training Strategy
- ETT employs a staged training strategy, starting with an alignment learning phase in which only the visual projection layer is trained while the large language model and visual tokenizer parameters stay frozen [11]
- In the semantic learning phase, all model weights are unfrozen for end-to-end training, allowing the visual tokenizer to enhance its perceptual capabilities while preserving image reconstruction (a minimal freeze/unfreeze sketch follows this summary) [11]

Group 4: Performance Metrics
- ETT demonstrates strong performance on multimodal understanding tasks, achieving competitive results on benchmarks such as GQA and MMBench with fewer model parameters and less training data than state-of-the-art vision-language models [12][13]
- On multimodal generation tasks, ETT matches the performance of advanced diffusion and autoregressive models while being more efficient in model parameters and training data [14][15]

Group 5: Qualitative Results
- ETT generates diverse, detailed visual content that adheres closely to text prompts, producing high-quality images across a range of artistic styles and themes [16]

Group 6: Visual Reconstruction
- ETT significantly improves visual reconstruction, preserving low-level detail while strengthening high-level semantic representation, thereby providing better visual representations for multimodal tasks [17]

Group 7: Future Directions
- Future research will focus on scaling data and model capacity for ETT, exploring end-to-end training of visual tokenizers from scratch, and extending the approach to other modalities such as video and audio [19]

Group 8: Conclusion
- ETT represents a breakthrough in native multimodal learning, offering a simple yet effective way to optimize visual tokenizers, enhancing the performance of multimodal models and paving the way for broader applications [25]
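The staged training strategy above amounts to toggling which parameter groups receive gradients in each phase. The sketch below is a generic PyTorch rendering of that schedule under assumed module names and placeholder learning rates; it is not the authors' released training code.

```python
import torch

def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def configure_phase(llm, vision_tokenizer, vision_projector, phase):
    """Select trainable parameter groups for an ETT-style training phase."""
    if phase == "alignment":
        # Phase 1: only the visual projection layer learns; LLM and tokenizer stay frozen.
        set_requires_grad(llm, False)
        set_requires_grad(vision_tokenizer, False)
        set_requires_grad(vision_projector, True)
    elif phase == "semantic":
        # Phase 2: unfreeze everything so the tokenizer adapts to downstream feedback.
        for m in (llm, vision_tokenizer, vision_projector):
            set_requires_grad(m, True)
    else:
        raise ValueError(f"unknown phase: {phase}")
    trainable = [p for m in (llm, vision_tokenizer, vision_projector)
                 for p in m.parameters() if p.requires_grad]
    # Learning rates here are placeholders, not the paper's values.
    return torch.optim.AdamW(trainable, lr=1e-4 if phase == "alignment" else 1e-5)

# Stand-in modules to show usage; swap in the real LLM, tokenizer, and projector.
llm, tok, proj = (torch.nn.Linear(256, 256) for _ in range(3))
optimizer = configure_phase(llm, tok, proj, "alignment")
```

Running the alignment phase first and then rebuilding the optimizer for the semantic phase keeps the reconstruction-capable tokenizer intact early on, while still allowing full end-to-end adaptation later.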