From Pretraining to World Models: Zhiyuan Reshapes the AI Evolution Path Through Embodied Intelligence
Di Yi Cai Jing· 2025-06-07 12:41
Group 1
- The core viewpoint of the articles emphasizes the rapid development of AI and its transition from the digital world to the physical world, highlighting the importance of world models in this evolution [1][3][4]
- The 2023 Zhiyuan Conference marked a shift in focus from large language models to the cultivation of world models, indicating a new phase in AI development [1][3]
- The introduction of the "Wujie" series of large models by Zhiyuan represents a strategic move towards integrating AI with physical reality, showcasing advancements in multi-modal capabilities [3][4]

Group 2
- The Emu3 model is a significant upgrade in multi-modal technology, simplifying the process of handling various data types and enhancing the path towards AGI (Artificial General Intelligence) [4][5]
- The development of large models is still ongoing, with potential breakthroughs expected from reinforcement learning, data synthesis, and the utilization of multi-modal data [5][6]
- The current challenges in embodied intelligence include a paradox where limited capabilities hinder data collection, which in turn restricts model performance [6][8]

Group 3
- The industry faces issues such as poor scene generalization and task adaptability in robots, which limits their operational flexibility [9][10]
- Control technologies like Model Predictive Control (MPC) have advantages but also limitations, such as being suitable only for structured environments [10]
- The development of embodied large models is still in its early stages, with a lack of consensus on technical routes and the need for collaborative efforts to address foundational challenges [10]
Deep Learning and Reinforcement Learning Titans Gather at the 2025 Beijing Zhiyuan Conference; Zhiyuan Releases the "Wujie" Series of Large Models
机器人圈· 2025-06-07 04:02
On June 6, 2025, the 7th Beijing Zhiyuan Conference opened at the Zhongguancun Exhibition Center. The Beijing Zhiyuan Conference is an "insiders' AI academic gathering" hosted by the Zhiyuan Research Institute, characterized by a global outlook, the collision of ideas, and frontier leadership, bringing together researchers from home and abroad to share results, explore frontier knowledge, and exchange practical experience. The 2025 Beijing Zhiyuan Conference invited Turing Award laureate and deep learning pioneer Yoshua Bengio; Turing Award laureate and father of reinforcement learning Richard S. Sutton; Turing Award laureates Joseph Sifakis and Andrew Chi-Chih Yao; representatives of international institutions and technical teams including Google, DeepMind, Meta, Mila, Physical Intelligence, MIT, Stanford, UC Berkeley, and the Linux Foundation; major internet companies such as Huawei, Baidu, ByteDance, Tencent, and Alibaba; and more than 30 founders and CEOs of AI companies including Zhipu, Unitree, Shengshu Technology, and ModelBest. The conference also gathered over 100 young scientists and more than 200 top AI scholars and industry experts from around the world, around topics including multimodality, deep reasoning, next-generation AI paths, AI agents, embodied intelligence, AI4S, the AI industry, AI safety, and AI open source ...
Zhiyuan Releases the "Wujie" Series of Large Models, Including Emu3, the World's First Native Multimodal World Model
Feng Huang Wang· 2025-06-06 14:32
Core Insights
- The Zhiyuan Research Institute launched the "Wujie" series of large models, including Emu3, Brainμ, RoboOS 2.0, RoboBrain 2.0, and OpenComplex2, at the 2025 Beijing Zhiyuan Conference [1]

Group 1: Emu3 and Brainμ Models
- Emu3 is a native multimodal world model that utilizes a next-token prediction paradigm for unified multimodal learning, allowing images and videos to be encoded into discrete symbol sequences [2]
- Brainμ, built on the Emu3 architecture, integrates brain signals as a new modality, enabling a single model to perform various neuroscience tasks, potentially becoming the "AlphaFold" of brain science [2][3]

Group 2: RoboOS 2.0 and RoboBrain 2.0
- RoboOS 2.0 is the world's first open-source framework for embodied intelligence SaaS platforms, significantly reducing development barriers and improving performance by 30% compared to its predecessor [4]
- RoboBrain 2.0 enhances multi-agent task planning capabilities, achieving a 74% improvement in task planning accuracy over RoboBrain 1.0 [5]

Group 3: OpenComplex2 Model
- OpenComplex2 represents a breakthrough in modeling biological molecules, capturing molecular interactions at atomic resolution and providing insights into the relationship between microscopic fluctuations and macroscopic biological functions [6][7]

Group 4: Open Source Initiatives
- Zhiyuan has open-sourced approximately 200 models and 160 datasets, with the FlagOS software stack upgraded to support various AI hardware and improve performance by up to 23% [8]

Group 5: Applications and Collaborations
- The Brainμ model has shown potential in consumer-grade brain-computer interface applications, collaborating with leading neuroscience laboratories and companies to expand its industrial applications [3][11]
- The development of a digital twin heart and a drug safety evaluation platform demonstrates the application of advanced modeling techniques in pharmacology and personalized medicine [12]
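The "next-token prediction over discrete symbol sequences" idea attributed to Emu3 above can be sketched in a few lines: continuous image-patch features are quantized against a codebook (here a random stand-in for a learned one), and the resulting indices are offset past the text vocabulary so that text and vision share a single token stream. Codebook size, feature dimensions, and the offset scheme are illustrative assumptions, not Emu3's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK = rng.normal(size=(8, 4))   # 8 visual codes, 4-dim patch features (toy)
TEXT_VOCAB = 100                     # visual ids are offset past the text ids

def quantize_patches(patches):
    """Map each patch feature vector to its nearest codebook index."""
    d = ((patches[:, None, :] - CODEBOOK[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def to_token_stream(text_ids, patches):
    """Interleave text tokens with offset visual tokens in one sequence,
    so a single next-token predictor can model both modalities."""
    visual_ids = quantize_patches(patches) + TEXT_VOCAB
    return list(text_ids) + list(visual_ids)

patches = rng.normal(size=(3, 4))          # three toy image patches
stream = to_token_stream([5, 17], patches)
print(len(stream))                         # 2 text tokens + 3 visual tokens = 5
```

The same stream can then be fed to any autoregressive language model; decoding a visual id back to pixels would require the codebook's decoder, which this sketch omits.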
Zhiyuan Research Institute Releases the "Wujie" Series of Large Models, Pushing AI Toward the Physical World
Xin Jing Bao· 2025-06-06 10:43
Core Insights
- The Beijing Zhiyuan Conference, held on June 6, showcased the launch of the "Wujie" series of large models by the Zhiyuan Research Institute, marking a significant step in advancing artificial intelligence from the digital realm to the physical world [1][2]

Group 1: Development of Large Models
- The director of the Zhiyuan Research Institute, Wang Zhongyuan, emphasized that the development of large model technology is far from reaching its peak, with ongoing advancements in performance and capabilities [2][3]
- The transition from large language models to native multimodal world models is underway, aiming to enhance AI's perception of and interaction with the physical world [2][3]

Group 2: Multimodal Models and Applications
- The "Wujie" series includes several models such as Emu3, Brainμ, RoboOS 2.0, and RoboBrain 2.0, which are designed to integrate various data modalities and enhance capabilities in fields like neuroscience and robotics [4][5][6]
- Brainμ has shown superior predictive capabilities for conditions like depression and Alzheimer's compared to specialized models, integrating large-scale multimodal data for various applications [5][6]

Group 3: Advancements in Robotics
- RoboBrain 2.0 has achieved a 74% improvement in task planning accuracy compared to its predecessor, with overall performance enhancements of 30% and reduced response times [7][8]
- The newly released RoboOS 2.0 framework allows for seamless integration of robotic systems, significantly reducing deployment time from days to hours [8]

Group 4: Breakthroughs in Biomedicine
- The OpenComplex2 model represents a breakthrough in dynamic modeling of biological molecules, which could significantly shorten drug development cycles and enhance the quality of innovations in the biomedicine sector [9]
- The establishment of a high-speed cross-scale cardiac drug safety evaluation platform aims to expedite the assessment of drug toxicity, reducing evaluation time from 90 days to less than one day [9]
Just Now, Zhiyuan's New "Wujie" Series of Large Models Stuns the Field! For the First Time, AI Truly "Sees" Both the Macro and Micro Universes
机器之心· 2025-06-06 09:36
Core Viewpoint
- The article discusses the advancements in AI technology, particularly focusing on the launch of the "Wujie" series of large models by the Zhiyuan Institute, which signifies a shift from digital to physical world modeling and understanding at both macro and micro levels [4][8][40]

Group 1: AI Advancements and Trends
- The AI field remains vibrant and rapidly evolving, with significant developments in reinforcement learning and various AI domains such as intelligent agents and multimodal models [2][3]
- The annual Zhiyuan Conference showcased insights from leading experts, including Turing Award winners, on the future paths of AI [3]
- The "Wujie" series represents a new phase in large model exploration, focusing on bridging the gap between virtual and physical worlds [4][7]

Group 2: "Wujie" Series Features
- The "Wujie" series includes several key models: Emu3 (multimodal world model), Brainμ (brain science model), RoboOS 2.0 (embodied intelligence framework), and OpenComplex2 (microscopic life model) [6][15][34]
- Emu3 is the first native multimodal world model, integrating various modalities like text, images, and brain signals into a unified representation [14]
- Brainμ is a groundbreaking model in brain science, capable of processing over 1 million neural signal data units and supporting various neuroscience tasks [15][19]

Group 3: Embodied Intelligence Development
- The embodied intelligence sector has become a strategic focus, with the introduction of RoboOS 2.0 and RoboBrain 2.0, which enhance the capabilities of embodied AI systems [20][22]
- RoboOS 2.0 introduces a user-friendly framework for developers, significantly reducing the complexity of deploying robotic systems [24]
- RoboBrain 2.0 is noted for its superior performance in task planning and spatial reasoning, achieving a 74% improvement in task planning accuracy compared to its predecessor [27]

Group 4: Microscopic Life Modeling
- OpenComplex2 marks a significant advancement in modeling microscopic life, capable of predicting static and dynamic structures of biological molecules [34][38]
- The model has demonstrated its effectiveness by successfully predicting protein structures in a competitive evaluation, showcasing its potential in life sciences [36]
- OpenComplex2 aims to revolutionize drug discovery and biological research by providing a new modeling pathway for understanding molecular dynamics [38]

Group 5: Future Directions
- The "Wujie" series reflects a strategic upgrade in AI paradigms, emphasizing the importance of modeling the physical world and integrating various AI domains [40]
- The future of large models is expected to extend beyond traditional applications, influencing systems that understand and change the world [41]
[Zhiyuan Releases the "Wujie" Series of Large Models] On June 6, the 7th Beijing Zhiyuan Conference opened in Beijing. At the conference, the Zhiyuan Research Institute unveiled the "Wujie" series of large models, comprising the native multimodal world model Emu3, the brain-science multimodal general foundation model Jianwei Brainμ, the cross-embodiment brain-cerebellum collaboration framework RoboOS 2.0 with the embodied brain RoboBrain 2.0, and the full-atom microscopic life model OpenComplex2.
news flash· 2025-06-06 06:00
Core Insights
- The "Wujie" series of large models was launched by the Zhiyuan Research Institute during the 7th Beijing Zhiyuan Conference held on June 6 [1]

Group 1: Model Introductions
- The series includes the native multimodal world model Emu3 [1]
- It features the brain-science multimodal general foundation model Jianwei Brainμ [1]
- The cross-embodiment brain-cerebellum collaboration framework RoboOS 2.0 and the embodied brain RoboBrain 2.0 are also part of the series [1]
- Additionally, the full-atom microscopic life model OpenComplex2 was introduced [1]
A Conversation with Wang Zhongyuan, President of the Zhiyuan Research Institute: AI Is Accelerating from the Digital World into the Physical World
Mei Ri Jing Ji Xin Wen· 2025-06-06 05:15
NBD reporter: Ke Yang | NBD editor: Dong Xingsheng

On June 6, the Zhiyuan Research Institute released the "Wujie" series of large models at the 2025 Zhiyuan Conference, declaring its move from the "Wudao" era into a phase of exploring embodied intelligence.

In an interview with media including the National Business Daily, Wang Zhongyuan, president of the Zhiyuan Research Institute, said that "AI is accelerating from the digital world into the physical world," and that this is the fundamental logic behind the institute's strategic upgrade.

[Photo: Wang Zhongyuan. Image courtesy of the conference organizer]

Behind this judgment is a reshaping of the boundaries of AI technology and applications. Today's mainstream large models mostly focus on consumer-facing "digital intelligence" scenarios such as text generation and conversational dialogue, while Zhiyuan is trying to push AI into the more challenging, and more imaginative, "real world," spanning robotics, operating systems, and the construction of world models. In Wang Zhongyuan's view: "The world does not need that many 'PhDs'; what it needs is AI that can execute tasks and be deployed."

Embodied intelligence is becoming the starting point of the next AI race. Wang Zhongyuan judges that the "group stage" of embodied intelligence is not over, and the "knockout stage" is still far off. But whoever is first to make the technical path work and break through the data bottleneck on this new track may define the next decade of artificial intelligence.

From the early "Wudao" series to today's "Wujie" series, the Zhiyuan Research Institute's strategic shift did not come out of nowhere; it was a natural progression. As Wang Zhongyuan put it: "We believe AI must ultimately benefit human society, helping people shed tedious, repetitive, simple labor so that everyone can spend more time enjoying ...
A Single GPU Handles Ten-Thousand-Frame Video Understanding! Zhiyuan Research Institute Open-Sources Video-XL-2, a Lightweight Ultra-Long Video Understanding Model
量子位· 2025-06-04 05:21
Core Viewpoint
- The article discusses the release of Video-XL-2, a new-generation long video understanding model developed by the Zhiyuan Research Institute in collaboration with Shanghai Jiao Tong University, which significantly enhances the capabilities of open-source models in processing and understanding long video content [1][3]

Technical Overview
- Video-XL-2 is designed with three core components: a Visual Encoder, Dynamic Token Synthesis (DTS), and a Large Language Model (LLM) [4][6]
- The model utilizes SigLIP-SO400M as the visual encoder to process video frames into high-dimensional visual features, which are then fused and compressed by the DTS module to extract semantic dynamic information [6][11]
- The training strategy involves a four-stage progressive training design to build robust long video understanding capabilities [8][10]

Performance Improvements
- Video-XL-2 shows superior performance in long video understanding tasks, achieving leading levels on benchmarks such as MLVU, Video-MME, and LVBench compared to existing open-source models [9][15]
- The model can efficiently process videos of up to 10,000 frames on a single high-performance GPU, significantly extending the length of videos it can handle [19][23]
- It can encode 2048 frames of video in just 12 seconds, demonstrating remarkable speed and efficiency [24][28]

Application Potential
- Video-XL-2 has high application potential in various real-world scenarios, including film content analysis, plot understanding, and anomaly detection in surveillance videos [28][30]
- Specific examples of its application include answering questions about movie scenes and detecting unexpected events in surveillance footage [30][32]
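The three-stage pipeline described above (visual encoder → DTS compression → LLM) can be sketched as follows. The mean-pooling "compression" is only a stand-in for the actual Dynamic Token Synthesis module, and every shape, ratio, and encoder below is an illustrative assumption rather than Video-XL-2's real configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_frames(frames):
    """Stand-in visual encoder: reduce each frame to one feature vector.
    (The real model uses SigLIP-SO400M; this just pools pixels.)"""
    return frames.mean(axis=(2, 3))                  # (T, C)

def dts_compress(features, group=4):
    """Fuse every `group` consecutive frame features into one token,
    shrinking the sequence the language model must attend over."""
    T, C = features.shape
    pad = (-T) % group                               # zero-pad to a multiple
    if pad:
        features = np.vstack([features, np.zeros((pad, C))])
    return features.reshape(-1, group, C).mean(axis=1)  # (ceil(T/group), C)

frames = rng.normal(size=(10, 3, 8, 8))              # 10 toy RGB frames, 8x8
feats = encode_frames(frames)                        # (10, 3)
tokens = dts_compress(feats, group=4)                # (3, 3)
print(feats.shape, tokens.shape)
```

The compressed `tokens` would then be projected into the LLM's embedding space; the 4:1 fusion ratio here is arbitrary, but it shows why compression before the LLM is what makes ten-thousand-frame inputs tractable.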
Tencent Research Institute AI Digest, 2025-06-04
腾讯研究院· 2025-06-03 14:49
Group 1
- Microsoft launched Bing Video Creator, powered by OpenAI's Sora technology, allowing users to generate various types of videos through natural language [1]
- The service is free and offers two generation modes, quick and standard, with an initial allowance of 10 quick generations, producing videos 5 seconds in length [1]
- Built-in safety measures are included to prevent misuse, and each generated video is tagged with content credentials and traceability information; the service is currently unavailable in the China region [1]

Group 2
- Manus introduced a new slide feature that can generate 8 professional PPT slides in 10 minutes, receiving positive feedback [2]
- Testing showed that Manus can automatically search for information, plan structure, and generate content, supporting instant modifications and various export formats, although some pages display incompletely [2]
- Compared to Genspark, Manus is faster (10 minutes vs. 20 minutes) and more powerful, and was rated the best PPT creation tool currently available [2]

Group 3
- Character.ai launched AvatarFX, enabling static images to speak, sing, and interact with users [3]
- AvatarFX is based on the DiT architecture, featuring high fidelity and strong temporal consistency, maintaining stability even in complex scenarios with multiple characters and long sequences [3]
- Character.ai also introduced several AI creation features, including immersive narrative experiences and animated chat, while Google's acquisition of the platform faces an antitrust investigation [3]

Group 4
- Fellou 2.0 was officially released, functioning as an intelligent agent similar to "Jarvis" and enabling 24/7 batch production of AI tasks [4][5]
- The new version boasts improved speed (1.2-1.5 times faster), enhanced capabilities (supporting diverse deliverables), and increased reliability (success rate improved from 31% to 80%) [5]
- Built on the new Eko 2.0 architecture, it supports parallel processing of multiple tasks and plans to release a Windows version while continuously optimizing user experience and model intelligence [5]

Group 5
- YouWare is an "ambient programming" platform designed for creators in the AI era, allowing non-programmers to convert ideas into web pages and share them online [6]
- The platform's core advantage lies in its "what you see is what you think" experience: users describe their ideas, and AI generates code for immediate visualization and sharing [6]
- YouWare is supported by self-developed AI Agent and Sandbox technology, creating a community similar to "Instagram" and implementing a "Knot" reward mechanism to encourage quality content creation [6]

Group 6
- The Zhiyuan Research Institute open-sourced the lightweight long video understanding model Video-XL-2, capable of efficiently processing video inputs of up to ten thousand frames on a single card [7]
- The model consists of a visual encoder, a dynamic token synthesis module, and a large language model, employing a four-stage progressive training method and introducing a segmented pre-filling strategy [7]
- Video-XL-2 outperforms all lightweight open-source models on mainstream evaluation benchmarks, encoding 2048 frames of video in just 12 seconds, and is applicable to film content analysis and anomalous behavior monitoring [7]

Group 7
- Salesforce, the leading global CRM platform, acquired the AI Agent platform Moonhub, with the entire team joining Salesforce to develop the Agentforce platform [8]
- Salesforce CEO Marc Benioff is optimistic about the development of intelligent agents, aiming to create one billion agents through Agentforce by the end of 2025, with 3,000 paying customers already onboard [8]
- Moonhub specializes in recruiting agents that autonomously search for and screen candidates, complementing Salesforce's existing HR agent functions and enhancing its influence in the intelligent agent sector [8]

Group 8
- Fei-Fei Li's World Labs open-sourced the Forge renderer, enabling real-time rendering of AI-generated 3D worlds on ordinary devices [10]
- Forge is a web-based 3D Gaussian splatting (3DGS) renderer that integrates seamlessly with three.js, supporting multiple splat objects, cameras, and real-time animation and editing [10]
- The technology's key lies in an efficient painter's-algorithm solution to the sorting problem and a programmable data pipeline, allowing developers to handle AI-generated 3D worlds as easily as triangle meshes [10]

Group 9
- The report discusses Karpathy's model selection guide, recommending GPT-4o for simple daily questions and switching to o3 for complex tasks [11]
- Suggested usage proportions are 40% simple daily questions with 4o, 40% complex important issues with o3, and GPT-4.1 for code refinement [11]
- The core principle for model selection is "either-or": decide whether the task is important and worth waiting for (choose o3) or unimportant and in need of a quick answer (choose 4o) [11]

Group 10
- ChatGPT's memory system consists of two main components, saved memories and chat history, the latter further divided into current session history, dialogue history, and user insights [12]
- Memory saving is implemented through the bio tool, while dialogue history uses a vector space to establish multi-layer indexing [12]
- The memory mechanism significantly enhances user experience; the user-insight system may contribute over 80% of ChatGPT's improved understanding, transforming it from "you tell me" to "I can see" [12]
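The painter's-algorithm step that the Forge item (Group 8) credits as the key to 3DGS rendering can be illustrated independently of any real renderer: translucent splats must be sorted back-to-front along the view direction before alpha compositing, or blending produces wrong colors. The camera model and data layout below are illustrative assumptions, not Forge's actual API or data structures.

```python
import numpy as np

def sort_back_to_front(centers, cam_pos, view_dir):
    """Return splat indices ordered farthest-first along view_dir,
    the ordering the painter's algorithm needs for 'over' blending."""
    depth = (centers - cam_pos) @ view_dir   # signed view-space depth
    return np.argsort(-depth)                # farthest splat first

def composite(colors, alphas, order):
    """Alpha-composite splats in the given back-to-front order."""
    out = np.zeros(3)
    for i in order:
        out = colors[i] * alphas[i] + out * (1.0 - alphas[i])
    return out

# Three splats on the camera's z-axis at depths 1, 5, and 3.
centers = np.array([[0., 0., 1.], [0., 0., 5.], [0., 0., 3.]])
order = sort_back_to_front(centers, np.zeros(3), np.array([0., 0., 1.]))
print(order.tolist())                        # [1, 2, 0]: depth 5, then 3, then 1

colors = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
alphas = np.array([0.8, 0.8, 0.8])
pixel = composite(colors, alphas, order)     # nearest (red) splat dominates
```

A production renderer does this sort per frame for millions of splats on the GPU, which is why an efficient sorting scheme is singled out as the hard part.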
Ten Thousand Frames? One GPU! Zhiyuan Research Institute Open-Sources Video-XL-2, a Lightweight Ultra-Long Video Understanding Model
机器之心· 2025-06-03 04:06
Core Viewpoint
- The article discusses the release of Video-XL-2, a new-generation long video understanding model developed by the Zhiyuan Institute in collaboration with Shanghai Jiao Tong University, which significantly enhances the capabilities of multimodal large models in understanding long video content [2][6]

Technical Overview
- Video-XL-2 consists of three core components: a Visual Encoder, Dynamic Token Synthesis (DTS), and a Large Language Model (LLM) [3]
- The model uses SigLIP-SO400M as the visual encoder to process video frames into high-dimensional visual features, which are then fused and compressed by the DTS module to extract semantic dynamic information [3]
- The training strategy involves a four-stage progressive training design to build strong long video understanding capabilities, utilizing image/video-text pairs and large-scale high-quality datasets [4]

Performance Metrics
- Video-XL-2 outperforms existing lightweight open-source models on mainstream long video evaluation benchmarks such as MLVU, Video-MME, and LVBench, achieving state-of-the-art performance [11]
- The model can efficiently process videos of up to 10,000 frames on a single high-performance GPU, significantly extending the length of videos it can handle compared to previous models [16]
- Video-XL-2 encodes 2048 frames of video in just 12 seconds, showcasing its superior processing speed and efficiency [19]

Efficiency Innovations
- The model incorporates a chunk-based pre-filling strategy that reduces computational costs and memory usage by dividing long videos into segments [8]
- A bi-granularity key-value (KV) decoding mechanism allows the model to selectively load dense or sparse KVs based on task requirements, enhancing decoding efficiency [8]

Application Potential
- Video-XL-2 demonstrates high application potential in various scenarios, including film plot question answering, surveillance anomaly detection, and content summarization for films and game live streams [20][22]
- The model's advanced video understanding capabilities provide effective support for complex video analysis needs in real-world applications [20]
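The two efficiency mechanisms named above, chunk-based pre-filling and bi-granularity KV decoding, can be sketched as follows under one loud assumption: that "sparse KV" means keeping only a pooled summary of a chunk's cache while "dense KV" keeps the full cache. Chunk size, the relevance score, and mean-pooling are illustrative choices, not Video-XL-2's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def prefill_chunks(tokens, chunk=256):
    """Chunked pre-filling: split a long token sequence into segments
    so each can be processed (and its KV cached) independently,
    bounding peak memory instead of prefilling everything at once."""
    return [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]

def bi_granularity_cache(chunks, query, keep=2):
    """Bi-granularity KV: keep dense KV only for the `keep` chunks
    most relevant to the query; elsewhere keep a cheap 1-row summary."""
    scores = [float(c.mean(axis=0) @ query) for c in chunks]
    dense = set(np.argsort(scores)[-keep:])
    return [c if i in dense else c.mean(axis=0, keepdims=True)
            for i, c in enumerate(chunks)]

tokens = rng.normal(size=(1000, 16))            # toy "KV rows" for 1000 tokens
chunks = prefill_chunks(tokens, chunk=256)      # 4 chunks: 256, 256, 256, 232
cache = bi_granularity_cache(chunks, query=rng.normal(size=16), keep=2)
print([c.shape[0] for c in cache])              # two full chunks, two 1-row summaries
```

The payoff is that decoding attends over two full chunks plus two single summary rows instead of all 1000 cached rows, which is the kind of saving that lets a single GPU hold a ten-thousand-frame cache.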