Training Still Has Huge Room for Scaling! BAAI's Wang Zhongyuan: Video Data Remains Underutilized | MEET2026
Sina Finance · 2025-12-24 09:47
Core Insights
- The current state of artificial intelligence is at a critical turning point in its third wave, transitioning from weak AI to general AI, and from specialized robots (1.0) to general embodied intelligence (2.0) [1][5][32]
- The "Wujie" series of large models, including Emu3.5, aims to anchor AI's transition from the digital world to the physical world [1][5][28]
- Emu3.5 is a multimodal world model that learns from video data rather than relying solely on text, addressing the underutilization of video data in AI [1][28][35]

Multimodal Learning and Emu3.5
- Emu3.5 uses a unified autoregressive architecture to upgrade from Next-Token Prediction to Next-State Prediction, marking a shift from language learning to multimodal world learning [3][12][39]
- The training dataset for Emu3.5 has grown from 15 years to 790 years of video, and its parameter count has risen from 8 billion to 34 billion [38]
- Emu3.5's self-developed DiDA technology speeds up image generation by roughly 20 times, making it competitive with top models [38][39]

Open Source and Collaboration
- The institute has open-sourced over 200 models and more than 100 datasets in the past two years, with global downloads exceeding 690 million and 4 million respectively [3][25][50]
- It collaborates with over 30 leading robotics companies to promote the development of embodied-intelligence world models [25][50]

Robo Brain and Embodied Intelligence
- The RoboBrain system is designed to address the usability and generality challenges of embodied AI, enabling cross-robot data collection and standardization [22][47]
- RoboBrain 2.0 can decompose complex human instructions and allocate tasks to different types of robots based on the environment [22][47]
- The institute has also released RoboBrain-X0, which can drive a variety of real robots to complete complex tasks under few-shot conditions [23][47]
Training Still Has Huge Room for Scaling! BAAI's Wang Zhongyuan: Video Data Remains Underutilized | MEET2026
QbitAI · 2025-12-24 07:20
Core Viewpoint
- The article discusses the transition of artificial intelligence (AI) from the digital world to the physical world, marking a critical turning point in the third wave of AI development, with the introduction of the "Wujie" series of large models by the Zhiyuan Institute [12][13][14]

Group 1: AI Development and Trends
- The current AI landscape is at a pivotal moment in which large models are facilitating the shift from weak AI to general AI, and from specialized robots (1.0) to general embodied intelligence (2.0) [3][13]
- The "Wujie" series of large models aims to bridge the gap between the digital and physical worlds, representing a significant advance in AI capabilities [4][14]
- The Emu3.5 model, part of the Wujie series, uses a unified autoregressive architecture to move from Next-Token Prediction to Next-State Prediction, marking a new phase in multimodal learning [17][22]

Group 2: Emu3.5 Model Features
- Emu3.5 distinguishes itself by learning from long videos, which contain the rich temporal, spatial, and causal information essential for understanding the physical world [18][20]
- The training dataset for Emu3.5 has expanded significantly, from 15 years to 790 years of video data, and the model's parameters have grown from 8 billion to 34 billion [23]
- Emu3.5's autoregressive architecture allows rapid image generation, achieving speeds comparable to top models through proprietary DiDA technology [23]

Group 3: Multimodal Learning and Applications
- Emu3.5 is expected to lead AI into a new stage of multimodal world learning, with substantial scaling potential given how underutilized vast multimodal data remains [24]
- The model demonstrates strong multimodal reasoning and visual understanding, as evidenced by its performance on image generation and editing tasks [25][27]
- Emu3.5 excels at temporal and spatial state prediction, showcasing its superior understanding of the physical world [29][31]

Group 4: Embodied Intelligence and Technological Advancements
- The Zhiyuan Institute is addressing the challenges of embodied intelligence, which currently suffers from usability and generality issues [34]
- The institute has built a comprehensive technology stack centered on the RoboBrain, enabling cross-robot data collection and standardization [35]
- Recent advances include RoboBrain 2.0, which can decompose complex human instructions for execution by various robots, broadening the practical applications of embodied intelligence [36]

Group 5: Open Source Contributions
- The Zhiyuan Institute has committed to open source, releasing over 200 models and 100 datasets, with global downloads exceeding 690 million and 4 million, respectively [38]
- The institute collaborates with over 30 leading robotics companies to promote the development of embodied-intelligence world models [38]
2025: Do China's Large Models No Longer Believe That "Brute Force Works Miracles"?
36Kr · 2025-12-19 11:06
Core Insights
- The article discusses the evolution of generative AI leading up to 2025, highlighting three main trajectories: cognitive deepening, dimensional breakthroughs, and efficiency reconstruction [1][2][3]

Group 1: Evolution of AI Models
- The first trajectory is cognitive deepening, transitioning from "intuition" to "logic," where models evolve from quick pattern matching to multi-step reasoning through reinforcement learning [1]
- The second trajectory involves dimensional breakthroughs, moving from "language" to "physical space," emphasizing the importance of spatial intelligence in understanding the physical world [1][2]
- The third trajectory focuses on efficiency reconstruction, shifting from "brute-force aesthetics" to "cost-effectiveness," necessitating lighter model architectures that can support deep reasoning and spatial understanding [1]

Group 2: Key Discussions from the Forum
- At the Tencent HiTechDay forum, experts discussed the evolution of large models, emphasizing the transition from learning from text to learning from video, which provides rich spatiotemporal information [2][3]
- The "Densing Law" proposed by Liu Zhiyuan suggests that the future of AI lies in increasing the "intelligence density" within model parameters, predicting that by 2030, devices could support capabilities equivalent to GPT-5 [3][8]
- The commercial landscape is characterized by a "dual-core drive" between open-source and closed-source models, with a focus on building a sustainable business structure that can withstand model iteration cycles [3][10]

Group 3: Challenges and Opportunities
- The article identifies three main challenges in the commercialization of AI agents: insufficient core reasoning capabilities, the need for domain-specific training, and issues with memory and forgetting mechanisms [11][12]
- The discussion highlights the importance of end-side intelligence, which must balance quick responses with deep thinking, particularly in applications like robotics [13][18]
- The potential for AI to penetrate various industries is noted, with the "ToP" (To Professional) market segment seen as a lucrative opportunity for AI applications [15][21]

Group 4: Future Directions and Recommendations
- The article emphasizes the need for a collaborative ecosystem that combines open-source initiatives with efficient model technologies to drive AI advancements in China [20][22]
- Entrepreneurs are advised to seek opportunities in niche industries that are less accessible to large models and to establish business structures that can adapt to ongoing model iterations [21][22]
- The integration of hardware and software is seen as crucial for the future of AI, with a call for investment in both areas to achieve balanced development [19][20]
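The "Densing Law" bullet above is an exponential-growth claim, so its timeline implications reduce to simple arithmetic. The sketch below illustrates only that arithmetic; the doubling period and the required density multiple are hypothetical placeholders, not figures from the forum.

```python
import math

def years_until_density(multiple_needed, doubling_months):
    """Years until 'intelligence density' has grown by `multiple_needed`,
    assuming it doubles every `doubling_months` months (hypothetical inputs)."""
    return doubling_months * math.log2(multiple_needed) / 12

# Example: if a fixed on-device parameter budget needs 32x today's density,
# and density doubles every 12 months (an assumed rate), the wait is:
print(years_until_density(32, 12))  # 5.0 years
```

Because the growth is exponential, halving the doubling period halves the wait, which is why the claimed 2030 horizon for GPT-5-class on-device capability hinges entirely on the density-doubling rate.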
[Industrial Internet Weekly] "15th Five-Year Plan" Recommendations: Fully Implement the "AI+" Initiative to Seize the Commanding Heights of AI Industry Applications; Jensen Huang's Latest GTC Keynote Sketches an AI Blueprint; Exiting the Chinese Market? SA...
TMTPost · 2025-11-03 02:12
Domestic News
- ZhiYuan released a multimodal world model, Emu3.5, capable of cross-scenario embodied operations and complex interactions [2]
- Zero One Wanwu and Open Source China launched the "Open AgentKit Platform" (OAK), a one-stop open-source solution for AI Agent development [3]
- Boson Quantum won a quantum computing procurement project, "Tianchen AI," from China Merchants Bank, providing quantum optimization algorithms and computing power [4]
- Hand Information aims to achieve 300 million yuan in AI-related revenue this year, with a target of doubling next year [5]
- Meituan launched and open-sourced the LongCat-Video model, increasing video generation speed by 10.1 times [7]
- Shengbang Security released a 200G high-speed link encryption gateway, achieving throughput of over 200 Gbps and latency under 3 microseconds [8]
- DingTalk introduced a "1+4+N" AI solution for the mining industry, with nearly 50% of China's top 500 mining companies using DingTalk [9]
- The Doubao video generation model 1.0 pro fast was launched, achieving roughly a 3x speed increase and a 72% price reduction [10]
- Yimu Technology showcased a bionic tactile sensor at IROS, enhancing robotic interaction capabilities [11]
- The world's first full-size bionic robot for classroom teaching was launched in Hefei, marking a significant step for AI in education [12]
- Liwu Copper and Huawei signed a framework cooperation agreement to promote intelligent transformation in mining [13]
- The Chinese Academy of Sciences Hong Kong Innovation Research Institute and Huawei launched a new-generation medical AI model, CARES 3.0 [14]
- MiniMax released the Hailuo 2.3 video generation model, improving dynamic expression and style presentation [15]
- Duodian Intelligence partnered with Circle to accelerate the construction of an ecosystem combining retail, fintech, and Web3 [16]
- Tanjike Technology launched a large-model agent platform for AI digital employees, enhancing human-machine collaboration [17]
- Wanlian Yida Group announced the launch of its first full-industry AI model, "Wanlian Moore" [18]
- MiniMax introduced the Speech 2.6 model, achieving audio generation latency below 250 milliseconds [19]
- DingTalk released the DingTalk A1 Lite AI hardware, facilitating efficient voice communication management [20]
- SAS China reportedly faces mass layoffs, raising concerns about its future in the Chinese market [21]

Overseas News
- OpenAI is offering a one-year free ChatGPT Go subscription to users in India to expand its market presence [22]
- NVIDIA's GTC conference highlighted advances in AI, including partnerships with Oracle and CrowdStrike [23]
- Foxconn plans to invest $1.37 billion in AI computing clusters and supercomputing centers [24]
- Meituan's international delivery brand Keeta launched operations in Abu Dhabi [25]
- OpenAI completed a capital restructuring, solidifying Microsoft's stake in the company [26]
- Blackstone-backed AirTrunk partnered with Saudi Arabia's HUMAIN to invest approximately $3 billion in data centers [27]
- Amazon announced plans to lay off about 14,000 employees to streamline operations and accelerate AI deployment [28]
- OpenAI introduced Aardvark, an autonomous cybersecurity research agent powered by GPT-5 [29]

Financing and Mergers
- Pengnao Technology completed an angel round worth tens of millions of yuan for brain-computer interface development [31]
- Ant Group acquired a stake in AI hardware developer Aide Future Intelligent [32]
- Songyan Power completed nearly 300 million yuan in Pre-B round financing for humanoid robot development [33]
- Global AI platform MAI raised $25 million in seed funding to enhance its AI Agent capabilities [34]
- Microsoft signed a new agreement with OpenAI to strengthen their partnership [35]
- Apex Context, founded by former ByteDance and Volcano AI executives, secured millions in funding for AI-driven marketing solutions [36]
- Pyromind Dynamics raised millions in seed funding to expand its reinforcement learning services [37]

Policies and Trends
- Shandong Province aims to achieve comprehensive low-altitude communication network coverage by 2030 [41]
- Shandong is accelerating the construction of 5G-A and integrated sensing base stations [42]
- The Ministry of Transport is promoting large-scale AI applications in transportation [47]
- Shanghai plans to establish a millisecond-level computing network by 2027 [48]
- The National Development and Reform Commission encourages the transformation of inefficient computing facilities [50]
BAAI's Wang Zhongyuan: The Key to World Models Is Truly Predicting the Next State
Economic Observer · 2025-11-01 10:51
Core Insights
- The term "world model" has gained significant attention in the AI field, representing a shift from mere recognition and generation to understanding and predicting the dynamics of the world [2]
- Companies are seeking new growth points as the returns on large models diminish, with DeepMind, OpenAI, and others exploring interactive 3D worlds and robotics [2]
- The release of the Emu3.5 multimodal world model by the Zhiyuan Research Institute marks a potential breakthrough in AI, underscoring the importance of multimodal and world models for future growth [2][3]

Group 1
- The Emu3.5 model is trained on over 10 trillion tokens of multimodal data, including 790 years of video, and has a parameter scale of 34 billion [3]
- The "Discrete Diffusion Adaptation (DiDA)" inference method speeds up image generation by nearly 20 times while maintaining high-quality output [3]
- Emu3.5 achieves breakthroughs in three dimensions: understanding higher-level human intentions, simulating dynamic worlds, and providing a cognitive basis for AI-human interaction [3]

Group 2
- The core of the world model is not merely video generation but understanding causal and physical laws, essential for tasks like predicting the outcome of robotic actions [3][4]
- Emu3.5 supports embodied intelligence and can generate multimodal training data, showcasing an innovative architecture from a Chinese research team [4]
- The evolution from Emu3 to Emu3.5 enhances AI's physical intuition and cross-scenario planning, pointing toward a future in which AI both understands the world and acts within it [4]
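The roughly 20x DiDA speedup quoted in this article comes from replacing token-by-token decoding with a small number of parallel refinement passes. The toy functions below sketch only this forward-pass accounting under assumed numbers (4,096 image tokens, 4 refinement steps); they are not the actual DiDA algorithm, whose details are not given here.

```python
def sequential_decode(predict, n_tokens):
    """Standard autoregressive decoding: one forward pass per token."""
    seq, passes = [], 0
    for _ in range(n_tokens):
        seq.append(predict(seq))
        passes += 1
    return seq, passes

def parallel_refine_decode(refine, n_tokens, n_steps=4):
    """Few-step refinement: every position is updated jointly on each pass."""
    seq, passes = [None] * n_tokens, 0  # start from fully masked positions
    for _ in range(n_steps):
        seq = refine(seq)
        passes += 1
    return seq, passes

# Toy stand-ins for the model: fill positions with dummy token ids.
pred = lambda seq: len(seq)
refi = lambda seq: [0 if t is None else t for t in seq]

_, p_seq = sequential_decode(pred, 4096)       # 4096 forward passes
_, p_par = parallel_refine_decode(refi, 4096)  # 4 forward passes
print(p_seq // p_par)  # 1024x fewer passes under these assumed numbers
```

Real-world speedups are smaller than the raw pass ratio because each parallel pass touches every position, but the accounting shows why cutting thousands of sequential passes down to a handful can close most of the gap with diffusion-style generators.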
BAAI Releases Wujie·Emu3.5, Ushering in "Next-State Prediction"! Wang Zhongyuan: It May Open a Third Scaling Paradigm
AI Frontline · 2025-11-01 05:33
Core Insights
- The article discusses the launch of the world's first native multimodal world model, Emu3, by the Zhiyuan Research Institute, which predicts the next token without diffusion models or composite pipelines, achieving a unified approach to images, text, and video [2]
- Emu3.5, released a year later, enhances the model's capabilities by simulating human natural learning and achieving generalized world-modeling ability through Next-State Prediction (NSP) [2][3]
- The core of the world model is predicting the next spatiotemporal state, which is crucial for embodied intelligence [2]

Model Features
- Emu3.5 has three main characteristics: understanding high-level human intentions and generating detailed action paths, seamless integration of world understanding, planning, and simulation, and providing a cognitive foundation for generalized interaction between AI and humans or physical environments [3]
- The model's architecture allows visual and textual tokens to be integrated, enhancing its scalability and performance [8]

Technological Innovations
- Emu3.5 underwent two phases of pre-training on approximately 13 trillion tokens, focusing on visual resolution diversity and data quality, followed by supervised fine-tuning on 150 billion samples [12][13]
- A large-scale native multimodal reinforcement learning system was developed, featuring a comprehensive reward system that balances multiple quality standards and avoids overfitting [14]
- The introduction of DiDA technology accelerated inference by 20 times, allowing the autoregressive model to compete with diffusion models in performance [17][19]

Industry Impact
- The evolution from Emu3 to Emu3.5 demonstrates the potential for scaling in the multimodal field, similar to the advances seen in language models [6]
- Emu3.5 represents a significant original innovation in the AI large-model field, combining algorithmic, engineering, and data-training innovations [9]
- The model's ability to understand causal relationships and spatiotemporal dynamics positions it uniquely in the landscape of AI models, potentially opening a new avenue for large models [20]
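The "next-state prediction" idea described in this article can be sketched as ordinary next-token training over a single interleaved stream of text and discrete vision tokens. The sequence, token ids, and helper below are made-up illustrations of that framing, not Emu3.5's actual tokenizer or training code.

```python
# Hypothetical interleaved stream: text tokens and discrete vision tokens
# share one sequence, so a single autoregressive loss covers both.
TEXT, VIS = "t", "v"
sequence = [(TEXT, 12), (TEXT, 7), (VIS, 901), (VIS, 455), (TEXT, 3)]

def next_token_pairs(seq):
    """Every (context, target) pair the autoregressive loss would score.
    Because vision tokens sit in the same stream, predicting one amounts to
    predicting the next visual state of the scene, not just the next word."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

pairs = next_token_pairs(sequence)
print(len(pairs))    # 4 training targets from a 5-token stream
print(pairs[2][1])   # ('v', 455): a vision token as the prediction target
```

Under this framing, "upgrading" from next-token to next-state prediction is less a new loss than a claim about what the targets represent: when whole image frames are tokenized into the stream, the same objective forces the model to predict how the depicted world changes.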
Tencent Research Institute: Weekly Top 50 AI Keywords
Tencent Research Institute · 2025-11-01 02:33
Core Insights
- The article presents a weekly roundup of the top 50 keywords related to AI developments, highlighting significant trends and innovations in the industry [2]

Group 1: Chips
- Vera Rubin is a notable keyword associated with NVIDIA, indicating advances in chip technology [3]
- Qualcomm has introduced a new AI inference solution, showcasing its commitment to enhancing AI capabilities [3]

Group 2: Models
- OpenAI has developed a safety classification model, emphasizing the importance of security in AI applications [3]
- Cursor has launched its self-developed Composer model, reflecting the trend of companies creating proprietary AI models [3]
- NVIDIA's OmniVinci model and MiniMax's M2 model are also highlighted, indicating ongoing innovation in AI modeling [3][4]

Group 3: Applications
- Sora has introduced a role cameo feature, enhancing user interaction with AI [3]
- MiniMax Speech 2.6 and Beijing Zhiyuan's WuJie·Emu3.5 are examples of new AI applications aimed at improving communication [3]
- Adobe's Firefly Image 5 and Tencent's interactive AI podcast demonstrate the growing integration of AI in creative and media sectors [3][4]

Group 4: Technology
- The NEO home robot by 1X Technologies and LeRobot v0.4.0 by Hugging Face represent advances in consumer robotics [4]
- Neuralink's PRIMA artificial vision and Merge Labs' ultrasound brain-machine interface highlight significant innovations in AI and neuroscience [4]

Group 5: Capital
- OpenAI is undergoing a capital structure reorganization and has plans for an IPO, indicating its growth and potential market impact [4]

Group 6: Events and Opinions
- There is a call for copyright protection in Japan, reflecting ongoing discussions about intellectual property in the AI space [4]
- Yoshua Bengio's new definitions of AGI and insights on mental health data from OpenAI indicate evolving perspectives on AI's role in society [4]
Tencent Research Institute AI Express, 2025-10-31
Tencent Research Institute · 2025-10-30 16:06
Group 1: OpenAI Developments
- OpenAI has open-sourced the gpt-oss-safeguard safety classification model in 120-billion- and 20-billion-parameter versions, which can directly interpret policy documents for content classification without retraining [1]
- The model outperforms GPT-5-thinking in multiple benchmark tests, achieving industry-best cost-effectiveness on content moderation evaluation sets and the ToxicChat dataset [1]
- OpenAI has used this technology internally (the Safety Reasoner prototype) for image generation and products like Sora 2, with safety reasoning accounting for 16% of its compute [1]

Group 2: Cursor 2.0 Update
- Cursor has released version 2.0, introducing its first self-developed coding model, Composer, which generates at 250 tokens per second, four times faster than comparable leading systems [2]
- Composer employs a mixture-of-experts (MoE) architecture optimized for software engineering through reinforcement learning, achieving cutting-edge performance in Cursor Bench evaluations [2]
- The new interface supports multi-agent parallel collaboration, allowing different models to process the same task simultaneously via git worktrees or remote machines, and includes native browser tools for test iterations [2]

Group 3: Sora New Features
- Sora has launched the Character Cameo feature, providing consistency for non-human cameo characters and allowing virtual characters to be extracted from generated videos for reuse [3]
- New video-splicing functionality and community rankings have been added, ranking the most-used cameo characters and the most-remixed videos [3]
- Sora has temporarily lifted the invitation-code requirement for direct registration in the US, Canada, Japan, and South Korea, coinciding with the launch of its Android version to capture the Android market [3]

Group 4: MiniMax Speech 2.6 Update
- MiniMax Speech 2.6 has achieved end-to-end latency under 250 milliseconds, reaching industry-leading levels and becoming the underlying technology engine for global voice platforms such as LiveKit and Pipecat [4]
- The new version supports direct conversion of non-standard text formats such as URLs, emails, phone numbers, dates, and amounts without cumbersome text preprocessing, enabling smoother information delivery [4]
- The Fluent LoRA feature generates fluent, natural speech even from recordings with accents or non-native fluency, supporting over 40 languages [4]

Group 5: Emu3.5 Launch
- Beijing Zhiyuan has released the Emu3.5 multimodal world model, a 34-billion-parameter dense Transformer pre-trained on over 10 trillion tokens (approximately 790 years of video), revealing a "multimodal scaling paradigm" for the first time [5]
- It employs a "next-state prediction" objective to achieve visual narrative and guidance capabilities, matching the performance of Gemini-2.5-Flash-Image in image editing tasks [5]

Group 6: OpenAI IPO Plans
- OpenAI plans to submit its IPO application as early as the second half of 2026, aiming to raise at least $60 billion, with a valuation potentially reaching $1 trillion, which would make it the largest IPO in history [6]
- Following the restructuring, the non-profit organization will hold 26% of the newly formed OpenAI Group, while Microsoft will relinquish its exclusive cloud-service priority in exchange for an additional $250 billion Azure procurement contract [6]
- The new agreement stipulates that the realization of AGI must be verified by independent experts, extends Microsoft's rights to use OpenAI technology until 2032, and allows it to conduct AGI research independently or with third parties [6]

Group 7: OpenFold3 Release
- The OpenFold Consortium has released a preview of OpenFold3, trained on over 300,000 experimental structures and 13 million synthetic structures, capable of predicting interactions between proteins and small-molecule ligands, as well as nucleic acids [7]
- In single-stranded RNA structure prediction, its performance rivals AlphaFold3, and its modular design allows users to modify the model for native data interpretation [7]
- All components are licensed under Apache 2.0, permitting commercial use, with companies such as Novo Nordisk, Outpace Bio, and Bayer planning to use the model to accelerate research [7]

Group 8: Anthropic Research Findings
- Anthropic's latest research reveals that Claude can detect and report concepts injected by humans, with the strongest models achieving a 20% introspection success rate [8]
- The research team found that models could defend and fabricate reasons for their "errors" based on falsified internal states introduced through retrospective concept injection [8]
- The experiments suggest AI has some deliberate control over internal representations, marking the emergence of "reachable consciousness," though this remains far from subjective experience or "phenomenal consciousness" [8]

Group 9: Grokking Research Insights
- Former Meta FAIR researcher Tian Yuandong published research on grokking, proving mathematically that models require only O(M log M) samples for generalization, significantly fewer than the traditional M² requirement [9]
- He showed that the essence of "insight" is a multi-peak non-convex optimization process, in which more data raises the "generalization peak" above the "memorization peak," triggering the transition from memorization to generalization [9]
- Tian emphasized that representation learning underlies all intelligent capabilities, that the loss function is merely a proxy signal for optimization, and that true breakthroughs stem from changes in representation [9]
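A quick sanity check on the scale of the grokking sample-complexity gap mentioned above: O(M log M) grows far more slowly than the traditional M² estimate. The values of M below are illustrative only, not figures from the research, and natural log is used for concreteness since big-O absorbs the base.

```python
import math

# Compare the O(M log M) sample requirement with the traditional M^2
# estimate at two illustrative problem sizes M.
for M in (10**3, 10**6):
    m_log_m = M * math.log(M)   # samples needed under the new bound
    m_sq = M ** 2               # samples needed under the old estimate
    ratio = m_sq / m_log_m      # how many times fewer samples suffice
    print(f"M={M:>9,}  M*logM={m_log_m:,.0f}  M^2={m_sq:,}  ratio={ratio:,.0f}")
```

At M = 10⁶ the gap is four to five orders of magnitude, which is why the bound matters for data-hungry training regimes.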
AI Evolution Express | OpenAI Plans to File Its IPO Application in 2026
Yicai · 2025-10-30 13:09
Group 1
- OpenAI plans to submit an IPO application in 2026 and aims for a public listing in 2027 [1]
- OpenAI has released a new safety reasoning model called gpt-oss-safeguard [1]
- Nvidia has partnered with Palantir to advance practical applications of AI [1]
- Zhiyuan has released a multimodal world model called Wujie·Emu3.5, capable of cross-scenario embodied operations [1]
- TrendForce estimates that AI server shipments will grow by over 20% annually through 2026 [1]

Group 2
- Microsoft's CEO announced that the company's total AI computing power will increase by over 80% this year [3]
- Amazon has launched its AI supercluster, Project Rainier [3]
World Models Now Have an Open-Source Foundation: Emu3.5 Takes the Multimodal SOTA, Outperforming Nano Banana
36Kr · 2025-10-30 11:56
Core Insights
- The article highlights the launch of the latest open-source multimodal world model, Emu3.5, developed by the Beijing Academy of Artificial Intelligence (BAAI), which excels in tasks involving images, text, and videos, showcasing high precision in operations like erasing handwriting [1][6][9]

Group 1: Model Capabilities
- Emu3.5 demonstrates advanced capabilities in generating coherent and logical content, particularly in simulating dynamic physical worlds, allowing users to experience virtual environments from a first-person perspective [6][12]
- The model can perform complex image editing and generate visual narratives, maintaining consistency and style throughout, which is crucial for long-horizon creative tasks [15][17]
- Emu3.5's grasp of long sequences and spatial consistency enables it to execute tasks like organizing a desktop through step-by-step instructions [12][22]

Group 2: Technical Innovations
- The model is built on a 34-billion-parameter architecture using a standard decoder-only Transformer framework, unifying various tasks into a single Next-State Prediction task [17][25]
- Emu3.5 has been pre-trained on over 10 trillion tokens of multimodal data, primarily from internet videos, allowing it to learn temporal continuity and causal relationships effectively [18][25]
- The Discrete Diffusion Adaptation (DiDA) technology improves image generation speed by nearly 20 times without compromising performance [26]

Group 3: Open Source Initiative
- Open-sourcing Emu3.5 lets global developers and researchers build on a model that understands physics and logic, facilitating the creation of more realistic videos and intelligent agents across industries [27][29]