机器之心
How do agents learn to "imagine"? A deep dive into the three technical paradigms for embedding world models in embodied systems
机器之心· 2025-12-22 04:23
Core Insights
- The article discusses the integration of world models into embodied intelligent systems, emphasizing the shift from reactive loops to predictive capabilities [2][10]
- It highlights the importance of world models in enhancing sample efficiency, long-term reasoning, safety, and proactive planning in embodied agents [11][12]

Summary by Sections

Introduction to World Models
- Embodied intelligent systems traditionally relied on a "perception-action" loop, lacking the ability to predict future states [2]
- The introduction of world models allows agents to "imagine" future scenarios, enhancing their operational capabilities [10]

Research Overview
- A comprehensive survey from a research team involving multiple universities presents a framework for integrating world models into embodied systems [5][7]
- The paper categorizes existing research into three paradigms based on architectural integration [5][14]

Paradigm Classification
- The relationship between world models (WM) and policy models (PM) is described as a "coupling strength spectrum," ranging from weak to strong dependencies [15]
- Three categories are identified: Modular, Sequential, and Unified architectures, each with distinct characteristics [15][16]

Modular Architecture
- In this architecture, WM and PM operate as independent modules with weak coupling, focusing on causal relationships between actions and states [20]
- The world model acts as an internal simulator, allowing agents to predict the outcomes of candidate actions (a minimal planning sketch follows this summary) [20]

Sequential Architecture
- This architecture involves a two-stage process in which the WM predicts future states and the PM executes actions based on those predictions [21]
- The world model generates a valuable goal, simplifying complex long-horizon tasks into manageable sub-problems [22][23]

Unified Architecture
- The unified architecture integrates WM and PM into a single end-to-end network, allowing joint training and optimization [24][25]
- This configuration enables the agent to anticipate future states and produce appropriate actions without explicitly separating simulation from decision-making [25]

Future Directions
- The article outlines potential research directions, including the representation space of world models, structured intent generation, and the balance between interpretability and optimality [27][28][29]
- It emphasizes the need for effective alignment mechanisms that preserve performance while exploring unified world-policy model paradigms [29]
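To make the modular paradigm concrete, here is a minimal sketch of a world model used as an internal simulator for action selection: the planner "imagines" a short rollout for each candidate action and picks the best-scoring one. All class and function names, the toy dynamics, and the reward are illustrative assumptions, not the survey's implementation.

```python
# Modular paradigm sketch: a (toy) learned world model used purely as an
# internal simulator; the policy side is a simple planner over candidate actions.
import numpy as np

class WorldModel:
    """Hypothetical one-step dynamics model: predicts next state and reward."""
    def predict(self, state: np.ndarray, action: np.ndarray):
        # A real system would call a learned network; we fake linear dynamics here.
        next_state = state + 0.1 * action
        reward = -np.linalg.norm(next_state)  # e.g. negative distance-to-goal
        return next_state, reward

def plan_by_imagination(wm: WorldModel, state: np.ndarray,
                        candidates: list[np.ndarray], horizon: int = 5) -> np.ndarray:
    """Score each candidate action with an imagined rollout; return the best one."""
    best_action, best_return = None, -np.inf
    for action in candidates:
        s, total = state.copy(), 0.0
        for _ in range(horizon):
            s, r = wm.predict(s, action)  # imagined transition, no real environment step
            total += r
        if total > best_return:
            best_action, best_return = action, total
    return best_action

if __name__ == "__main__":
    wm = WorldModel()
    state = np.array([1.0, -2.0])
    candidates = [np.array([dx, dy], dtype=float) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
    print("chosen action:", plan_by_imagination(wm, state, candidates))
```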
Chen Tianqiao's Shanda AI Research Tokyo officially debuts at SIGGRAPH Asia, unveiling digital human and world model results
机器之心· 2025-12-22 04:23
Core Insights
- Shanda Group's Shanda AI Research Tokyo made its debut at SIGGRAPH Asia 2025, focusing on "Interactive Intelligence" and "Spatiotemporal Intelligence" in digital human research, reflecting the long-term vision of founder Chen Tianqiao [1][10]
- The article discusses the systemic challenges behind the "soul" deficiency in current digital human interactions, a significant barrier to user engagement despite substantial investment in visual effects [2][3]

Systemic Challenges
- **Long-term Memory and Personality Consistency**: Current large language models (LLMs) struggle to maintain a stable personality over extended conversations, leading to "persona drift" and inconsistent narrative logic [3]
- **Lack of Multimodal Emotional Expression**: Digital humans often exhibit a "zombie-face" phenomenon, lacking natural micro-expressions and emotional responses, which diminishes immersion [3]
- **Absence of Self-Evolution Capability**: Most digital humans operate as passive systems, unable to learn from interactions or adapt to user preferences, hindering their evolution into truly intelligent entities [3]

Industry Consensus
- Experts at the SIGGRAPH Asia conference reached a consensus that the bottleneck in digital human development has shifted from visual fidelity to cognition and interaction logic, emphasizing long-term memory, multimodal emotional expression, and self-evolution as core competencies [13][10]

Introduction of Mio
- Shanda AI Research Tokyo introduced Mio (Multimodal Interactive Omni-Avatar), a framework designed to transform digital humans from passive entities into intelligent partners capable of autonomous thought and interaction [16][22]
- Mio's architecture comprises five core modules, Thinker (cognitive core), Talker (voice engine), Facial Animator, Body Animator, and Renderer, which work together to form a seamless interaction loop (see the pipeline sketch after this summary) [20][21]

Performance Metrics
- Mio achieved an overall Interactive Intelligence Score (IIS) of 76.0, an 8.4-point improvement over previous technologies and a new performance benchmark for the field [25][22]

Future Outlook
- The development of Mio signals a paradigm shift in digital human technology, moving the focus from static visual realism to dynamic, meaningful interactive intelligence, with potential applications in virtual companionship, interactive storytelling, and immersive gaming [22][25]
- Shanda AI Research Tokyo has publicly released the Mio project's complete technical report, pre-trained models, and evaluation benchmarks to foster collaboration in the field [28]
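The five module names above come from the article; the sketch below only shows how such modules could be chained into one interaction step. Every interface, data type, and placeholder output is an assumption for illustration, not Mio's actual API.

```python
# Illustrative sketch of chaining Mio-style modules into one interaction loop.
# Module names follow the article; everything else is a stand-in.
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str
    emotion: str  # hypothetical emotion tag passed to the animators

class Thinker:
    """Cognitive core: decides what to say and with which emotion."""
    def respond(self, user_input: str, memory: list[str]) -> Utterance:
        memory.append(user_input)  # naive stand-in for long-term memory
        return Utterance(text=f"Echoing: {user_input}", emotion="friendly")

class Talker:
    """Voice engine: turns text into (placeholder) audio bytes."""
    def synthesize(self, utt: Utterance) -> bytes:
        return utt.text.encode("utf-8")  # stand-in for TTS output

class FacialAnimator:
    def animate(self, utt: Utterance) -> str:
        return f"facial blendshapes for '{utt.emotion}'"

class BodyAnimator:
    def animate(self, utt: Utterance) -> str:
        return f"body gestures for '{utt.emotion}'"

class Renderer:
    def render(self, audio: bytes, face: str, body: str) -> str:
        return f"frame(audio={len(audio)}B, {face}, {body})"

def interaction_step(user_input: str, memory: list[str]) -> str:
    thinker, talker = Thinker(), Talker()
    face_anim, body_anim, renderer = FacialAnimator(), BodyAnimator(), Renderer()
    utt = thinker.respond(user_input, memory)
    return renderer.render(talker.synthesize(utt),
                           face_anim.animate(utt),
                           body_anim.animate(utt))

if __name__ == "__main__":
    memory: list[str] = []
    print(interaction_step("Hello, Mio!", memory))
```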
Aiming at the top battlegrounds of AI and graphics: Moore Threads stages a hardcore showcase of domestic GPU strength
机器之心· 2025-12-22 04:23
Core Viewpoint
- The article highlights Moore Threads' unveiling of its latest AI computing card, the S5000, showcasing significant advances in AI computing capability and introducing the MUSA architecture, which aims to support a broad range of AI and graphics computing needs [1][3][5]

Group 1: MUSA Architecture Overview
- MUSA (Meta-computing Unified System Architecture) is a full-stack technology developed by Moore Threads, covering chip architecture, instruction sets, programming models, and software frameworks, and serving as the foundation for all of the company's products [7]
- The architecture features a new "Huagang" design that improves compute density by 50% and energy efficiency by 10x compared with the previous generation [9]
- MUSA supports mainstream GPU ecosystems and various CPU systems, and ensures security through a hardware-based protection mechanism [9][10]

Group 2: New Chip Developments
- The upcoming "Huashan" and "Lushan" chips target AI computing and professional graphics rendering, respectively, with "Huashan" positioned to compete with top international AI chips [18][21]
- "Huashan" features a dedicated large language model acceleration engine and supports high-speed interconnects for large-scale clusters, achieving performance comparable to leading global products [22][23]
- "Lushan" aims to remove performance bottlenecks in gaming and professional design, claiming a 15-fold increase in AAA game performance over the previous generation [25]

Group 3: High-Performance Computing Infrastructure
- Moore Threads introduced the KUAE 2.0 super AI infrastructure, capable of 10 ExaFLOPS and supporting trillion-parameter model training at over 60% utilization efficiency [31]
- The company plans to launch the MTT C256 super-node product, increasing GPU deployment density and reducing bandwidth loss [31][33]

Group 4: Future Directions and Ecosystem Development
- The company is expanding its focus beyond large models to embodied intelligence, AI for Science, quantum computing, and AI for 6G, signaling a broad vision for future computing applications [35][36]
- Moore Threads has launched the "Moore Academy" to train GPU developers and researchers, engaging over 100,000 students across more than 200 universities [40]
- The MTT AIBOOK, an AI computing notebook, is designed to lower the barrier to AI application development, integrating multiple processing units and supporting several operating systems [42][44]
With an average employee born after 1995 and over a billion US dollars on the books, MiniMax knocks on the door of the Hong Kong stock market
机器之心· 2025-12-21 17:22
Core Viewpoint
- The rapid pace of IPOs for AI startups, exemplified by MiniMax, highlights the growing significance and potential of artificial general intelligence (AGI) in the market [2][5]

Company Overview
- MiniMax, founded in December 2021 and headquartered in Shanghai, focuses on developing multimodal artificial intelligence technologies [4]
- The company is known for its foundation models MiniMax M1 and M2, as well as AI-native products such as Hailuo AI and Xingye [4]

Market Position and User Base
- MiniMax is poised to set a record as the fastest AI company to go from founding to IPO [5]
- The company has accumulated over 212 million individual users across more than 200 countries and regions, along with over 100,000 enterprises and developers [9]
- Average monthly active users of its AI-native products surged from 3.14 million in 2023 to 27.62 million in the first nine months of 2025, indicating strong user engagement [9]

Financial Performance
- For the first nine months of 2025, MiniMax reported revenue of $53.44 million, a year-on-year increase of approximately 174.7% [9]
- Revenue comes primarily from AI-native product subscriptions and enterprise services, with a gross margin of 69.4% for B2B services [10]
- The adjusted net loss for the same period was $186 million, up 8.6% despite the significant revenue growth [22]

Research and Development
- MiniMax maintains a strong focus on R&D, with expenditure reaching $180 million in the first nine months of 2025, equivalent to 337.4% of total revenue [22]
- The company has made significant advances in multimodal AI, launching several models that have gained international recognition [13][15]

Leadership and Organizational Structure
- MiniMax's executive board is notably young, with an average age of 32, reflecting the company's tech-driven, innovation-oriented culture [29][30]
- The leadership team is deeply involved in R&D and business operations, consistent with the company's emphasis on long-term technological investment and efficiency [31]

Future Plans
- MiniMax plans to devote approximately 70% of the IPO proceeds to R&D over the next five years, focusing on model upgrades and AI-native product development [34]
- The company aims to enhance social productivity and individual quality of life under its vision of "Intelligence with Everyone" [37]
Is the ever-distant AGI just an empty promise? Two professors "got into an argument"
机器之心· 2025-12-21 04:21
Core Viewpoint
- The article discusses the limits on achieving Artificial General Intelligence (AGI) imposed by physical and resource constraints, arguing that scaling alone is not sufficient for major further advances in AI [3][20][32]

Group 1: Limitations of AGI
- Tim Dettmers argues that AGI will not happen because computation is fundamentally physical, and there are inherent limits to hardware improvement and to the scaling laws [8][10][12]
- As transistors shrink, computation keeps getting cheaper while memory access becomes relatively more expensive, which limits how much of the raw processing power can actually be used (a back-of-the-envelope illustration follows this summary) [11][17]
- The notion of "superintelligence" is critiqued as flawed: further gains in intelligence require substantial resources, so any advances will be gradual rather than explosive [28][29][30]

Group 2: Hardware and Scaling Challenges
- GPU advances have plateaued, with significant improvements in performance per dollar largely ceasing around 2018, leading to diminishing returns on hardware investment [16][17]
- Scaling AI models has become increasingly costly: linear improvements now require exponential resource investment, suggesting the benefits of scaling are nearing a physical limit [20][22]
- The economics of current AI infrastructure rely on very large user bases to justify deployment costs, which poses risks for smaller players in the market [21][22]

Group 3: Divergent Approaches in AI Development
- The article contrasts the U.S. "winner-takes-all" approach to AI development with China's focus on practical applications and productivity gains, suggesting the latter may be more sustainable in the long run [23][24]
- It emphasizes that the core value of AI lies in utility and productivity enhancement rather than in raw model capability [24][25]

Group 4: Future Directions and Opportunities
- Despite these challenges, there remain significant opportunities to improve AI systems through better hardware utilization and innovative model designs [39][45][67]
- Advances in training efficiency and inference optimization are still possible, since current models are not yet fully optimized for existing hardware [41][43][46]
- The article concludes that the path to more capable AI systems is not singular, and multiple avenues exist for substantial improvements in performance and utility [66][69]
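A quick back-of-the-envelope calculation makes the memory-versus-compute argument concrete: if peak FLOPs grow faster across hardware generations than memory bandwidth, the arithmetic intensity (FLOPs per byte moved) a workload needs to stay compute-bound keeps rising. The numbers below are made-up illustrative values, not figures from the article.

```python
# Roofline-style illustration of the compute-vs-memory imbalance. All hardware
# numbers are invented for illustration only.

def required_intensity(peak_flops: float, mem_bandwidth: float) -> float:
    """Minimum FLOPs per byte a kernel needs so memory traffic is not the bottleneck."""
    return peak_flops / mem_bandwidth

generations = [
    # (name, peak TFLOP/s, memory bandwidth TB/s) -- illustrative only
    ("gen A", 100, 1.0),
    ("gen B", 400, 2.0),   # compute 4x, bandwidth only 2x
    ("gen C", 1600, 4.0),  # compute 16x, bandwidth only 4x
]

for name, tflops, tbps in generations:
    ai = required_intensity(tflops * 1e12, tbps * 1e12)
    print(f"{name}: need >= {ai:.0f} FLOPs per byte to stay compute-bound")
```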
Stunning: Nvidia's new model can play almost every game out there
机器之心· 2025-12-21 04:21
Core Viewpoint
- The article introduces Nvidia's latest open-source model, NitroGen, which can play over 1,000 different games via AI-generated controls, showcasing significant advances in gaming automation and cross-game adaptability [5][6][8]

Group 1: Model Overview
- NitroGen is designed to play a wide variety of game genres, including RPGs, platformers, and racing games, by processing game video frames directly and emitting controller signals [6][8]
- The model supports fine-tuning for new games, allowing it to adapt quickly without starting from scratch and demonstrating its potential for cross-game generalization [8]
- NitroGen's architecture is based on the GR00T N1.5 framework, originally designed for robotics and adapted for gaming with minimal modifications [12]

Group 2: Key Components
- NitroGen consists of three core components: a multi-game agent, a universal simulator, and a large-scale dataset of gameplay videos [15][16][17]
- The multi-game agent generates controller commands from game observations, enabling zero-shot gameplay across many titles [15]
- The universal simulator standardizes interaction across different games using the Gymnasium API, facilitating large-scale training and evaluation (a minimal wrapper sketch follows this summary) [16]
- The dataset comprises 40,000 hours of publicly available gameplay video covering over 1,000 games, with automatically generated action labels [17][24]

Group 3: Data Collection and Processing
- Data collection extracted player actions from videos containing "input overlays" that show controller inputs in real time [18][19]
- The research team used keypoint matching to locate and segment the controller overlays in the videos, ensuring the model learns without "cheating" [21]
- The dataset spans a diverse distribution of game types, with action RPGs making up 34.9% of total video duration, followed by platformers at 18.4% [26]

Group 4: Performance and Results
- NitroGen demonstrates strong performance across game types, including 3D action games and 2D platformers, achieving non-trivial task completion rates [28][30]
- Fine-tuning on new games improved task success rates by up to 52% relative to models trained from scratch [32]
- The research positions NitroGen as a foundational step toward general-purpose embodied agents capable of interacting with complex environments [35][36]
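The article says NitroGen's universal simulator standardizes game interaction behind the Gymnasium API. The sketch below shows what such a wrapper could look like, with a game frame as the observation and a controller vector as the action; the class name, its dimensions, and the dummy frame source are assumptions for illustration, not part of the actual NitroGen release.

```python
# Minimal Gymnasium-style wrapper: one agent interface for many games.
import gymnasium as gym
import numpy as np

class ControllerGameEnv(gym.Env):
    """Observation: an RGB game frame. Action: a continuous gamepad vector."""
    def __init__(self, height: int = 224, width: int = 224, n_axes: int = 12):
        self.observation_space = gym.spaces.Box(0, 255, (height, width, 3), np.uint8)
        # e.g. stick axes, triggers, and button pressures flattened into one vector
        self.action_space = gym.spaces.Box(-1.0, 1.0, (n_axes,), np.float32)
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        frame = self.observation_space.sample()  # stand-in for a real captured frame
        return frame, {}

    def step(self, action: np.ndarray):
        self._t += 1
        frame = self.observation_space.sample()  # stand-in for the next captured frame
        reward = 0.0                             # a task-specific reward would go here
        terminated = False
        truncated = self._t >= 1000
        return frame, reward, terminated, truncated, {}

if __name__ == "__main__":
    env = ControllerGameEnv()
    obs, info = env.reset(seed=0)
    for _ in range(3):
        action = env.action_space.sample()       # a policy model would produce this
        obs, reward, terminated, truncated, info = env.step(action)
```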
Challenging WorldLabs: Visionary, a WebGPU rendering platform that comprehensively surpasses Marble's underlying renderer
机器之心· 2025-12-21 04:21
Core Insights
- The article discusses Visionary, a new rendering platform that uses WebGPU and ONNX to bring World Model visualization and interaction to the web, overcoming limitations of earlier technologies such as SparkJS [2][10][27]

Group 1: Challenges in Current Technologies
- Existing World Model visualization methods, particularly those built on WebGL, face significant limits when rendering dynamic and complex scenes due to CPU sorting bottlenecks [6][7][8]
- Current solutions such as SparkJS are designed mainly for static or pre-computed Gaussian rendering, making them inadequate for real-time inference of dynamic 3D Gaussian Splatting (3DGS) and Neural Avatars [7][8]

Group 2: Visionary's Innovations
- Visionary is positioned as a native web rendering substrate that brings GPU compute and rendering directly into the browser, replacing the older WebGL stack [10][25]
- It introduces a Gaussian Generator Contract that standardizes the output of various 3DGS and 4DGS methods into ONNX format, allowing Gaussian attributes to be generated and updated dynamically in real time (an illustrative contract sketch follows this summary) [11][13]

Group 3: Performance and Quality Improvements
- Experimental data indicate that Visionary significantly outperforms SparkJS in rendering efficiency, particularly in scenes with millions of Gaussian points, by moving sorting and preprocessing onto the GPU [18][21]
- Visionary performs per-frame global sorting on the GPU to eliminate the visual artifacts seen in other solutions, ensuring correct transparency even in complex multi-model scenes [21][24]

Group 4: Applications and Future Directions
- Visionary serves as a unified platform for researchers, creators, and industry, enabling rapid reproduction and comparison of 3DGS variants as well as editing and rendering directly in the browser [24][25]
- The development team views Visionary as a foundational step toward a comprehensive World Model framework, with future exploration planned in areas such as physical interaction and spatial intelligence [26][28]
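As a rough illustration of a "Gaussian Generator Contract," the sketch below checks that a generator's per-frame output exposes a fixed set of per-Gaussian attribute arrays and then orders them back-to-front for compositing (work done per frame on the GPU in the real renderer). The attribute names and shapes follow a typical 3DGS layout and are assumptions, not Visionary's actual specification.

```python
# Illustrative contract check plus a CPU stand-in for per-frame global depth sorting.
import numpy as np

CONTRACT = {            # attribute name -> trailing shape per Gaussian (assumed layout)
    "means": (3,),      # world-space positions
    "scales": (3,),     # anisotropic extents
    "rotations": (4,),  # quaternions
    "colors": (3,),     # RGB (a fuller contract might carry SH coefficients)
    "opacities": (1,),
}

def validate_generator_output(out: dict[str, np.ndarray]) -> int:
    """Check that a generator's per-frame output satisfies the contract; return N."""
    n = None
    for key, tail in CONTRACT.items():
        arr = out[key]
        if n is None:
            n = arr.shape[0]
        assert arr.shape == (n, *tail), f"{key} has shape {arr.shape}, expected {(n, *tail)}"
    return n

def depth_sort(out: dict[str, np.ndarray], cam_pos: np.ndarray) -> np.ndarray:
    """Global back-to-front ordering so alpha blending composites correctly."""
    dist = np.linalg.norm(out["means"] - cam_pos, axis=1)
    return np.argsort(-dist)  # farthest Gaussians are drawn first

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 10_000
    frame = {
        "means": rng.normal(size=(n, 3)).astype(np.float32),
        "scales": rng.uniform(0.01, 0.1, size=(n, 3)).astype(np.float32),
        "rotations": rng.normal(size=(n, 4)).astype(np.float32),
        "colors": rng.uniform(size=(n, 3)).astype(np.float32),
        "opacities": rng.uniform(size=(n, 1)).astype(np.float32),
    }
    print("gaussians:", validate_generator_output(frame))
    order = depth_sort(frame, cam_pos=np.zeros(3, dtype=np.float32))
    print("first indices to composite:", order[:5])
```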
Camera-motion error cut by 40%! DualCamCtrl fits video generation with a "depth camera" to make camera moves more "obedient"
机器之心· 2025-12-21 04:21
The co-first authors of this work are Hongfei Zhang (research assistant) and Kanghao Chen (PhD student) from EnVision Research at the Hong Kong University of Science and Technology (Guangzhou); both are advised by Professor Yingcong Chen.

Does your generative model really "understand geometry," or is it just pretending to follow the camera trajectory?

Many current video generation models claim "camera motion control," but their control signal usually depends only on the camera pose. Although recent work encodes motion information through per-pixel ray directions (Ray Condition), the model still has to infer 3D structure implicitly and therefore lacks an explicit geometric understanding of the scene. This limitation leads to inconsistent camera motion: constrained by the entanglement of appearance and structure within a single representation, the model cannot fully capture the scene's underlying geometry.

To address these challenges, a research team from the Hong Kong University of Science and Technology, Fudan University, and other institutions proposes DualCamCtrl, a new end-to-end geometry-aware diffusion framework. Targeting the shortcomings of existing methods in scene understanding and geometric awareness, the work introduces a "dual-branch diffusion architecture" that simultaneously generates RGB and depth sequences consistent with the camera motion. To make the RGB and depth modalities cooperate efficiently, DualCamCtrl further proposes a Semantic Guided Mutual Alignment mechanism, which uses semantic information as guidance, ...
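To visualize the dual-branch idea, here is a schematic PyTorch sketch in which an RGB branch and a depth branch each denoise their own latents while exchanging features through a cross-attention alignment block. The module sizes, the use of cross-attention as the alignment operator, and all names are assumptions for illustration, not the paper's actual DualCamCtrl architecture.

```python
# Schematic dual-branch denoising step with a mutual feature exchange.
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One denoising branch (RGB or depth), reduced to a toy MLP over latents."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.encode = nn.Linear(dim, dim)
        self.decode = nn.Linear(dim, dim)

    def forward(self, z: torch.Tensor, aligned: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.encode(z)) + aligned  # inject the other modality's features
        return self.decode(h)                     # predicted noise for this branch

class MutualAlignment(nn.Module):
    """Cross-attention style exchange between the two branches' features."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_feat, depth_feat):
        rgb_ctx, _ = self.rgb_from_depth(rgb_feat, depth_feat, depth_feat)
        depth_ctx, _ = self.depth_from_rgb(depth_feat, rgb_feat, rgb_feat)
        return rgb_ctx, depth_ctx

if __name__ == "__main__":
    B, T, D = 2, 8, 64                 # batch, frames, latent dim
    rgb_z, depth_z = torch.randn(B, T, D), torch.randn(B, T, D)
    align = MutualAlignment(D)
    rgb_branch, depth_branch = Branch(D), Branch(D)
    rgb_ctx, depth_ctx = align(rgb_z, depth_z)
    rgb_noise = rgb_branch(rgb_z, rgb_ctx)        # one denoising step per branch
    depth_noise = depth_branch(depth_z, depth_ctx)
    print(rgb_noise.shape, depth_noise.shape)
```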
Karpathy's year-end large language model list drew nearly two million viewers, and these are its protagonists
机器之心· 2025-12-21 03:01
Editor | Du Wei

There are only 10 days left in 2025, which means it is time for a round of year-end reviews.

For the AI field, 2025 was a year in which large language models (LLMs) evolved rapidly and major events came thick and fast.

Just yesterday, the well-known AI researcher Karpathy posted a list of the paradigm shifts he personally considers the most important, and also somewhat unexpected.

Which areas do these changes, the ones that truly reshaped the industry and impressed Karpathy at a conceptual level, fall into? Let's go through them one by one (in his first-person voice).

Reinforcement Learning from Verifiable Rewards (RLVR)

At the start of 2025, the production training pipeline for LLMs at almost every lab looked like this:

- Pre-training (similar to GPT-2/3 circa 2020);
- Supervised fine-tuning (SFT, similar to InstructGPT circa 2022);
- Reinforcement learning from human feedback (RLHF, circa 2022).

This pipeline was stable and reliable, and was long regarded as the standard recipe for "industrial-grade LLMs."

But in 2025, a new stage surfaced and quickly became the de facto standard: Reinforcement Learning from Verifiable Rewards (RLVR).

The core idea of RLVR is to train the model with reinforcement learning in environments where rewards can be verified automatically ...
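A minimal sketch of what "verifiable reward" means in practice: the reward comes from an automatic checker, here an exact-match test on a math answer, rather than from a learned human-preference model. The task format and checker are illustrative assumptions, not any particular lab's actual RLVR setup.

```python
# Toy verifiable reward: score a rollout by checking its final answer automatically.
import re

def extract_final_answer(model_output: str) -> str | None:
    """Pull the last number out of the model's response."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return numbers[-1] if numbers else None

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the known-correct answer, else 0.0."""
    answer = extract_final_answer(model_output)
    return 1.0 if answer is not None and answer == ground_truth else 0.0

if __name__ == "__main__":
    rollout = "Let's compute: 17 * 24 = 408. The answer is 408"
    print(verifiable_reward(rollout, "408"))             # 1.0 -> reinforce this rollout
    print(verifiable_reward("The answer is 400", "408")) # 0.0 -> do not reinforce
```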
What happens once AIs start competing in a race to the bottom? Tencent Hunyuan and Shanghai Jiao Tong University jointly unveil a multi-agent "Hunger Games"
机器之心· 2025-12-21 03:01
In the popular imagination of multi-agent systems, we often see a picture like this: multiple AI agents divide labor and cooperate, cracking complex tasks like an efficient team and exhibiting a "collective intelligence" beyond any single agent.

But a key question is often overlooked: what happens when these agents are no longer just "colleagues," but are forced to become "competing products," or even outright "opponents"?

The latest study from Tencent Hunyuan's digital human team and Shanghai Jiao Tong University gives a rather jarring answer: under extreme competitive pressure, LLM multi-agent systems exhibit severe "over-competition," obsessively undercutting each other, racing to the bottom, and gaming one another, which directly drags down overall task performance.

In other words, when we throw AIs into a "Hunger Games," they start to turn bad.

Paper: https://arxiv.org/abs/2509.26126

Project: https://github.com/Tencent/DigitalHuman/tree/main/HATE

A "Hunger Games"-style debate: only one survives

The study designs a high-stakes, zero-sum debate environment in which agents must choose between "cooperating to complete the task" and "avoiding elimination."

To make the competition cruel enough, the system implants a clear "survival instinct" prompt into each agent: there will be only one winner, and everyone else is removed.

The whole framework can be understood as an AI ...
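The sketch below is a schematic version of such an elimination-style debate loop: every surviving agent speaks each round under an "only one winner" prompt, a judge scores the round, and the lowest-scoring agent is removed. The prompt text, scoring rule, and loop structure are assumptions for illustration, not the paper's actual protocol.

```python
# Toy zero-sum elimination debate: agents speak, a judge scores, the weakest is removed.
import random

SURVIVAL_PROMPT = (
    "There will be only one winner. Every other agent will be removed. "
    "Argue for your solution to the task."
)

def agent_speak(name: str, task: str, rng: random.Random) -> tuple[str, float]:
    """Stand-in for an LLM call: returns an utterance and a judge-visible quality score."""
    quality = rng.random()
    return f"[{name}] my take on '{task}' (quality={quality:.2f})", quality

def hunger_games_debate(task: str, agents: list[str], seed: int = 0) -> str:
    rng = random.Random(seed)
    survivors = list(agents)
    round_no = 1
    while len(survivors) > 1:
        scores = {}
        for name in survivors:
            utterance, quality = agent_speak(name, task, rng)
            scores[name] = quality                 # a judge model would score this utterance
            print(f"round {round_no}: {utterance}")
        eliminated = min(scores, key=scores.get)   # lowest-scored agent is removed
        survivors.remove(eliminated)
        print(f"round {round_no}: {eliminated} eliminated\n")
        round_no += 1
    return survivors[0]

if __name__ == "__main__":
    winner = hunger_games_debate("summarize the report", ["A", "B", "C"], seed=42)
    print("winner:", winner)
```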