World Models
Humanoid Robots Lack a Killer Consensus
创业邦· 2025-08-26 03:37
The following article is from 星河频率 (ID: robo-wave), by 毛心如. Image source: Midjourney. With a running start and a single attempt, 星动纪元 (Robot Era)'s L7 cleared 95.641 cm, setting a world record for the humanoid-robot high jump. At 171 cm tall and 65 kg, even an ordinary person might not manage a jump that high and that clean, a genuine Super Mario leap. This year's World Humanoid Robot Games served up plenty of eye-catching "fail" moments, but it should not be overlooked that the events themselves, whether running, high jump, or long jump, deeply test how tightly a robot's algorithms and hardware are coupled. Meanwhile 宇树科技 (Unitree), the company that took the most championships at the Games, drew attention for another reason: at the World Robot Conference forum, founder 王兴兴 questioned the currently popular VLA route, remarks that many called "explosive" or even a "hot take." 陈建宇, founder of fellow champion team 星动纪元, takes a different attitude toward VLA than 王兴兴. Behind the split in views are the two companies' different practical paths toward the same goal of making robots more capable: one is "hardware first," the other "integrated software and hardware, vertical integration." The divide between vertical integration and hardware-first; the two founders' differing backgrounds ...
Video Generation vs. Spatial Representation: Which Path Should World Models Take?
机器之心· 2025-08-24 01:30
机器之心PRO · Member Newsletter, Week 34. This week we unpack 2 noteworthy stories in AI & Robotics. 1. Video generation vs. spatial representation: which path should world models take? Do the high-quality frames produced by video prediction really mean the model understands physics and causality? Can modeling directly in latent space avoid pixel-level noise while preserving decision-making and planning ability? Could a hybrid route become the optimal path for future world models? As generative models and latent-representation techniques advance, can AGI's "thought-experiment sandbox" truly be applied to physical-world tasks? ... 2. Hire geniuses or stack compute? A former Llama inference lead on AI's real ceiling. Is the industry's true ceiling set by the inspiration of genius researchers or by exponentially growing compute? If compute growth slows, will the AI industry hit a "growth fatigue" inflection point? Can high-level conceptual ideas drive real model leaps without systematic experimental validation? Is the ceiling on model generalization raised by upgrading models or by designing higher-quality new test problems? ... The full newsletter contains 2 featured analyses plus 30 AI & Robotics news briefs from this week (12 technical, 8 domestic, 10 international), 20,464 characters in total, with a free preview of the first 9% ...
拾象 AGI Watch: LLM Routes Diverge, the Non-Technical Moats of AI Products, and the Agent "Freshness Window"
海外独角兽· 2025-08-22 04:06
Interviewers: 李广密, 张小珺. The "Global Foundation Model Quarterly" is an AI observation column run by 海外独角兽 and the podcast 张小珺 Jùn|商业访谈录: each quarter, 拾象 CEO 李广密 and financial writer 张小珺 sort through the key signals in the LLM field and make predictions. Key takeaways:
• Both intelligence and product matter; ChatGPT enjoys many non-technical moats, whereas Coding or model companies have only technical ones;
• Building AI products is like mining; the freshness window is critical, and it is clearly shrinking;
• ChatGPT's Deep Research and Anthropic's Claude Code were the first to deliver an L4-level experience, for information search and software development respectively;
• Taken to the extreme, a Coding company that does not build its own models will have no advantage; the future comes down to cost.
A thought experiment: if you had to join an AI company, or pick a great CEO, for the next 4 years, what would you choose? Leave your answer in the comments. 01. Models begin to diverge. Guangmi Li: Three key points to remember from this episode: 1. Foundation models are both diverging and converging; the Q2 2025 explosion of global foundation models was stronger than ever, and Silicon Valley's model companies are diverging into different domains, for example, besides Google Gemini and OpenAI there are also ...
From "Inner Worlds" to Virtual Creations: The Past and Present of World Models
Jing Ji Guan Cha Bao· 2025-08-21 08:25
By 陈永伟. On August 5, Google DeepMind released its new model, Genie 3. Given a user's text or image prompt, the model generates, in real time, a 3D virtual environment in which users and AI agents can interact. Type "beside a volcano on the moon," for instance, and Genie 3 instantly renders a floating volcano, yellow ground, and a distant cosmic backdrop, and lets the user walk in and explore. Compared with earlier AI models, Genie 3 shows much stronger real-time interactivity and stands out in interaction duration and memory coherence. If a user doodles on the wall of a generated room, turns away to explore elsewhere, and later returns, the doodle is still on the wall. Beyond that, Genie 3 introduces "Promptable World Events," which let users dynamically alter the world mid-interaction through new text commands. Whether the request is "add a running puppy," "change the weather from sunny to heavy rain," or "move the scene from the seaside to the mountains," Genie 3 responds instantly. Genie 3's strong showing not only redraws the boundary of AI-generated worlds; it also revives hope in another route toward artificial general intelligence (AGI): the "world model." Discussion of world models has suddenly filled the media. So, what is a ...
Context as Memory! HKU & Kuaishou Propose a Scene-Consistent Interactive Video World Model Whose Memory Rivals Genie 3, Released Earlier!
量子位· 2025-08-21 07:15
Core Viewpoint
- The article presents "Context-as-Memory," a framework from a research team at the University of Hong Kong and Kuaishou that significantly improves scene consistency in interactive long-video generation by efficiently reusing historical context frames [8][10][19].

Summary by Sections

Introduction to Context-as-Memory
- The framework tackles scene inconsistency in AI-generated video with a memory-retrieval system that selects relevant historical frames to maintain continuity [10][19].

Types of Memory in Video Generation
- Two kinds of memory are distinguished: dynamic memory for short-term actions and behaviors, and static memory for scene-level and object-level information [12][13].

Key Concepts of Context-as-Memory
- Long video generation requires long-term historical memory to keep scenes consistent over time [15].
- Memory retrieval is crucial: using all historical frames directly is computationally expensive, so a retrieval module is needed to filter for useful information [15].
- Context memory is formed by concatenating the selected context frames with the input, letting the model reference historical information while generating each frame [15][19].

Memory Retrieval Method
- The model uses a camera-trajectory-based search to select context frames whose visible area overlaps significantly with the current frame's, improving both computational efficiency and scene consistency [20][22].

Dataset and Experimental Results
- A dataset of 100 videos with 7,601 frames each was built in Unreal Engine 5 to evaluate the method [23].
- Experimental results show that Context-as-Memory outperforms baseline and state-of-the-art methods in memory capability and generation quality, demonstrating its effectiveness at maintaining long-video consistency [24][25].

Generalization of the Method
- Generalization was tested with initial frames in a variety of visual styles, confirming strong memory capability in open-domain scenarios [26][27].

Research Team and Background
- The work is a collaboration among the University of Hong Kong, Zhejiang University, and Kuaishou, led by PhD student Yu Jiwen under Professor Liu Xihui [28][33].
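The camera-trajectory-based retrieval summarized above lends itself to a short sketch. The following Python is an illustrative reconstruction, not the authors' code: the pose format, the yaw-and-distance proxy for field-of-view overlap, and the function names (`view_overlap`, `select_context_frames`) are all assumptions.

```python
import numpy as np

# Illustrative sketch of trajectory-based memory retrieval (not the authors' code).
# Each historical frame stores the camera pose it was rendered from; we keep the
# frames whose view frustum plausibly overlaps the current view.

def view_overlap(pose_a, pose_b, fov_deg=90.0, max_dist=10.0):
    """Cheap proxy for frustum overlap between two camera poses.

    pose = (position: np.ndarray of shape (3,), yaw_deg: float).
    Returns a score in [0, 1]; 0 means the views almost surely share no content.
    """
    (pos_a, yaw_a), (pos_b, yaw_b) = pose_a, pose_b
    ang = abs((yaw_a - yaw_b + 180.0) % 360.0 - 180.0)   # smallest yaw difference
    dist = np.linalg.norm(pos_a - pos_b)
    ang_score = max(0.0, 1.0 - ang / fov_deg)            # penalize diverging headings
    dist_score = max(0.0, 1.0 - dist / max_dist)         # penalize far-apart cameras
    return ang_score * dist_score

def select_context_frames(history, current_pose, k=8):
    """Pick the k historical frames most likely to overlap the current view.

    history: list of (frame, pose) tuples accumulated during generation.
    Returns the frames to concatenate with the input as context memory.
    """
    scored = [(view_overlap(pose, current_pose), i) for i, (_, pose) in enumerate(history)]
    scored.sort(reverse=True)
    return [history[i][0] for score, i in scored[:k] if score > 0.0]
```

The design point survives the simplification: scoring candidate frames by how much their view plausibly overlaps the current one lets the model condition on a handful of relevant frames rather than the full, ever-growing history.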
Context Memory Rivaling Genie 3, and Released Earlier: HKU and Kling Propose a Scene-Consistent Interactive Video World Model
机器之心· 2025-08-21 01:03
For a video generation model to truly become a "world model" that simulates the physical world, it must be able to generate over long horizons while retaining scene memory. Interactive long-video generation, however, has long had a fatal weakness: the lack of stable scene memory. Pan the camera away and back, and the scenery may have "switched worlds." This problem severely limits the adoption of video generation in downstream applications such as gaming, autonomous driving, and embodied intelligence. In early August, Google DeepMind's Genie 3 set the AI community ablaze with its ability to preserve strong scene consistency across long video generation, and it was hailed as a step change for world models. Regrettably, Genie 3 disclosed no technical details. The Context as Memory paper recently published by a team from HKU and Kuaishou's 可灵 (Kling) may be the academic work closest to Genie 3 in effect, and it was submitted before Genie 3's release. In earlier research the team had already found that video generation models can implicitly learn 3D priors from video data, with no explicit 3D modeling required, which coincides with Genie 3's philosophy. A sample result is shown below. Technically, the team proposes treating previously generated context as "memory" (Context-as-Memory), using in-context learning to learn ...
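Concretely, "context as memory" conditioning amounts to concatenating retrieved frames with the model's input so attention layers can read them. A minimal sketch, assuming token-shaped latents and a tokenizer supplied by the surrounding pipeline; the names and shapes here are hypothetical, not from the paper:

```python
import torch

# Minimal sketch of context-as-memory conditioning (assumed shapes and names):
# retrieved context frames are tokenized and concatenated with the noisy input
# tokens along the sequence axis, so attention layers can read them as memory.

def build_conditioned_input(noisy_latent, context_frames, tokenize):
    """noisy_latent: (B, N, D) tokens of the frame being denoised.
    context_frames: list of raw retrieved frames from the generation history.
    tokenize: callable mapping one frame to (B, N, D) tokens (assumed given).
    """
    ctx_tokens = [tokenize(f) for f in context_frames]
    # Sequence-axis concat: the denoiser attends over [context ... | current].
    return torch.cat(ctx_tokens + [noisy_latent], dim=1)
```

One attraction of plain concatenation over a bespoke memory module is that a pretrained video backbone can consume the extra tokens without architectural changes.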
An Open-Source Genie 3-Style World Model Arrives: Real-Time, Long-Horizon Interaction, Runs on a Single GPU, Built by a Chinese Company
机器之心· 2025-08-19 02:43
Core Viewpoint
- The article covers the launch of Matrix-Game 2.0, an open-source interactive world model from Kunlun Wanwei that demonstrates major advances in real-time interactive generation and the simulation of complex environments, rivaling proprietary models such as Google DeepMind's Genie 3 [1][3][11].

Group 1: Product Overview
- Matrix-Game 2.0 is an open-source model with 1.8 billion parameters, capable of running on a single GPU and generating virtual environments at 25 FPS [12][36].
- Users can upload images and interact with the generated virtual world via keyboard controls, enabling real-time movement and perspective changes [19][40].
- The model simulates realistic environments, including complex terrain and dynamic elements, enhancing user immersion [8][21].

Group 2: Technical Innovations
- The model adopts a visual-driven interactive world-modeling approach, moving away from traditional language prompts toward visual understanding and the learning of physical laws [35][40].
- Matrix-Game 2.0 integrates an autoregressive diffusion generation mechanism that helps produce longer videos while minimizing content drift and error accumulation [42][45].
- The training data pipeline includes over 1.2 million video clips, with accuracy exceeding 99% [37][38].

Group 3: Market Impact and Future Prospects
- The arrival of Matrix-Game 2.0 signals a shift in the world-model landscape, indicating that such technologies are moving toward practical applications in fields including gaming and robotics [57][59].
- World models show potential as training environments for AI, addressing challenges such as data scarcity and generalization in embodied intelligence [57][58].
- Kunlun Wanwei's continued open-source efforts are expected to accelerate the practical deployment of world models across sectors [54][59].
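The autoregressive (chunked) generation loop described in Group 2 can be sketched as follows. This is a hedged sketch of the control flow only: `model.generate_chunk`, the action encoding returned by `read_action`, and the chunk and context sizes are hypothetical stand-ins, not Matrix-Game 2.0's actual interface.

```python
import torch

# Hypothetical sketch of a chunked autoregressive rollout: each new short clip
# is conditioned on the latest user action and the tail of what was already
# generated, so drift is bounded chunk by chunk instead of accumulating freely.

def interactive_rollout(model, first_frame, read_action, steps=1000, chunk=4, ctx=16):
    frames = [first_frame]                      # list of (C, H, W) tensors
    while len(frames) < steps:
        action = read_action()                  # e.g. one-hot of WASD + mouse delta
        context = torch.stack(frames[-ctx:])    # condition on the recent tail only
        new = model.generate_chunk(context, action, n_frames=chunk)
        frames.extend(new)                      # display as they arrive (~25 FPS)
    return frames
```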
One Image Opens Four-Dimensional Spacetime: 4DNeX Brings Dynamic Worlds to Life
机器之心· 2025-08-18 03:22
Core Viewpoint
- The article introduces 4DNeX, a framework developed by Nanyang Technological University's S-Lab and the Shanghai Artificial Intelligence Laboratory that generates 4D dynamic scenes from a single input image, a significant advance for AI world modeling [2][3].

Group 1: Research Background
- World models are gaining traction in AI research; Google DeepMind's Genie 3 can generate interactive videos from high-quality game data but lacks validation in real-world scenarios [5].
- A pivotal capability for world models is accurately depicting dynamic 3D environments that obey physical laws, enabling realistic content generation and supporting "counterfactual" reasoning [5][6].

Group 2: 4DNeX-10M Dataset
- The 4DNeX-10M dataset comprises nearly 10 million frames of 4D-annotated video spanning indoor and outdoor environments, natural landscapes, and human motion, with an emphasis on "human-centered" 4D data [10].
- The dataset is built with a fully automated labeling pipeline, from sourcing in public video libraries through quality-control measures that ensure high fidelity [12][14].

Group 3: 4DNeX Method Architecture
- 4DNeX proposes a unified 6D representation capturing both appearance (RGB) and geometry (XYZ), allowing simultaneous multi-modal generation without explicit camera control [16].
- Its key strategy, "width fusion," minimizes cross-modal distance by directly concatenating RGB and XYZ data, outperforming other fusion methods [18][20].

Group 4: Experimental Results
- 4DNeX achieves significant breakthroughs in both efficiency and quality, with a dynamic range of 100% and temporal consistency of 96.8%, surpassing existing methods such as Free4D [23].
- In user studies, 85% of participants preferred 4DNeX's outputs, particularly citing its advantages in motion range and realism [23][25].
- Ablation studies confirmed the width-fusion strategy's critical role in multi-modal integration, eliminating the noise and alignment issues present in other approaches [28].
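As a rough illustration of the "width fusion" strategy named above: the two modalities are laid side by side along the width axis so a single video backbone can denoise appearance and geometry in one canvas while keeping them spatially aligned. A minimal sketch, with tensor shapes assumed rather than taken from the paper:

```python
import torch

# Sketch of width fusion: RGB frames and per-pixel XYZ coordinate maps are
# concatenated along the width axis into one double-wide video, processed by
# a single backbone, then split back apart after generation.

def width_fuse(rgb, xyz):
    """rgb, xyz: (T, C, H, W) video tensors with matching shapes."""
    assert rgb.shape == xyz.shape
    return torch.cat([rgb, xyz], dim=-1)        # (T, C, H, 2W)

def width_split(fused):
    """Invert the fusion after generation to recover the two modalities."""
    width = fused.shape[-1] // 2
    return fused[..., :width], fused[..., width:]
```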
Video Rebirth's 刘威: Video Generation Models Are the Best Path to Building World Models
IPO早知道· 2025-08-18 02:31
Core Viewpoint
- Video Rebirth defines the video-native world model as a combination of a world simulator and a world predictor, positioning video generation models as the optimal path to building world models, which may mark a critical breakthrough in AI's transition from perception to cognition [2][4].

Group 1: Technological Framework
- A world model should possess three core capabilities: simulation (emulation), prediction (causal reasoning), and exploration (planning and decision-making). Simulation corresponds to fast thinking, prediction to slow thinking, and exploration to active thinking, all essential to the world model [3].
- Current multi-modal models such as GPT-4o can handle varied inputs and outputs but remain in a passive response mode, lacking comprehensive environmental modeling and predictive capability. The world model aims to shift from passive to proactive, sustained thinking [3].

Group 2: Innovations and Future Directions
- The emergence of Sora offered significant evidence for the world-model thesis, demonstrating feasibility through video generation and a high level of spatiotemporal simulation. Although the current version has limitations, it provides a practical technical starting point [3].
- Video Rebirth aims to address key shortcomings of the mainstream DiT architecture, such as the lack of causal reasoning and the inability to intervene interactively, by developing its own technical propositions and model paradigms, potentially leading to a "ChatGPT moment" for video generation [4].
- The company holds that AI needs not only grand narratives but also the creation of realistic scenarios; by approaching world modeling through video generation, Video Rebirth seeks major technical innovation at a critical juncture for breakthroughs in AI's cognitive capabilities [4].
Diffusion World Model LaDi-WM Substantially Boosts Robot Manipulation Success Rates and Cross-Scene Generalization
具身智能之心· 2025-08-18 00:07
Core Viewpoint
- The article presents LaDi-WM (Latent Diffusion-based World Model), a novel world model that enhances robotic manipulation through predictive policies, addressing the challenge of accurately predicting future states in robot-object interactions [1][5][28].

Group 1: LaDi-WM Overview
- LaDi-WM uses pre-trained vision foundation models to build latent-space representations covering both geometric and semantic features, facilitating policy learning and cross-task generalization [1][5][10].
- The framework has two main phases, world-model learning and policy learning, and iteratively refines action outputs based on predicted future states [9][12].

Group 2: Methodology
- World-model learning extracts geometric representations with DINOv2 and semantic representations with SigLIP, then applies an interactive diffusion process to improve dynamic-prediction accuracy [10][12].
- Policy training feeds the world model's future predictions back in as additional inputs, guiding the model toward better action predictions and reducing the entropy of the output distribution over iterations [12][22].

Group 3: Experimental Results
- In virtual experiments on the LIBERO-LONG benchmark, LaDi-WM reached a 68.7% success rate with only 10 training trajectories, outperforming prior methods by a significant margin [15][16].
- On the CALVIN D-D benchmark, the framework completed task chains with an average length of 3.63, indicating robust capability on long-horizon tasks [17][21].
- Real-world experiments showed a 20% increase in success rates on tasks such as stacking bowls and operating drawers, validating LaDi-WM in practical settings [25][26].

Group 4: Scalability and Generalization
- Scaling experiments showed that increasing the world model's training data reduced prediction error and improved policy performance [18][22].
- The world model's generalization was highlighted by its ability to guide policy learning across different environments, outperforming models trained solely in the target environment [20][21].
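The predict-then-act loop described in Groups 1 and 2 can be sketched roughly as below. The encoder split mirrors the summary (DINOv2 for geometry, SigLIP for semantics), but the world-model and policy interfaces and the fixed iteration count are illustrative assumptions, not the authors' implementation.

```python
import torch

# Hedged sketch of iterative policy refinement with a latent world model:
# encode the observation into a combined geometric + semantic latent, let the
# world model imagine the future under a candidate action, and feed that
# prediction back to the policy to refine the action.

def refine_action(obs, world_model, policy, encode_geom, encode_sem, iters=3):
    latent = torch.cat([encode_geom(obs), encode_sem(obs)], dim=-1)
    action = policy(latent, future=None)              # initial guess, no lookahead
    for _ in range(iters):
        future = world_model.predict(latent, action)  # imagined future latents
        action = policy(latent, future=future)        # refine with the prediction
    return action
```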