世界模型 (World Models)
At the crossroads of embodied intelligence, this forum dug deep into data, models, and infrastructure
机器之心· 2025-09-29 02:52
Original report by 机器之心 | Author: Zhang Qian. As robots become the most-watched highlight of major tech expos, and as embodied-intelligence forums fill to capacity with tickets hard to come by, one thing is clear: the field is enjoying an unprecedented wave of attention. Yet beneath the hype, many key questions remain open. Facing data scarcity, some pin their hopes on breakthroughs in synthetic data, while others insist real-robot data is fundamental. In the debate over technical routes, some bet on the end-to-end paradigm, while others believe hierarchical architectures better fit the path of evolution. As for model form, some see VLA as the final destination of intelligence, while others believe world models are the real future. Such disagreement is perfectly normal at this stage, because the industry's development path has yet to converge. Some questions have not even been systematically discussed, such as which new bottlenecks will appear after mass production, and who will solve them. Precisely because of these open questions, the industry urgently needs a platform for open dialogue. At the Embodied Intelligence Forum of this year's 云栖大会 (Apsara Conference), we witnessed exactly such a deep exchange: representatives of different camps sat at the same table and laid out their technical disagreements, business thinking, and infrastructure needs, seeking new consensus through debate. After the forum, we also spoke with its initiator, Alibaba Cloud. That this cloud-computing giant has chosen to get deeply involved in embodied intelligence at this moment is itself noteworthy. As it turns out, its real entry dates back four or five years, and it is now preparing in advance for what the embodied-intelligence industry will soon ...
A developer spent a month replicating DeepMind's world model: 3 million parameters suffice for a real-time interactive pixel game
36Ke · 2025-09-28 10:51
Core Insights
- The article discusses the development of TinyWorlds, a world model created by the X blogger anandmaj, which replicates the core ideas of DeepMind's Genie 3 with only 3 million parameters, capable of generating playable pixel-style environments in real time [1][6].
Group 1: Understanding World Models
- World models are a type of neural network that simulate the physical world by generating videos, showcasing emergent capabilities similar to those found in large language models (LLMs) [2][6].
- DeepMind's Genie 3 demonstrated that training on large-scale video data allows advanced behaviors to emerge without the need for action-labeled data [2][6].
Group 2: Dataset Construction
- TinyWorlds' dataset consists of processed YouTube gaming videos, including titles like Pong, Sonic, Zelda, Pole Position, and Doom, which define the environments the model can generate [7].
Group 3: Model Architecture
- The core of TinyWorlds is a space-time transformer that captures video information through spatial attention, temporal attention, and a feedforward network [10].
- The model employs an action tokenizer to automatically generate frame-to-frame action labels, enabling training on unlabeled data [18].
Group 4: Training Dynamics
- The dynamics model serves as the "brain" of the system, combining video and action inputs to predict future frames, with initial performance limitations addressed by scaling the model [21].
- The introduction of masked frames and variance loss during training helps the model better utilize action signals [20].
Group 5: Performance and Future Prospects
- Despite having only 3 million parameters, TinyWorlds can generate interactive pixel-style worlds, although the output remains somewhat blurry and incoherent [23][24].
- The author suggests that scaling the model to hundreds of billions of parameters and incorporating diffusion methods could significantly enhance the quality of generated content [24].
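The space-time transformer described above can be sketched in a few lines. This is a minimal illustration of factorized space-time attention (spatial attention within each frame, then temporal attention at each patch position, then a feedforward network), not the actual TinyWorlds code; shapes, weights, and the single-head attention are simplifying assumptions.

```python
# Minimal sketch of a factorized space-time transformer block, assuming a
# video laid out as (T frames, N patches per frame, D channels). Not the
# real TinyWorlds implementation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    """Single-head self-attention mixing the second-to-last axis of x
    (queries, keys, and values are all x, for brevity)."""
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def space_time_block(video, d_ff=64):
    T, N, D = video.shape
    x = video + attention(video)               # spatial attention within each frame
    x_t = x.swapaxes(0, 1)                     # (N, T, D): one sequence per patch
    x = (x_t + attention(x_t)).swapaxes(0, 1)  # temporal attention per patch position
    rng = np.random.default_rng(0)             # random FFN weights for the sketch
    w1 = rng.normal(size=(D, d_ff)) * 0.1
    w2 = rng.normal(size=(d_ff, D)) * 0.1
    return x + np.maximum(x @ w1, 0) @ w2      # feedforward with residual connection

out = space_time_block(np.random.default_rng(1).normal(size=(4, 9, 8)))
print(out.shape)  # (4, 9, 8): shape is preserved, so blocks can be stacked
```

Factorizing attention this way is what makes 3D video tractable: full attention over all T×N tokens costs O((TN)²), while the spatial/temporal split costs O(TN² + NT²).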
A developer spent a month replicating DeepMind's world model: 3 million parameters suffice for a real-time interactive pixel game
机器之心· 2025-09-28 10:29
Core Insights
- The article discusses the development of TinyWorlds, a minimal world model inspired by DeepMind's Genie 3, capable of generating playable pixel-style environments with only 3 million parameters [1][9][32].
Group 1: Understanding World Models
- World models are a type of neural network that simulate the physical world by generating videos, showcasing emergent capabilities when trained on large-scale video data [5][7].
- The challenge lies in the need for frame-by-frame action labels during training, which limits the use of unannotated video data from the internet [5][6].
- Genie 1's solution was to train an action tokenizer to infer action labels, enabling the use of vast amounts of unannotated video for training [5][6].
Group 2: Dataset Construction
- TinyWorlds' dataset consists of processed YouTube gaming videos, determining the range of environments the model can generate [11][12].
Group 3: Architecture and Tokenization Strategy
- TinyWorlds employs a space-time transformer to handle three-dimensional video data, capturing video information through a three-layer mechanism [15][17].
- The architecture combines spatial attention, temporal attention, and a feedforward network to extract higher-level features [21][22].
- The video tokenizer compresses videos into tokens, while the action tokenizer predicts actions between frames, allowing training on unannotated data [24][26].
Group 4: Training the World Generator
- The dynamics model serves as the system's "brain," predicting future frames from video and actions, with performance improving significantly as model size increases [30][32].
- Despite its 3 million parameters, TinyWorlds can generate interactive pixel-style worlds, though the output remains somewhat blurry and incoherent [32].
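The action-tokenizer idea above — inferring a discrete action label from a pair of adjacent frames, so that unannotated video becomes usable training data — can be illustrated with simple vector quantization. Everything here is a toy assumption: the real tokenizer learns its codebook and transition embedding end-to-end, whereas this sketch uses a random codebook and raw frame differences.

```python
# Toy illustration of an action tokenizer: vector-quantize each frame
# transition against a small codebook of latent actions. The random
# codebook and frame-difference embedding are assumptions for the sketch.
import numpy as np

def action_token(prev_frame, next_frame, codebook):
    """Return the index of the codebook entry closest to the transition."""
    delta = (next_frame - prev_frame).ravel()      # toy transition embedding
    dists = ((codebook - delta) ** 2).sum(axis=1)  # squared L2 to each code
    return int(dists.argmin())                     # discrete action label

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))                # 8 possible latent actions
frames = rng.normal(size=(5, 4, 4))                # 5 frames of 4x4 "video"
labels = [action_token(frames[i], frames[i + 1], codebook)
          for i in range(len(frames) - 1)]
print(labels)  # four discrete action labels, one per frame transition
```

The key property is that the labels are discrete and come for free from the video itself, so the dynamics model can be trained to predict the next frame conditioned on them without any human annotation.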
Meta bets on an "Android-style" robotics platform: tens of billions of dollars for universal software
Huan Qiu Wang Zi Xun· 2025-09-28 04:24
Group 1
- Meta CTO Andrew Bosworth announced that humanoid robots have been elevated to a strategic priority on par with augmented reality (AR) [1]
- The company plans to invest "tens of billions" in developing a universal software platform for humanoid robots, aiming to become the "Android" of the robotics industry [1][2]
- Meta does not intend to mass-produce hardware but will follow Google's open approach in the smartphone sector, allowing any compliant robot body to run Meta's operating system [2]
Group 2
- Bosworth noted that the main challenge lies in software rather than hardware: current humanoid robots can run and perform flips but still struggle with dexterous manipulation [2]
- To tackle fine motor skills, Meta established a "superintelligence" AI lab earlier this year to create a "world model" that simulates real physical laws [2]
- This model aims to give robots spatial awareness, force-control prediction, and real-time decision-making capabilities, compensating for the limitations of traditional sensor-feedback systems [2]
Meta CTO: humanoid robots are the next "AR-scale bet"; the bottleneck is software
Xin Lang Cai Jing· 2025-09-27 06:46
Meta CTO Andrew Bosworth revealed that earlier this year, at Zuckerberg's direction, he launched a robotics research program: "Hardware is not the bottleneck; the bottleneck is software." Meta hopes to develop a "world model" to help robots "run software simulations to achieve dexterous arm movements," potentially extending to more complex actions and tasks in the future. ...
Ten key terms for the AI industry in 2025
机器人圈· 2025-09-26 09:29
Core Insights
- The 2025 Artificial Intelligence Industry Conference highlighted ten key trends in AI, emphasizing the convergence of technology, applications, and ecosystems, leading to a clearer vision of a smart-native world [1].
Group 1: Foundation Super Models
- In 2025, foundational models and reasoning models are advancing simultaneously, with overall capability up more than 30% from late 2024 to August 2025 [3][4].
- Key features of leading large models include the integration of thinking and non-thinking modes, enhanced understanding and reasoning abilities, and built-in agent capabilities for real-world applications [4][6].
- The emergence of foundational super models simplifies user interaction, enhances workflow precision, and raises new data-supply requirements [6].
Group 2: Autonomous Intelligent Agents
- Highly encapsulated intelligent-agent products are unlocking the potential of large models, performing better on complex tasks than single models [9][10].
- Current intelligent agents still have significant room for improvement, particularly in long-duration task execution and interconnectivity [12].
Group 3: Embodied Intelligence
- Embodied intelligence is transitioning from the laboratory to real-world applications, with models being deployed in practical scenarios [15][16].
- Challenges remain in data quality, model generalization, and hardware-software coordination for effective task execution [18].
Group 4: World Models
- World models are emerging as a core pathway to artificial general intelligence (AGI), focusing on capabilities such as data generation, action interpretation, environment interaction, and scene reconstruction [21][22].
- The development of world models faces challenges such as unclear definitions, diverse technical routes, and limited application scope [22].
Group 5: AI Reshaping Software
- AI is transforming the software-development lifecycle, with sharp increases in token usage for programming tasks and the introduction of advanced AI tools [25][28].
- The role of software developers is evolving into more complex roles, leading to the emergence of "super individuals" [28].
Group 6: Open Intelligent Computing Ecosystem
- The intelligent-computing landscape is shifting toward an open-source model, fostering collaboration and innovation across sectors [30][32].
- The synergy between software and hardware is improving, with domestic hardware achieving performance parity with leading systems [30].
Group 7: High-Quality Industry Data Sets
- The focus of AI data-set construction is shifting from general-purpose to high-quality industry-specific data sets, addressing critical quality issues [35][38].
- New data supply chains are needed to support advanced techniques such as reinforcement learning and world models [38].
Group 8: Open Source as Standard
- Open-source initiatives are reshaping the AI landscape, with significant adoption of domestic open-source models and a growing number of active developers [40][42].
- The business model is evolving toward "open-source free + paid high-level services," promoting cloud services and chip demand [42].
Group 9: Mitigating Model Hallucinations
- Hallucination in large models is becoming a significant barrier to application, with ongoing research into mitigation strategies [44][46].
- Approaches under exploration include improving data quality, model training, and user-side testing to reduce hallucination rates [46].
Group 10: AI as an International Public Good
- Global AI development is uneven, necessitating international cooperation to promote equitable access to AI technologies [49][51].
- Strategies are being implemented to address challenges in cross-border compliance and data flow, aiming to make AI a truly shared international public good [51].
Putting a "runnable code world" into AI: Meta open-sources the first code world model, letting AI think like a programmer
36Ke · 2025-09-25 13:02
Core Insights
- Meta's FAIR team has launched the Code World Model (CWM), a large language model (LLM) with 32 billion parameters and a context length of up to 131k tokens, aimed at integrating "world model" concepts into code generation and reasoning [1][2][3]
- CWM is designed not only to write code but also to simulate code execution, reason about program states, and detect and fix its own bugs, deepening the model's understanding of code execution [2][3]
Training Phases
- The training of CWM is divided into three main phases:
  - Pre-training on 8 trillion tokens, of which roughly 30% are code-related [3][4]
  - Mid-training, which adds 5 trillion tokens of world-modeling data and extends the context length to 131k tokens [4][6]
  - Post-training (SFT + RL): 100 billion tokens for instruction-following and reasoning, followed by large-scale multi-task reinforcement learning on 172 billion tokens [4][10]
Data Utilization
- CWM's world-model capabilities are driven by two main types of mid-training data:
  - Execution traces from Python, which teach the model how code execution alters local state [6][8]
  - Interaction trajectories from an automated agent executing tasks in repositories, with around 3 million trajectories collected from 10.2k images and 3.15k repositories [9]
Performance Metrics
- In benchmarks, CWM performed strongly: 65.8% pass@1 on SWE-bench Verified with Test-Time-Scaling enabled, plus notable results on LiveCodeBench (68.6%), Math-500 (96.6%), and AIME 2024 (76.0%) [10][12]
- CWM is competitive with larger or closed-source LLMs, nearing GPT-4 levels, though it has limitations in certain editing formats and multi-language scenarios [12]
Industry Reception
- The release has drawn significant attention, with Meta's AI researchers actively promoting it and highlighting its potential impact on software development [13][15]
- While the open-sourcing of CWM's training checkpoints is praised for its utility in academic and engineering replication, there are concerns about the model's computational demands and the need for practical testing in real development environments [15]
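The "execution traces that show how code execution alters local state" can be made concrete with Python's standard sys.settrace hook, which records the local variables visible at each executed line. The trace format below is purely illustrative, not CWM's actual data format.

```python
# Illustrative sketch of execution-trace data: record (line number, locals)
# after each line executed inside a target function, via sys.settrace.
# The (lineno, locals) tuple format is an assumption for this sketch.
import sys

def trace_locals(func, *args):
    """Run func(*args), recording the local state at every traced line."""
    trace = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always restore the default (no) trace function
    return result, trace

def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, trace = trace_locals(running_sum, 3)
print(result)        # 3  (0 + 1 + 2)
print(trace[-1][1])  # locals at the last traced line: n=3, total=3, i=2
```

Traces like this pair each source line with the state change it causes, which is exactly the supervision signal the article says teaches a model to "simulate code execution" rather than just pattern-match on static source text.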
Is code generation about to change? After being rumored sidelined, Yann LeCun "strikes back" with a 32-billion-parameter open-source world model
AI前线· 2025-09-25 08:04
Core Viewpoint
- The article discusses the release of the Code World Model (CWM) by Meta, which aims to enhance code generation by integrating a deeper understanding of code execution, addressing the limitation of previous models that could generate syntactically correct code but failed in execution [4][10].
Group 1: Model Overview
- CWM is the first open-source code world model with 32 billion parameters, designed to advance code-generation research based on world models [4][5].
- Unlike traditional models trained on static code, CWM incorporates dynamic interaction data from Python interpreters and Docker environments to improve its understanding of and reasoning about code [7][14].
- The model can simulate the step-by-step execution of code, understanding how variables change and what feedback the program receives [7][10].
Group 2: Performance Metrics
- CWM achieved 65.8% on the SWE-bench Verified task, outperforming all other open-source models of similar size and nearing GPT-4 levels [8].
- It scored 68.6% on LiveCodeBench, 96.6% on Math-500, and 76.0% on AIME 2024, showing strong performance across benchmarks [8].
Group 3: Training Methodology
- Training involved three key phases (pre-training, mid-training, and post-training), using supervised fine-tuning (SFT) and reinforcement learning (RL) [15][16].
- The model was pre-trained on 8 trillion tokens, followed by mid-training on code world-modeling data with an additional 5 trillion tokens, enhancing its contextual understanding [15][16].
Group 4: Industry Context and Implications
- The release of CWM marks a significant step in Meta's AI strategy, especially following the restructuring of its AI business [5][23].
- The model's development reflects a shift toward balancing open-source initiatives with commercial interests as Meta navigates its AI strategy amid organizational changes [26].
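For context on the benchmark numbers above: "pass@1" is the fraction of problems solved by a single sampled solution. When n samples are drawn per problem and c of them pass the tests, pass@k is conventionally computed with the unbiased estimator popularized by OpenAI's HumanEval benchmark, 1 - C(n-c, k) / C(n, k), averaged over problems.

```python
# Unbiased pass@k estimator for one problem: n generated samples, c correct.
from math import comb

def pass_at_k(n, c, k):
    """Estimated probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# One problem, 10 samples, 2 of which pass: pass@1 estimate is 2/10.
print(round(pass_at_k(10, 2, 1), 4))  # 0.2
```

Naively computing (c/n)^k-style averages over-biases the estimate; the combinatorial form is exact for sampling k of the n generations without replacement, which is why leaderboards such as SWE-bench and LiveCodeBench report it this way.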
AI "surges" in the auto industry: the "wheeled intelligent lifeform" is coming
Hua Xia Shi Bao· 2025-09-25 07:58
Reported from Beijing by Liu Kai and Yu Jianping (chinatimes.net.cn). Perhaps in the near future, cars will no longer merely respond to user commands, but will proactively converse with users and help solve their problems, even planning the most suitable route and setting the most comfortable cabin temperature by the time the driver grips the steering wheel. This is not a far-fetched sci-fi scene, but a real picture being carefully drawn by artificial intelligence. Liu Chi, director of Haier Group and chairman and CEO of 汽车之家 (Autohome), said: "The auto industry faces the dual transformation of rapidly evolving technical routes and a profound reshaping of the industry landscape; the opportunities are unprecedented, and the challenges are not to be underestimated." At this technology event themed "Hi·Future", the reporter's strongest impression was that the focus of discussion in the auto industry has completely shifted. Wang Xia, chairman of the Automotive Sub-Council of the China Council for the Promotion of International Trade, argued that the industry should escape the "involution" spiral of hardware specs and price wars and focus on a grander question: how to evolve the car from a cold machine into a "wheeled intelligent lifeform" that can think, learn, and cooperate. Wang Xia believes the car of the future will no longer be an information island but an active node in the smart-city transportation network, able to "converse" in real time with roads, the cloud, and other vehicles, jointly weaving a safe, efficient, and green mobility network. As an example, he said your car could receive the signal-timing information of the traffic light at the intersection ahead in advance and smoothly adjust its speed, achieving a "green wave" pass-through ...
Zhou Hongyi: language is what matters most; master language and everything else follows
Xin Lang Ke Ji· 2025-09-24 05:09
Core Insights
- The conversation between Luo Yonghao and Zhou Hongyi emphasizes the importance of language in understanding and developing world models in artificial intelligence [1]
- Zhou Hongyi critiques the focus on world models by figures such as Meta's Yann LeCun and Fei-Fei Li, arguing that the key to progress in AI lies in comprehending language [1]
- The recent launch of Google's "nano banana" product showcases graphic understanding that goes beyond mere visual perception by integrating extensive knowledge [1]
Summary by Categories
Language and AI Development
- Zhou Hongyi asserts that language is crucial for communication, knowledge transfer, logical reasoning, and describing the world, all of which are essential for building effective world models [1]
- He attributes the lack of progress in AI world models to a failure to grasp the significance of language, which serves as a key to human knowledge and reasoning [1]
Technological Advancements
- Google's "nano banana" is highlighted as a significant breakthrough, demonstrating enhanced graphic understanding that integrates knowledge beyond visual capabilities [1]
- Advances across music, video, and visual models are likewise linked to breakthroughs in language comprehension [1]