World Model
The First Code World Model Ignites the AI Community and Can Teach Agents "Real Reasoning"; Meta Open-Sources It
具身智能之心· 2025-09-26 00:04
Is the architecture of large models about to evolve for good? Since last night, the AI community has been poring over a curious new species: the Code World Model (CWM). The first major research release from Meta's reorganized AI division is a world model built for writing code. To push code understanding beyond what static code corpora can teach, the Meta FAIR CodeGen team mid-trained the model on a large volume of observation-action trajectories collected from a Python interpreter and agentic Docker environments, then ran large-scale multi-task reasoning reinforcement learning (RL) in verifiable coding, math, and multi-turn software-engineering environments. To support further research on code world modeling, Meta has released model checkpoints from the mid-training, SFT, and RL stages. The idea departs from the "traditional" large language model (LLM) playbook: when humans plan, we imagine in our heads the outcomes different actions might bring; when we reason about code, we mentally simulate part of its ...
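To make the mid-training data concrete: below is a minimal sketch of how observation-action trajectories can be harvested from a live Python interpreter, using the standard-library sys.settrace hook. The record schema and the traced gcd function are illustrative assumptions, not Meta's actual training format.

```python
# Hypothetical sketch: harvesting (action, observation) pairs from the
# Python interpreter via sys.settrace. Not Meta's actual data format.
import sys
from typing import Any

trajectory: list[dict[str, Any]] = []

def tracer(frame, event, arg):
    # For every executed line, record the source position as the "action"
    # and the local-variable state the interpreter exposes as the "observation".
    if event == "line":
        trajectory.append({
            "action": f"execute line {frame.f_lineno}",
            "observation": dict(frame.f_locals),
        })
    return tracer  # returning the tracer keeps line-level tracing active

def gcd(a: int, b: int) -> int:
    while b:
        a, b = b, a % b
    return a

sys.settrace(tracer)
gcd(48, 18)
sys.settrace(None)

for step in trajectory:
    print(step)
```

A model mid-trained on millions of such traces sees not just source text but how interpreter state evolves as each line runs, which is the "world" a code world model learns to predict.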
Humanoid Robot Tour Takeaways: Market Outlook, Components and Embodied AI
2025-09-18 13:09
Summary of Conference Call Notes on Greater China Industrials (Humanoid Robots and Autonomous Driving)

Industry Overview
- The humanoid robot and autonomous driving (AD) sectors in China are expected to expand rapidly over the next decade, with significant growth anticipated in factory settings within 2-3 years and further opportunities in commercial and household applications over the longer term [1]
- The current bill of materials (BOM) cost for a fully functional humanoid robot is approximately US$50-60k, with rapid cost reductions expected in the next five years from improved product design and economies of scale [1]
- Stricter regulations in the AD sector are anticipated to create more opportunities for AD components, particularly LiDAR, which will benefit from new long-distance object detection requirements [1]

Key Players and Developments
- Dobot: a leading global collaborative robot (COBOT) brand, with 6-axis COBOT sales up 47% year over year in the first half of 2025, indicating market-share gains [8]. The company has entered the humanoid robot market, launching its first prototype in early 2025 and planning deployment in manufacturing and business scenarios [9]
- RoboSense: focusing on its new EMX LiDAR products, which offer superior precision and detection distance versus competitors, with expected shipments of 600-700k units in 2025 and 1.5 million units in 2026 [10]. The company is also exploring the lawn mower, unmanned delivery, and robotaxi industries and has established significant partnerships [11]
- Zhaowei Machinery & Electronics: has launched new dexterous-hand models for humanoid robots and targets a 10-15% global market share in this segment [12][13]. The dexterous hand is estimated to account for 20-30% of a humanoid robot's total BOM cost [13]
- Googol Technology: specializes in high-end control systems for advanced manufacturing and sees strong growth potential in humanoid robots thanks to its expertise in multi-degree-of-freedom (DoF) control [14][15]
- Minieye: making progress with its smart driving solutions, including iPilot and iRobo, and anticipates strong growth in the penetration of front-view camera modules and driver monitoring systems under new safety regulations [16][17]
- Leju Robotics: targets delivery of over 1,000 robots in 2025, focusing on the stability and durability needed for large-scale applications [18]
- Orbbec: a leading player in robot vision systems, holding over 70% market share in 3D vision systems for service robots in China [21][22]
- UBTECH: aims to ship 500 humanoid robots in 2025 and 2,000-3,000 units in 2026, with BOM cost reductions expected in the coming years [23][24]
- LK Tech: focusing on magnesium alloy technology for humanoid robots, which offers lightweighting among other advantages, and has signed cooperation agreements for R&D projects [25][26]

Technology Insights
- The competition between VLA (Vision-Language-Action) and world-model approaches to embodied AI is highlighted, with data availability the key bottleneck [3]
- Humanoid robot vision systems are evolving, with depth cameras becoming the mainstream choice for enhanced sensing and navigation [22]

Market Outlook
- The humanoid robot market is expected to grow significantly, with projections of 3 million units shipped by 2030, creating substantial opportunities for component suppliers [13]
- The average selling price (ASP) of humanoid robots is expected to decline to approximately RMB150k (~US$20k) by 2026-2028 on scale effects [20]

Conclusion
- The humanoid robot and AD sectors in Greater China are poised for significant growth, driven by technological advances, regulatory changes, and increasing market demand. Key players are actively innovating and expanding their product offerings to capture share in this rapidly evolving landscape.
X @Demis Hassabis
Demis Hassabis· 2025-08-24 02:15
AI Development & Innovation
- AI can now be trained within another AI, marking a significant advance in AI training methodology [1]
- The world model, Genie 3, can imagine and generate new worlds dynamically, showcasing advanced simulation capability [1]
- An embodied agent, SIMA, can autonomously navigate these AI-generated environments, demonstrating progress in embodied intelligence [1]
- The entire environment-to-action loop is now generated by AI, highlighting the potential for fully AI-driven training simulations, as the sketch below illustrates [1]
- The industry anticipates world simulators for training general embodied intelligence, suggesting future research directions [1]
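To make the "environment-to-action loop generated by AI" concrete, here is a schematic sketch in which one model stands in for the world and another for the agent. WorldModel and Agent are hypothetical stubs, not the Genie 3 or SIMA APIs.

```python
# Schematic "AI trained inside AI" loop: a generative model plays the
# environment, a policy model plays the agent. Both classes are stand-ins.
import random

class WorldModel:
    """Stand-in for a generative world model mapping (history, action) -> next observation."""
    def reset(self, prompt: str) -> str:
        self.history = [prompt]
        return prompt

    def step(self, action: str) -> str:
        obs = f"world state after {action!r} (step {len(self.history)})"
        self.history.append(obs)  # condition future frames on the full trajectory
        return obs

class Agent:
    """Stand-in for an embodied agent policy mapping observations to actions."""
    def act(self, obs: str) -> str:
        return random.choice(["move forward", "turn left", "turn right"])

world, agent = WorldModel(), Agent()
obs = world.reset("a sunlit warehouse with scattered crates")
for _ in range(5):
    action = agent.act(obs)   # the agent model chooses an action
    obs = world.step(action)  # the world model imagines the consequence
    print(action, "->", obs)
```

Once both sides of this loop are learned models, training data for the agent can be generated on demand instead of collected from the physical world.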
Another Wave of Autonomous Driving Papers Accepted at ICCV 2025, and We Noticed Some Shifting Trends...
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article surveys the latest trends and research directions in autonomous driving, highlighting the integration of multimodal large models and vision-language action generation as key focus areas for both academia and industry [2][5]

Group 1: Research Directions
- The research community is concentrating on several key areas, including combining MoE (Mixture of Experts) with autonomous driving, benchmark development, and trajectory generation with diffusion models [2]
- Closed-loop simulation and world models are emerging as critical needs, driven by the limitations of real-world open-loop testing; the approach aims to cut costs and speed up model iteration (see the sketch at the end of this summary) [5]
- There is notable emphasis on performance gains in object detection and OCC (3D occupancy prediction), with many ongoing projects targeting specific pain points in these areas [5]

Group 2: Notable Projects and Publications
- "ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation" is a significant project from Huazhong University of Science and Technology and Xiaomi, focusing on integrating vision and language for action generation [5]
- "All-in-One Large Multimodal Model for Autonomous Driving" from Sun Yat-sen University and Meituan contributes to comprehensive models for autonomous driving [6]
- "MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding" from Chongqing University aims to deepen understanding of driving scenarios through multimodal analysis [8]

Group 3: Simulation and Reconstruction
- "Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images" from TUM focuses on advanced reconstruction techniques [14]
- "CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving" from Fraunhofer IVI and TU Munich addresses dynamic scene reconstruction [16]

Group 4: Trajectory Prediction and World Models
- "Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics" from Hong Kong University of Science and Technology and Didi emphasizes trajectory prediction [29]
- "World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model" from the Chinese Academy of Sciences develops a comprehensive world model for autonomous driving [32]
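To illustrate why the field is moving from open-loop testing to closed-loop simulation, here is a toy sketch: in open-loop replay each step is scored from the logged state, while in a closed-loop rollout the planner's own actions feed back, so deviation compounds the way it would on the road. The 1-D "vehicle", the controller, and the logged positions are illustrative assumptions.

```python
# Toy contrast between open-loop replay and closed-loop rollout.
def planner(position: float, target: float) -> float:
    """Toy controller: close 30% of the remaining gap to the target lane."""
    return 0.3 * (target - position)

target = 1.0
log = [0.0, 0.4, 0.7, 0.9]  # hypothetical logged expert positions

# Open-loop: each step restarts from the *logged* state, so per-step errors stay small.
open_loop_errors = [
    abs(pos + planner(pos, target) - nxt) for pos, nxt in zip(log, log[1:])
]

# Closed-loop: the planner's own output becomes the next state, so deviation compounds.
state, closed_loop_errors = log[0], []
for nxt in log[1:]:
    state = state + planner(state, target)  # the simulator feeds the action back
    closed_loop_errors.append(abs(state - nxt))

print("open-loop errors:  ", [round(e, 3) for e in open_loop_errors])
print("closed-loop errors:", [round(e, 3) for e in closed_loop_errors])
```

Open-loop scores can look flattering because the log keeps correcting the planner; a world model makes the honest, compounding version of this loop cheap to run.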
DeepMind Scientists Demystify Genie 3: How an Autoregressive Architecture Lets AI Build Entire Worlds | Jinqiu Select
锦秋集· 2025-08-06 09:07
Core Viewpoint
- Google DeepMind has introduced Genie 3, a general world model capable of generating highly interactive 3D environments from text prompts or images, supporting real-time interaction and dynamic modification [1][2]

Group 1: Breakthrough Technology
- Genie 3 is described as a "paradigm-shifting" AI technology that could unlock a trillion-dollar commercial landscape and potentially become a "killer application" for virtual reality (VR) [9]
- The technology integrates features of traditional game engines, physics simulators, and video generation models into a real-time interactive world model [9]

Group 2: Evolution of World Models
- Virtual-world construction has evolved from hand-coded methods, exemplified by the 1996 Quake engine, to AI-generated models that learn from vast amounts of real-world video data [10]
- The ultimate goal is to generate any desired interactive world from a simple text prompt, providing diverse environments for AI training [10]

Group 3: Genie Iteration Journey
- The initial version of Genie was trained on 30,000 hours of 2D platform game footage, demonstrating an early grasp of the physical world [11]
- Genie 2 made the leap to 3D with near real-time performance and improved visual fidelity, simulating real-world lighting effects [12]
- Genie 3 pushes further with 720p resolution, enabling immersive experiences and real-time interaction [13]

Group 4: Key Features
- Genie 3 shifts the primary input from images to text prompts, allowing greater creative flexibility [15]
- It supports diverse environments, long-horizon interactions, and prompt-controlled world events, crucial for simulating rare occurrences in scenarios like autonomous driving (see the sketch after this summary) [15]

Group 5: Technical Insights
- Genie 3 maintains world consistency as an emergent property of its architecture, generating each frame while referencing previous events [16]
- This causal generation method aligns with real-world time flow, strengthening the model's ability to simulate complex environments [16]

Group 6: Applications and Future Implications
- Genie 3 is positioned as a platform for training embodied agents, potentially enabling breakthrough strategies in AI development [17]
- It allows low-cost, safe simulation of varied scenarios, addressing the scarcity of real-world training data [17]

Group 7: Creativity and Human Collaboration
- DeepMind scientists argue that Genie 3's reliance on high-quality prompts amplifies human creativity, giving creators a powerful tool [19]
- The technology may herald a new form of interactive entertainment in which users collaboratively create and explore interconnected virtual worlds [19]

Group 8: Limitations and Challenges
- Genie 3 is still a research prototype with limitations, such as supporting only single-agent experiences, and it faces reliability issues [20]
- A cognitive gap remains in fully simulating human experience beyond the visual and auditory senses [20]

Group 9: Technical Specifications and Industry Impact
- Genie 3 runs on Google's TPU infrastructure, indicating substantial computational demands, with training data likely drawn from extensive video content [21]
- The technology is expected to reshape the creative industry by simplifying interactive graphics production, rather than simply replacing traditional game engines [22]

Group 10: Closing Remarks
- Genie 3 represents a significant advance in realistic world simulation, potentially bridging the long-standing "sim-to-real" gap in AI applications [23]
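As a rough sketch of the interaction surface described in Groups 4-5, the hypothetical PromptableWorld class below creates a world from a text prompt, steps it with actions while conditioning on the full generated history, and accepts mid-rollout event prompts. It is a stand-in to show the control flow, not the Genie 3 API.

```python
# Hypothetical interaction surface: prompt-created world, action stepping,
# and "promptable world events" injected mid-rollout.
class PromptableWorld:
    def __init__(self, prompt: str):
        self.frames = [f"frame 0: {prompt}"]

    def step(self, action: str) -> str:
        # Each new frame is conditioned on everything generated so far.
        frame = f"frame {len(self.frames)}: after {action!r}"
        self.frames.append(frame)
        return frame

    def inject_event(self, event: str) -> None:
        # Promptable world events let the user alter the world mid-rollout,
        # e.g. to create rare cases for autonomous-driving training.
        self.frames.append(f"frame {len(self.frames)}: event {event!r} occurs")

world = PromptableWorld("a rainy mountain road at dusk")
world.step("drive forward")
world.inject_event("a deer crosses the road")
print(world.step("brake hard"))
```

The value for training is in inject_event: rare, dangerous situations that almost never appear in logged data can be summoned on demand.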
Late at Night, OpenAI, Google, and Others Update Multiple Models
第一财经· 2025-08-06 07:17
Core Insights
- The article covers recent product launches from the major AI model companies, highlighting shifts in product strategy and advances in AI capability [3][11]

Group 1: OpenAI Developments
- OpenAI has released two new open-source models, gpt-oss-120b with 117 billion parameters and gpt-oss-20b with 21 billion parameters, both built on the MoE architecture (sketched below) [4][5]
- gpt-oss-120b can run on a single 80GB GPU, while gpt-oss-20b can run on consumer devices with 16GB of memory, enabling local deployment on laptops and smartphones [5][6]
- The new models are competitive in benchmark tests, with gpt-oss-120b scoring close to or above the closed-source o4-mini model [5][6]

Group 2: Anthropic's Strategy
- Anthropic has shifted to more frequent incremental updates, exemplified by Claude Opus 4.1, which improves on its predecessor in areas like coding and data analysis [6][7]
- In benchmark tests, Claude Opus 4.1 scored 74.5%, surpassing Opus 4's 72.5%, indicating stronger coding capability [7]

Group 3: Google's Innovations
- Google introduced Genie 3, its first world model to support real-time interaction, building on Genie 1 and 2 [8][9]
- Genie 3 can simulate complex environments and interactions, generating consistent visuals for several minutes, a marked improvement over Genie 2 [9][11]
- Despite these advances, Genie 3 still has limitations, such as a restricted action space and difficulty simulating multiple agents in shared environments [11]
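For readers unfamiliar with the MoE architecture the gpt-oss models reportedly use, here is a minimal sketch of top-k expert routing: a router scores all experts per token, but only the k highest-scoring ones actually compute, so the active compute per token is a small fraction of the total parameter count. The dimensions, expert count, and k below are illustrative assumptions, not the gpt-oss configuration.

```python
# Minimal top-k Mixture-of-Experts routing sketch (toy dimensions).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 8, 4, 2

experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router            # router scores, one per expert
    top = np.argsort(logits)[-k:]  # keep only the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()       # softmax over the selected experts
    # Only the selected experts run; the rest stay idle for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (8,)
```

This per-token sparsity is why a model with a very large total parameter count can still serve tokens with the compute budget of a much smaller dense model.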
X @Demis Hassabis
Demis Hassabis· 2025-08-05 15:21
Technology & Innovation
- Google DeepMind introduces Genie 3, a groundbreaking world model for creating interactive environments from text prompts [1]
- Genie 3 enables generation of playable environments from a single text prompt [1]
- The technology allows creation of diverse environments, from photorealistic landscapes to fantasy realms [1]

Potential Applications
- The generated videos are not just for viewing but can be explored interactively [1]
- The possibilities for interactive and playable environments are described as endless [1]
Google Genie 3 - The Most Advanced World Simulator Ever...
Matthew Berman· 2025-08-05 14:02
Model Overview
- Google announced Genie 3, a general-purpose world model for generating diverse interactive environments [1][8]
- Genie 3 allows real-time interaction with improved consistency and realism compared to Genie 2 [12]
- The model generates high-quality environments at 720p [3]

Technical Aspects
- For autoregressive generation, Genie 3 considers the entire previously generated trajectory, not just the previous frame (contrasted in the sketch below) [15]
- Consistency in Genie 3 is an emergent capability arising from training scale, not pre-programming [19]
- Genie 3 generates dynamic, rich worlds frame by frame from the world description and user actions, unlike methods that rely on an explicit 3D representation [20]

Potential Applications
- World models like Genie 3 can be used to train robots and agents [9]
- The technology has potential applications in video games, movies, and television [9]
- Google positions world models as a key step toward AGI, providing AI agents unlimited simulation environments for training [9][10]

Comparison with Previous Models
- Genie 3 shows significant improvements in consistency, detail, and generation length over Genie 2 [22][23]
- Genie 3 allows deeper world exploration than Genie 2 [23]

Interactive Features
- Users can prompt events in real time, adding elements to the scene [21]
- The model demonstrates realistic interactions, such as water parting around a jet ski and reflections in mirrors [6]
- The model can simulate actions like painting, with paint applied only when the brush touches the wall [29][30]
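The "entire previously generated trajectory" point is worth unpacking. Below is a schematic contrast between Markovian generation, which sees only the last frame, and full-trajectory autoregression, which can reference everything generated so far, and hence keep a wall painted many frames ago still painted when the camera returns. Both functions are stand-ins, not the actual model.

```python
# Schematic contrast: previous-frame-only vs full-trajectory conditioning.
def markov_next_frame(prev_frame: str, action: str) -> str:
    # Only the last frame is visible: anything that scrolled out of it is forgotten.
    return f"frame from ({prev_frame!r}, {action!r})"

def trajectory_next_frame(frames: list[str], actions: list[str]) -> str:
    # The whole history is visible, so details established long ago
    # can stay consistent when revisited.
    return f"frame from {len(frames)} past frames and {len(actions)} actions"

frames, actions = ["frame 0"], []
for step in range(3):
    actions.append(f"action {step}")
    frames.append(trajectory_next_frame(frames, actions))

print("full-trajectory:", frames[-1])
print("markov-only:    ", markov_next_frame(frames[-2], actions[-1]))
```

The trade-off is cost: conditioning on the full trajectory grows the context with every frame, which is why consistency at real-time speeds is treated as a notable engineering result.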
Jiang Shuqiang, Director of the CAAI Embodied Intelligence Technical Committee: The World Model Is a Key Basis for an Agent's Decision-Making
机器人圈· 2025-08-04 11:38
Core Viewpoint
- The core discussion revolves around embodied intelligence, emphasizing the intricate relationship among body, environment, and intelligence, and how these elements jointly enable intelligent systems [4]

Group 1: Embodied Intelligence
- Embodied intelligence is defined by three key elements, body, environment, and intelligence, which interact in complex ways to produce intelligent behavior [4]
- The structure and sensory capabilities of the body strongly shape how an agent perceives and interacts with the world, underscoring the importance of physical attributes such as height and limb structure [4]

Group 2: Large Models in Embodied Intelligence
- Training embodied large models requires integrating visual, linguistic, and behavioral data, demanding a unified approach to data, computing power, and algorithms [4]
- Data complexity is higher for embodied large models because training must cover multimodal information, including behavior, physical parameters, and tactile data [4]
- Challenges remain in the generalization of embodied large models in real physical spaces, particularly around data complexity and sensor differences [4]

Group 3: World Models
- World models serve as abstract representations of the real world, encompassing three-dimensional space, dynamic change, object relationships, and memory, which are crucial for understanding and predicting environmental states (see the sketch below) [5]
- The relationship between world models and large models, and their connection to three-dimensional space, remains an open area for exploration [5]
- Current research often relies on simulators to generate data, but aligning virtual environments with real-world physical parameters remains a major challenge [5]
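As a rough sketch of the ingredients the talk attributes to a world model (3D space, dynamic change, object relations, memory, and state prediction), the dataclasses below bundle them into one toy structure. The field choices and the constant-velocity predict() are illustrative assumptions, not a reference design.

```python
# Toy world-state structure: 3-D space, dynamics, relations, memory, prediction.
from dataclasses import dataclass, field

@dataclass
class ObjectState:
    name: str
    position: tuple[float, float, float]  # three-dimensional spatial state
    velocity: tuple[float, float, float]  # dynamic change

@dataclass
class WorldState:
    objects: dict[str, ObjectState]
    relations: list[tuple[str, str, str]]  # e.g. ("cup", "on", "table")
    memory: list["WorldState"] = field(default_factory=list)  # past states

    def predict(self, dt: float) -> "WorldState":
        """Naive constant-velocity prediction of the next environmental state."""
        nxt = {
            k: ObjectState(
                o.name,
                tuple(p + v * dt for p, v in zip(o.position, o.velocity)),
                o.velocity,
            )
            for k, o in self.objects.items()
        }
        return WorldState(nxt, self.relations, self.memory + [self])

cup = ObjectState("cup", (0.0, 0.0, 1.0), (0.1, 0.0, 0.0))
state = WorldState({"cup": cup}, [("cup", "on", "table")])
print(state.predict(1.0).objects["cup"].position)  # (0.1, 0.0, 1.0)
```

Even this toy version shows why such a state is a basis for decision-making: an agent can roll predict() forward under candidate actions and compare imagined outcomes before acting.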
Meta chief AI scientist Yann LeCun clarifies his role after the company hires another chief AI scientist
Business Insider· 2025-07-26 19:50
Core Insights
- Meta has appointed Shengjia Zhao, co-creator of ChatGPT and former lead scientist at OpenAI, as chief scientist of its Superintelligence Labs, a strategic move in the AI talent race [1][2]

Group 1: Leadership and Structure
- Shengjia Zhao will set the research agenda and scientific direction for Meta's Superintelligence Labs, working closely with CEO Mark Zuckerberg and Chief AI Officer Alexandr Wang [2]
- The formalization of Zhao's leadership role comes as Meta reports successful recruitment and team assembly [2]
- Yann LeCun, who has been at Meta since 2013 and serves as chief AI scientist of Meta's Fundamental AI Research (FAIR) lab, clarified that his role is unchanged by Zhao's appointment [3]

Group 2: Research Focus
- FAIR, established over a decade ago, focuses on advancing AI technology and produced the open-source large language model Llama, released in 2023 [8]
- The Superintelligence Labs will encompass FAIR and other teams, aiming to develop "personal superintelligence for everyone," as Zuckerberg put it [9]
- LeCun is currently focused on a new model type, known as a world model, which he believes could eventually replace large language models [8]

Group 3: Collaboration and Future Directions
- Zhao's expertise in pioneering new scaling paradigms is expected to guide the scientific direction of Meta's AI initiatives [10]
- LeCun expressed enthusiasm about collaborating with Zhao to better integrate new research into Meta's most advanced models [10]