World Model
Li Auto's Zhan Kun at ICCV'25 on World Models: From Data Closed-Loop to Training Closed-Loop (slide deck)
理想TOP2· 2025-10-28 15:18
Core Insights
- The article traces the evolution of Li Auto's autonomous driving program, emphasizing the transition from a data closed-loop to a training closed-loop, which focuses on real-world utility and measurable evaluation of progress [13][14]

Group 1: Data and Infrastructure
- The company has accumulated 1.5 billion kilometers of driving data, which is crucial for training autonomous systems [8]
- A closed-loop data system is in place, using over 200 data triggers to mine training clips of 15 to 45 seconds each [8]
- The data scaling law shows a steep increase in the number of clips used for training, with projections of up to 600 million clips by 2025 [10]

Group 2: Technology Stack
- The key technology stack for autonomous driving comprises regional-scale simulation, synthetic data, reinforcement learning, and multimodal generation [18]
- The focus is on raising simulation quality through advanced techniques such as scene reconstruction and traffic-agent modeling [18][19]
- Simulation is transitioning from reconstruction to generation, using diffusion models for improved scene generation [19]

Group 3: Training and Evaluation
- The talk emphasizes building a training closed-loop that integrates multiple models, including VLA (Vision-Language-Action) and reinforcement learning; a minimal sketch of such a loop follows this summary [15]
- The evaluation environment and reward systems are critical for assessing the performance of autonomous driving systems [14][35]
- Interactive agents are identified as a key challenge in the training closed-loop, requiring accurate feedback and strong generalization ability [38][40]

Group 4: Future Directions
- The company is pursuing several projects to strengthen both reconstruction and generation capabilities, with milestones set for 2024 and 2025 [21][24]
- Applications of generated data include scene editing, scene transfer, and scene generation, all essential for improving the realism of simulations [27][33]
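The training closed-loop described in Group 3 boils down to a policy acting in a simulated evaluation environment, collecting rewards, and being updated from the resulting rollout. Below is a minimal, self-contained sketch of that loop shape; `SimEnv`, `DrivingPolicy`, and the toy reward terms are illustrative assumptions, not Li Auto's actual components.

```python
# Minimal sketch of a training closed-loop: a policy acts in a simulated
# environment, receives rewards, and is updated from the collected rollout.
# SimEnv / DrivingPolicy are hypothetical stand-ins, not Li Auto's stack.
from dataclasses import dataclass
import random

@dataclass
class SimEnv:
    """Toy closed-loop evaluation environment (stand-in for a regional-scale sim)."""
    t: int = 0

    def reset(self):
        self.t = 0
        return {"ego_speed": 0.0, "gap_m": 30.0}            # toy observation

    def step(self, action):
        self.t += 1
        obs = {"ego_speed": action["accel"] * self.t,
               "gap_m": max(0.0, 30.0 - 0.5 * self.t)}
        reward = 1.0 - abs(action["accel"] - 0.3)            # toy comfort/progress term
        if obs["gap_m"] == 0.0:
            reward -= 10.0                                   # safety penalty on "collision"
        done = self.t >= 40 or obs["gap_m"] == 0.0
        return obs, reward, done

class DrivingPolicy:
    """Stand-in policy; in a real stack this would be a VLA model."""
    def __init__(self):
        self.accel_bias = 0.0

    def act(self, obs):
        return {"accel": 0.3 + self.accel_bias + random.gauss(0, 0.05)}

    def update(self, rollout):
        # Nudge the policy toward the best-rewarded action in the rollout
        # (placeholder for a real RL update such as PPO).
        best = max(rollout, key=lambda step: step[2])
        self.accel_bias += 0.1 * (best[1]["accel"] - 0.3 - self.accel_bias)

env, policy = SimEnv(), DrivingPolicy()
for episode in range(10):                                    # the training closed loop
    obs, done, rollout = env.reset(), False, []
    while not done:
        action = policy.act(obs)
        obs, reward, done = env.step(action)
        rollout.append((obs, action, reward))
    policy.update(rollout)
```

In a production stack the environment would be the regional-scale simulator from Group 2, and the update step a proper reinforcement learning algorithm rather than this toy nudge.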
Autonomous Driving Paper Express! VLA, World Models, Reinforcement Learning, Trajectory Planning, and More...
自动驾驶之心· 2025-10-18 04:00
Core Insights
- The article surveys recent advances in autonomous driving research, highlighting several new papers and their implications for the industry

Group 1: DriveVLA-W0
- The DriveVLA-W0 training paradigm improves the generalization ability and data scalability of VLA models by using world modeling to predict future images, achieving 93.0 PDMS and 86.1 EPDMS on the NAVSIM benchmarks; a sketch of this joint objective follows this digest [6][12]
- A lightweight Mixture-of-Experts (MoE) architecture reduces inference latency to 63.1% of the baseline VLA, meeting real-time deployment needs [6][12]
- The data scaling law amplification effect is validated, with performance improving significantly as data volume increases: a 28.8% reduction in ADE and a 15.9% decrease in collision rate when training on 70M frames [6][12]

Group 2: CoIRL-AD
- The CoIRL-AD framework combines imitation learning and reinforcement learning within a latent world model, achieving an 18% reduction in collision rate on the nuScenes dataset and a PDMS of 88.2 on the NAVSIM benchmark [13][16]
- The framework integrates RL into an end-to-end autonomous driving model, addressing offline RL's scene-expansion issues [13][16]
- A decoupled dual-policy architecture enables structured interaction between imitation learning and reinforcement learning, improving knowledge transfer [13][16]

Group 3: PAGS
- The Priority-Adaptive Gaussian Splatting (PAGS) framework achieves high-quality real-time 3D reconstruction in dynamic driving scenarios, with a PSNR of 34.63 and an SSIM of 0.933 on the Waymo dataset [23][29]
- PAGS incorporates semantic-guided pruning and regularization to balance reconstruction fidelity against computational cost [23][29]
- The framework renders at 353 FPS with a training time of only 1 hour 22 minutes, outperforming existing methods [23][29]

Group 4: Flow Planner
- The Flow Planner scores 90.43 on the nuPlan Val14 benchmark, the first learning-based method to surpass 90 without prior knowledge [34][40]
- It introduces fine-grained trajectory tokenization to strengthen local feature extraction while preserving motion continuity [34][40]
- The architecture employs adaptive layer normalization and scale-adaptive attention to filter redundant information and sharpen the extraction of key interactions [34][40]

Group 5: CymbaDiff
- The CymbaDiff model defines a new task, sketch-based 3D outdoor semantic scene generation, achieving an FID of 40.74 on the sketch-based SemanticKITTI dataset [44][47]
- It introduces SketchSem3D, a large-scale benchmark dataset for evaluating 3D semantic scene generation [44][47]
- The model employs a Cylinder Mamba diffusion mechanism to improve spatial coherence and local neighborhood relationships [44][47]

Group 6: DriveCritic
- The DriveCritic framework uses vision-language models for context-aware evaluation of autonomous driving, reaching 76.0% accuracy on human preference alignment tasks [55][58]
- It addresses the limitations of existing evaluation metrics by focusing on context sensitivity and human alignment in nuanced driving scenarios [55][58]
- The framework outperforms traditional metrics, offering a reliable path to human-aligned evaluation in autonomous driving [55][58]
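A recurring pattern across these papers, most explicit in DriveVLA-W0, is adding a world-modeling objective (predicting future observations) as an auxiliary loss next to the action loss, so one backbone learns both to act and to anticipate. The sketch below shows that joint-loss structure in PyTorch; the tiny networks, shapes, and the 0.5 loss weight are illustrative assumptions, not the papers' actual architectures.

```python
# Sketch of a DriveVLA-W0-style training signal: one shared backbone is
# trained jointly on (a) action prediction and (b) predicting a future
# camera frame. Modules, shapes, and the loss weight are illustrative.
import torch
import torch.nn as nn

class TinyDrivingModel(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(                      # shared visual backbone
            nn.Conv2d(3, 16, 4, stride=4), nn.ReLU(),
            nn.Conv2d(16, feat_dim, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.action_head = nn.Linear(feat_dim, 2)          # e.g. steer + accel
        self.world_head = nn.Sequential(                   # predicts the next frame
            nn.Linear(feat_dim, 3 * 32 * 32),
            nn.Unflatten(1, (3, 32, 32)))

    def forward(self, frame):
        z = self.encoder(frame)
        return self.action_head(z), self.world_head(z)

model = TinyDrivingModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

frame      = torch.randn(8, 3, 64, 64)    # current camera frame (dummy batch)
next_frame = torch.randn(8, 3, 32, 32)    # future-frame target (downscaled)
action_gt  = torch.randn(8, 2)            # expert action labels

opt.zero_grad()
action_pred, frame_pred = model(frame)
loss_action = nn.functional.mse_loss(action_pred, action_gt)
loss_world  = nn.functional.mse_loss(frame_pred, next_frame)  # world-model term
loss = loss_action + 0.5 * loss_world     # 0.5 is an assumed weighting
loss.backward()
opt.step()
```

The intuition reported in the paper is that the world-model term forces the representation to encode scene dynamics, which is what amplifies the gains from scaling up training data.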
X @Demis Hassabis
Demis Hassabis· 2025-10-09 21:44
Product Innovation
- Google DeepMind's Genie 3, a world model capable of generating interactive environments from text or image prompts, is recognized as one of TIME's Best Inventions of 2025 [1][2]
- Genie 3 enables the creation of entire playable worlds from a single image or text prompt [1]

Company Recognition
- Google DeepMind announces the TIME honor and expresses pride in the Genie 3 team [1]
Meta Open-Sources the First Code World Model, Igniting the AI Community: It Lets Agents Learn "True Reasoning"
具身智能之心· 2025-09-26 00:04
Core Viewpoint
- The article covers Meta's introduction of the Code World Model (CWM), a significant step in applying world-modeling techniques to improve code generation [2][5][31]

Group 1: Model Overview
- CWM is a 32-billion-parameter open-weight large language model (LLM) designed to advance world-model-based code-generation research [7][12]
- It features a dense, decoder-only architecture with a context length of up to 131k tokens, and performs strongly on general programming and mathematical tasks [8][9]

Group 2: Performance Metrics
- CWM posts notable benchmark scores: SWE-bench Verified (pass@1 65.8%), LiveCodeBench (68.6%), Math-500 (96.6%), and AIME 2024 (76.0%) [8][23]
- Against comparable models, CWM is competitive, particularly within the 30B-parameter range [9][30]

Group 3: Training Methodology
- The model was trained on extensive observation-action trajectories collected in a Python interpreter and an agent-based Docker environment, improving code understanding beyond static-code training; a minimal illustration of such traces follows this summary [12][22]
- Meta has released checkpoints from the mid-training, SFT, and reinforcement learning phases to support further research [13]

Group 4: Research Implications
- CWM serves as a robust testbed for exploring how world modeling can improve reasoning and planning in code generation [15][31]
- The research indicates that world models benefit agentic coding by enabling stepwise simulation of Python code execution and reasoning over those simulations [16][31]

Group 5: Future Directions
- Meta envisions the code world model bridging the gap between linguistic reasoning and executable semantics, with ongoing research needed to fully exploit its advantages across tasks [31]
- The model also aims to improve reinforcement learning: an agent already familiar with environmental dynamics can focus on learning which actions yield rewards [31]
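The observation-action trajectories described in Group 3 amount to recording how local program state evolves line by line as code executes. A minimal way to collect this kind of trace with the standard library's `sys.settrace` is sketched below; it illustrates the flavor of data the article describes, not Meta's actual pipeline.

```python
# Minimal sketch of collecting line-by-line execution traces of Python code,
# the kind of observation-action data the article says CWM is trained on.
# Standard library only; this is not Meta's actual data pipeline.
import sys

trace = []

def tracer(frame, event, arg):
    if event == "line" and frame.f_code.co_name == "bubble_pass":
        # Record (line number, snapshot of local variables) at each step.
        trace.append((frame.f_lineno, dict(frame.f_locals)))
    return tracer

def bubble_pass(xs):
    for i in range(len(xs) - 1):
        if xs[i] > xs[i + 1]:
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs

sys.settrace(tracer)
bubble_pass([3, 1, 2])
sys.settrace(None)

for lineno, local_vars in trace:
    print(lineno, local_vars)   # e.g. a line number with {'xs': [1, 3, 2], 'i': 0}
```

A model trained on millions of such (state, next-line, next-state) records can learn to simulate execution step by step, which is the "world model" framing the article emphasizes.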
Humanoid Robot Tour Takeaways: Market Outlook, Components and Embodied AI
2025-09-18 13:09
Summary of Conference Call Notes on Greater China Industrials (Humanoid Robots and Autonomous Driving)

Industry Overview
- The humanoid robot and autonomous driving (AD) sectors in China are expected to expand rapidly over the next decade, with significant growth in factory settings within 2-3 years and further long-term opportunities in commercial and household applications [1]
- The current bill of materials (BOM) cost of a fully functional humanoid robot is approximately US$50-60k, with rapid cost reductions expected over the next five years from improved product design and economies of scale [1]
- Stricter regulations in the AD sector are anticipated to create more opportunities for AD components, particularly LiDAR, which will benefit from new long-distance object-detection requirements [1]

Key Players and Developments

Dobot
- Dobot is a leading global collaborative robot (COBOT) brand, posting 47% year-over-year growth in 6-axis COBOT sales in the first half of 2025 and gaining market share [8]
- The company has entered the humanoid robot market, launching its first prototype in early 2025 and planning deployments in manufacturing and business scenarios [9]

RoboSense
- RoboSense is focusing on its new EMX LiDAR products, which offer superior precision and detection distance versus competitors, with shipments expected to reach 600-700k units in 2025 and 1.5 million units in 2026 [10]
- The company is also pursuing the lawn mower, unmanned delivery, and robotaxi markets, with significant partnerships already established [11]

Zhaowei Machinery & Electronics
- Zhaowei has launched new dexterous-hand models for humanoid robots and targets a 10-15% global market share in this segment [12][13]
- The dexterous hand is estimated to account for 20-30% of a humanoid robot's total BOM cost [13]

Googol Technology
- Googol Technology specializes in high-end control systems for advanced manufacturing and sees strong growth potential in humanoid robots thanks to its expertise in multi-degree-of-freedom (DoF) control [14][15]

Minieye
- Minieye is advancing its smart driving solutions, including iPilot and iRobo, and expects new safety regulations to drive significant growth in the penetration of front-view camera modules and driver monitoring systems [16][17]

Leju Robotics
- Leju targets delivery of over 1,000 robots in 2025, focusing on the stability and durability required for large-scale applications [18]

Orbbec
- Orbbec is a leading player in robot vision systems, holding over 70% market share in 3D vision systems for service robots in China [21][22]

UBTECH
- UBTECH aims to ship 500 humanoid robots in 2025 and 2,000-3,000 units in 2026, with BOM cost reductions expected in the coming years [23][24]

LK Tech
- LK Tech is focusing on magnesium-alloy technology for humanoid robots, which offers lightweighting and other advantages, and has signed cooperation agreements for R&D projects [25][26]

Technology Insights
- The competition between VLA (Vision-Language-Action) and world-model approaches to embodied AI is highlighted, with data availability the key bottleneck [3]
- Humanoid robot vision systems are evolving, with depth cameras becoming the mainstream choice for enhancing sensing and navigation [22]

Market Outlook
- The humanoid robot market is expected to grow significantly, with projections of 3 million units shipped by 2030, creating substantial opportunities for component suppliers [13]
- The average selling price (ASP) of humanoid robots is expected to decline to approximately RMB150k (~US$20k) by 2026-2028 on scale effects [20]

Conclusion
- The humanoid robot and AD sectors in Greater China are poised for significant growth, driven by technological advances, regulatory changes, and rising market demand. Key players are actively innovating and expanding their product offerings to capture share in this rapidly evolving landscape.
X @Demis Hassabis
Demis Hassabis· 2025-08-24 02:15
AI Development & Innovation
- AI can now be trained inside another AI, a significant advance in training methodology [1]
- The world model, Genie 3, can imagine and generate new worlds dynamically, showcasing its advanced simulation capabilities [1]
- An embodied agent, SIMA, can autonomously navigate these AI-generated environments, demonstrating progress in embodied intelligence [1]
- The entire environment-to-action loop is now generated by AI, pointing toward fully AI-driven training simulations; the sketch below shows the shape of this loop [1]
- The industry anticipates world simulators for training general embodied intelligence, suggesting future research directions [1]
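The "environment-to-action loop" described here, a world model generating each observation while an embodied agent chooses actions, reduces to a simple alternation. The sketch below shows that loop shape with hypothetical `WorldModel` and `Agent` interfaces; Genie 3 and SIMA expose no such public API, so everything here is an illustrative stand-in.

```python
# Shape of a fully AI-generated training loop: a world model generates the
# next observation conditioned on the agent's action, and the agent acts on
# it. WorldModel / Agent are hypothetical stand-ins, not Genie 3 / SIMA APIs.
import random

class WorldModel:
    """Stand-in for a generative world model (e.g. Genie 3)."""
    def generate_initial_frame(self, prompt: str) -> str:
        return f"frame<{prompt}>"

    def next_frame(self, frame: str, action: str) -> str:
        return f"{frame}|{action}"        # a real model would render pixels

class Agent:
    """Stand-in for an embodied agent (e.g. SIMA)."""
    def act(self, frame: str) -> str:
        return random.choice(["forward", "left", "right"])

world, agent = WorldModel(), Agent()
frame = world.generate_initial_frame("a mossy canyon at dawn")
for step in range(5):                          # environment-to-action loop
    action = agent.act(frame)                  # agent decides from the generated frame
    frame = world.next_frame(frame, action)    # world model generates the consequence
    print(step, action)
```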
Many More Autonomous Driving Papers Were Accepted to ICCV 2025, and We Noticed Some New Trends...
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article reviews the latest trends and research directions in autonomous driving, highlighting the integration of multimodal large models and vision-language action generation as key areas of focus for both academia and industry [2][5]

Group 1: Research Directions
- The research community is concentrating on several key areas, including combining MoE (Mixture of Experts) with autonomous driving, benchmark development, and trajectory generation with diffusion models; a minimal sketch of diffusion-based trajectory sampling follows this digest [2]
- Closed-loop simulation and world models are emerging as critical needs in autonomous driving, driven by the limitations of real-world open-loop testing; this approach aims to reduce costs and improve model-iteration efficiency [5]
- There is notable emphasis on performance improvements in object detection and OCC (occupancy prediction), with many ongoing projects targeting specific pain points and challenges in these areas [5]

Group 2: Notable Projects and Publications
- "ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation" is a significant project from Huazhong University of Science and Technology and Xiaomi, integrating vision and language for action generation in autonomous driving [5]
- "All-in-One Large Multimodal Model for Autonomous Driving," from Sun Yat-sen University and Meituan, contributes a comprehensive model for autonomous driving [6]
- "MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding," from Chongqing University, aims to deepen understanding of driving scenarios through multimodal analysis [8]

Group 3: Simulation and Reconstruction
- "Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images," from TUM, advances reconstruction techniques for autonomous driving [14]
- "CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving," from Fraunhofer IVI and TU Munich, addresses dynamic scene reconstruction [16]

Group 4: Trajectory Prediction and World Models
- "Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics," from Hong Kong University of Science and Technology and Didi, underscores the importance of trajectory prediction in autonomous driving [29]
- "World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model," from the Chinese Academy of Sciences, develops a comprehensive world model for autonomous driving [32]
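One of the trends named in Group 1, trajectory generation with diffusion models, treats a future trajectory as a tensor of waypoints that is produced by iteratively denoising Gaussian noise. Below is a minimal DDPM-style reverse-sampling loop over 2D waypoints; the toy denoiser is untrained and the schedule, step count, and shapes are illustrative assumptions, not any specific paper's configuration.

```python
# Minimal DDPM-style sketch of diffusion trajectory generation: a trajectory
# of 2D waypoints is sampled by iteratively denoising Gaussian noise. The
# toy denoiser is untrained; schedule and shapes are illustrative only.
import torch
import torch.nn as nn

T, n_waypoints = 50, 16                          # diffusion steps, waypoints
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

denoiser = nn.Sequential(                        # predicts the noise eps(x_t, t)
    nn.Linear(n_waypoints * 2 + 1, 128), nn.ReLU(),
    nn.Linear(128, n_waypoints * 2))

@torch.no_grad()
def sample_trajectory():
    x = torch.randn(1, n_waypoints * 2)          # start from pure noise
    for t in reversed(range(T)):
        t_in = torch.full((1, 1), t / T)         # normalized timestep input
        eps = denoiser(torch.cat([x, t_in], dim=1))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # one reverse diffusion step
    return x.view(n_waypoints, 2)                # (x, y) waypoints

waypoints = sample_trajectory()
print(waypoints.shape)                           # torch.Size([16, 2])
```

In the planning papers this trend refers to, the denoiser is additionally conditioned on scene context (map, agents, ego history), so the sampled waypoints respect the driving scene rather than being pure noise as in this toy.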
DeepMind Scientists Demystify Genie 3: How an Autoregressive Architecture Lets AI Construct Entire Worlds | Jinqiu Select
锦秋集· 2025-08-06 09:07
Core Viewpoint
- Google DeepMind has introduced Genie 3, a revolutionary general world model capable of generating highly interactive 3D environments from text prompts or images, supporting real-time interaction and dynamic modification [1][2]

Group 1: Breakthrough Technology
- Genie 3 is described as a "paradigm-shifting" AI technology that could unlock a trillion-dollar commercial landscape and potentially become a "killer application" for virtual reality (VR) [9]
- The technology merges features of traditional game engines, physics simulators, and video-generation models into a real-time interactive world model [9]

Group 2: Evolution of World Models
- Virtual-world construction has evolved from hand-coded engines, exemplified by the 1996 Quake engine, to AI-generated models that learn from vast amounts of real-world video data [10]
- The ultimate goal is to generate any desired interactive world from a simple text prompt, providing diverse environments for AI training [10]

Group 3: Genie Iteration Journey
- The initial version of Genie was trained on 30,000 hours of 2D platform-game footage, demonstrating an early understanding of the physical world [11]
- Genie 2 made the leap to 3D with near-real-time performance and improved visual fidelity, simulating real-world lighting effects [12]
- Genie 3 pushes the technology further with 720p resolution, enabling immersive experiences and real-time interaction [13]

Group 4: Key Features
- Genie 3 shifts the primary input from images to text prompts, allowing greater creative flexibility [15]
- It supports diverse environments, long-horizon interaction, and prompt-controlled world events, crucial for simulating rare occurrences in scenarios such as autonomous driving [15]

Group 5: Technical Insights
- Genie 3 maintains world consistency as an emergent property of its architecture: each frame is generated with reference to what came before; a minimal sketch of this autoregressive rollout follows this summary [16]
- This causal generation method aligns with the real world's flow of time, strengthening the model's ability to simulate complex environments [16]

Group 6: Applications and Future Implications
- Genie 3 is positioned as a platform for training embodied agents, potentially enabling breakthrough strategies in AI development [17]
- It allows low-cost, safe simulation of varied scenarios, easing the scarcity of real-world training data [17]

Group 7: Creativity and Human Collaboration
- DeepMind scientists argue that Genie 3's reliance on high-quality prompts amplifies human creativity, giving creators a powerful new tool [19]
- The technology may herald a new form of interactive entertainment in which users collaboratively create and explore interconnected virtual worlds [19]

Group 8: Limitations and Challenges
- Genie 3 remains a research prototype with limitations, such as supporting only single-agent experiences and facing reliability issues [20]
- A cognitive gap remains in fully simulating human experience beyond sight and sound [20]

Group 9: Technical Specifications and Industry Impact
- Genie 3 runs on Google's TPU infrastructure, indicating significant computational demands, with training data likely drawn from extensive video content [21]
- The technology is expected to reshape the creative industry by simplifying the production of interactive graphics, complementing rather than simply replacing traditional game engines [22]

Group 10: Closing Remarks
- Genie 3 represents a significant advance in realistic world simulation, potentially closing the long-standing "sim-to-real" gap in AI applications [23]
Overnight, OpenAI, Google, and Others Update Multiple Models
第一财经· 2025-08-06 07:17
Core Insights
- The article covers recent product launches from the major AI model companies, highlighting shifts in product strategy and advances in AI capabilities [3][11]

Group 1: OpenAI Developments
- OpenAI has released two new open-weight models, gpt-oss-120b with 117 billion parameters and gpt-oss-20b with 21 billion parameters, both built on an MoE architecture; a minimal sketch of top-k MoE routing follows this summary [4][5]
- gpt-oss-120b can run on a single 80GB GPU, while gpt-oss-20b can run on consumer devices with 16GB of memory, enabling local deployment on laptops and smartphones [5][6]
- The new models perform competitively on benchmarks, with gpt-oss-120b scoring close to or above the closed-source o4-mini [5][6]

Group 2: Anthropic's Strategy
- Anthropic has shifted to more frequent incremental updates, exemplified by Claude Opus 4.1, which improves on its predecessor in areas such as coding and data analysis [6][7]
- On benchmarks, Claude Opus 4.1 scored 74.5%, surpassing Opus 4's 72.5%, indicating stronger coding capabilities [7]

Group 3: Google's Innovations
- Google introduced Genie 3, its first world model to support real-time interaction, building on the earlier Genie 1 and 2 [8][9]
- Genie 3 can simulate complex environments and interactions, generating consistent visuals for several minutes at a time, a significant improvement over Genie 2 [9][11]
- Despite these advances, Genie 3 still has limitations, such as a restricted action space and difficulty simulating multiple agents in shared environments [11]
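The MoE architecture behind both gpt-oss models routes each token to only a few expert sub-networks, which is why active compute per token is a small fraction of total parameters and a 117B model fits on a single 80GB GPU. A minimal top-k MoE layer in PyTorch is sketched below; the sizes, expert count, and routing details are illustrative, not gpt-oss internals.

```python
# Minimal top-k Mixture-of-Experts layer: a router scores experts per token
# and only the top-k experts run, so active compute is a fraction of total
# parameters. Sizes and routing here are illustrative, not gpt-oss internals.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)      # per-token gating scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)])

    def forward(self, x):                                # x: (tokens, d_model)
        scores = self.router(x)                          # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(1)
                    out[mask] += w * expert(x[mask])     # weighted expert output
        return out

layer = MoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)    # torch.Size([10, 64])
```

With 8 experts and top-2 routing, each token touches only a quarter of the expert parameters per layer; production MoE kernels batch this routing far more efficiently than the explicit loop above.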
X @Demis Hassabis
Demis Hassabis· 2025-08-05 15:21
Technology & Innovation
- Google DeepMind introduces Genie 3, a groundbreaking world model for creating interactive environments from text prompts [1]
- Genie 3 can generate a playable environment from a single text prompt [1]
- The technology supports the creation of diverse environments, from photorealistic landscapes to fantasy realms [1]

Potential Applications
- The generated videos are not just for viewing; they can be explored interactively [1]
- The possibilities for interactive, playable environments are described as endless [1]