OmniNWM
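Jin Xin of Eastern Institute of Technology: How to Find a Unified "Spatial Language" for Autonomous Driving and Robotics | GAIR 2025
Leiphone (雷峰网) · 2025-12-14 06:27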
Core Viewpoint - The article discusses the emerging paradigm of "world models" in AI, emphasizing the importance of integrating physical rules with data-driven methods to enhance machine intelligence and its applications in industries such as manufacturing and autonomous driving [2][4][5].
Group 1: Researcher and Team Insights
- Researcher Jin Xin of the Eastern Institute of Technology, Ningbo, focuses on "embodied world models" for decision-making, collaborating with institutions such as Shanghai Jiao Tong University and Tsinghua University [3].
- Jin's team is exploring a "hybrid" approach to building world models, combining explicit physical rules with data-driven methods to address complex phenomena [4].
Group 2: Applications and Industry Collaboration
- The team is applying its methods in industrial manufacturing, collaborating with leading companies in Ningbo to validate its "factory world model" [5].
- Advances in world models are seen as a significant technological leap, with applications in autonomous driving, robotics, AIGC, AR, and VR [9].
Group 3: Spatial Intelligence Framework
- The framework for spatial intelligence is divided into three parts: spatial perception; spatial interactivity; and spatial understanding, generalization, and generation [10][12][13][14].
- The process involves a "modeling-training" loop in which AI agents are trained in simulated environments, leading to continuous optimization [18].
Group 4: Specific Projects and Innovations
- The "UniScene" project focuses on generating driving scenarios, addressing the limitations of traditional data collection methods in the automotive industry [20][22].
- The "OmniNWM" project introduces a closed-loop mechanism for planning and generating future states based on trajectory inputs [42][44].
- The "InterVLA" dataset aims to provide first-person perspective data for robots, enhancing their interaction capabilities [46][57].
Group 5: Challenges and Future Directions
- The article highlights the challenges in creating realistic world models, particularly in embedding complex physical rules and ensuring data quality [98][104].
- The research emphasizes a mixed approach, combining knowledge-based constraints with data-driven learning to improve the understanding of physical laws in AI models [106].
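One common way to combine knowledge-based constraints with data-driven learning is to add a physics-residual penalty to an ordinary data-fitting loss. The sketch below is only an illustration of that general idea and is not taken from the article or from Jin's team: the function name `hybrid_loss`, the free-fall example, and the `weight` parameter are all assumptions chosen for clarity.

```python
def hybrid_loss(pred, observed, dt, g=-9.81, weight=0.1):
    """Toy hybrid loss for a predicted height trajectory (hypothetical example).

    pred, observed: lists of heights sampled at a uniform time step dt
                    (at least three samples).
    The data term fits the observations; the physics term penalises
    predicted accelerations that deviate from constant gravity g.
    """
    # Data-driven term: mean squared error against observations.
    data_term = sum((p - o) ** 2 for p, o in zip(pred, observed)) / len(pred)

    # Knowledge-based term: the finite-difference acceleration should equal g.
    residuals = [
        (pred[i + 1] - 2.0 * pred[i] + pred[i - 1]) / dt ** 2 - g
        for i in range(1, len(pred) - 1)
    ]
    physics_term = sum(r * r for r in residuals) / len(residuals)

    return data_term + weight * physics_term


# Usage: a physically consistent free-fall trajectory versus a constant one.
dt = 0.1
observed = [10.0 + 0.5 * -9.81 * (i * dt) ** 2 for i in range(10)]
consistent = list(observed)   # fits the data and obeys the physics
constant = [10.0] * 10        # ignores gravity entirely
print(hybrid_loss(consistent, observed, dt) < hybrid_loss(constant, observed, dt))
```

The `weight` value sets how strongly the physical rule constrains the fit; in this toy setup the rule is a single known equation, whereas the article describes far more complex phenomena for which such rules are hard to write down.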
SJTU's OmniNWM: An "Omniscient" World Model Pushing the Limits of 3D Driving Simulation
Heart of Autonomous Driving (自动驾驶之心) · 2025-10-24 16:03
Core Insights - The article discusses the OmniNWM research, which proposes a panoramic, multi-modal driving navigation world model that significantly surpasses existing state-of-the-art (SOTA) models in terms of generation quality, control precision, and long-term stability, setting a new benchmark for simulation training and closed-loop evaluation in autonomous driving [2][58].
Group 1: OmniNWM Features
- OmniNWM integrates state generation, action control, and reward evaluation into a unified framework, addressing the limitations of existing models that rely on single-modal RGB video and sparse action encoding [10][11].
- The model utilizes a Panoramic Diffusion Transformer (PDiT) to jointly generate pixel-aligned outputs across four modalities: RGB, semantic, depth, and 3D occupancy [12][11].
- OmniNWM introduces a normalized Plücker Ray-map for action control, allowing for pixel-level guidance and improved generalization across out-of-distribution (OOD) trajectories [18][22].
Group 2: Challenges and Solutions
- The article identifies three core challenges in current autonomous driving world models: limitations in state representation, ambiguity in action control, and lack of integrated reward mechanisms [8][10].
- OmniNWM's approach to state generation overcomes the limitations of existing models by capturing the full geometric and semantic complexity of real-world driving scenarios [10][11].
- The model's reward system is based on the generated 3D occupancy, providing a dense and integrated reward function that enhances the evaluation of driving behavior [35][36].
Group 3: Performance Metrics
- OmniNWM supports the generation of long video sequences, exceeding the ground truth length with stable outputs, demonstrating its capability to generate over 321 frames [31][29].
- The model achieves significant improvements in video generation quality, outperforming existing models in metrics such as FID and FVD [51][52].
- The integration of a Vision-Language-Action (VLA) planner enhances the model's ability to understand multi-modal environments and output high-precision trajectories [43][50].
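To make the "Plücker Ray-map" idea concrete, the sketch below builds a per-pixel map of 6-D Plücker ray coordinates (unit direction d and moment o × d) for a pinhole camera at a given pose. It is a minimal illustration of the standard Plücker parameterisation, not OmniNWM's implementation: the function name `plucker_ray_map`, the pinhole intrinsics, and the pixel-centre convention are assumptions, and the summary does not say how the paper normalizes the map, so only the ray directions are normalized here.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_ray_map(width, height, fx, fy, cx, cy, R, t):
    """Return a height x width grid of 6-tuples (d, o x d).

    fx, fy, cx, cy: pinhole intrinsics in pixels (assumed camera model).
    R: 3x3 camera-to-world rotation as nested lists; t: camera centre in world.
    """
    ray_map = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel centre into a camera-frame direction.
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            # Rotate into the world frame and normalise to unit length.
            d_world = tuple(sum(R[i][j] * d_cam[j] for j in range(3))
                            for i in range(3))
            norm = math.sqrt(sum(x * x for x in d_world))
            d = tuple(x / norm for x in d_world)
            # The moment o x d encodes where the ray sits in space.
            m = cross(t, d)
            row.append(d + m)
        ray_map.append(row)
    return ray_map


# Usage: a 3x3 image, identity rotation, camera moved 1 unit along x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rays = plucker_ray_map(3, 3, 1.0, 1.0, 1.5, 1.5, identity, (1.0, 0.0, 0.0))
print(rays[1][1])  # centre pixel: direction along +z, moment along -y
```

Because every pixel receives its own ray, a change in camera pose changes the whole map densely, which is consistent with the pixel-level guidance the summary describes, in contrast to a sparse action encoding such as a short vector of pose parameters.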