Jin Xin of Eastern Institute of Technology: Finding a Unified "Spatial Language" for Autonomous Driving and Robotics | GAIR 2025
雷峰网 · 2025-12-14 06:27
" 当AI拥有「思维链」,赋予机器想象力的世界模型训练新范式。 " 作者丨吴彤 编辑丨 林觉民 在人工智能研究正以前所未有的速度迭代的今天,一位研究者如果同时聚焦于世界模型与具身智能这类高度前沿的课题,并且强调产业应用和市场接受度才是 技术真正的试金石,这可能本身就成为了一种值得关注的信号。 宁波东方理工大学助理教授金鑫便是这样一位研究者。 我们近期的一次交流,恰逢他的团队在美国圣地亚哥NeurIPS会议的活动告一段落——他与上海交通大学、布里斯托大学、清华大学等高校的合作者们在那组 织了一场关于"具身世界模型"( Embodied World Models for Decision Making)的研讨会,并有多位学界和产业界大咖受邀参加并作报告。 从早期的图像视频信号处理、压缩等底层视觉任务,到近年聚焦于表征解耦、世界模型、空间智能等方向,金鑫的研究不断从低维信息向高维信息跃迁,不断 尝试新的挑战,试图让机器变得更加智能,更好地理解物理世界并服务实际产业,其研究路径也反映出AI领域逐渐从简单的感知走向更加复杂的认知与决策。 然而,当对话触及这些光环之下的研究内核时,他表现出一种审慎。 "这只是我们团队现阶 ...
SJTU's OmniNWM: An "Omniscient" World Model Pushing the Limits of 3D Driving Simulation
自动驾驶之心 · 2025-10-24 16:03
Core Insights
- The article discusses the OmniNWM research, which proposes a panoramic, multi-modal driving navigation world model that significantly surpasses existing state-of-the-art (SOTA) models in generation quality, control precision, and long-horizon stability, setting a new benchmark for simulation training and closed-loop evaluation in autonomous driving [2][58].

Group 1: OmniNWM Features
- OmniNWM integrates state generation, action control, and reward evaluation into a unified framework, addressing the limitations of existing models that rely on single-modal RGB video and sparse action encoding [10][11].
- The model uses a Panoramic Diffusion Transformer (PDiT) to jointly generate pixel-aligned outputs across four modalities: RGB, semantics, depth, and 3D occupancy [12][11].
- OmniNWM introduces a normalized Plücker ray-map for action control, providing pixel-level guidance and improved generalization to out-of-distribution (OOD) trajectories [18][22] (a sketch of the ray-map construction follows after this summary).

Group 2: Challenges and Solutions
- The article identifies three core challenges for current autonomous-driving world models: limited state representations, ambiguous action control, and the lack of an integrated reward mechanism [8][10].
- OmniNWM's approach to state generation overcomes these limits by capturing the full geometric and semantic complexity of real-world driving scenes [10][11].
- The model's reward system is derived from the generated 3D occupancy, yielding a dense, built-in reward function for evaluating driving behavior [35][36] (a sketch of one such reward also follows after this summary).

Group 3: Performance Metrics
- OmniNWM supports long video sequences, generating over 321 frames with stable outputs, well beyond the ground-truth length [31][29].
- The model achieves significant gains in video generation quality, outperforming existing models on metrics such as FID and FVD [51][52].
- An integrated Vision-Language-Action (VLA) planner strengthens the model's multi-modal scene understanding and lets it output high-precision trajectories [43][50].
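To make the ray-map bullet above concrete, here is a minimal sketch of how a per-pixel Plücker ray-map can be computed for a pinhole camera. The function name, the unit-norm direction convention, and the (K, R, t) parameterization are illustrative assumptions; the summary does not spell out OmniNWM's exact normalization scheme.

```python
# Hypothetical sketch of a per-pixel Pluecker ray-map, assuming a pinhole
# camera with intrinsics K and a camera-to-world pose (R, t). Unit-norm
# directions are one common normalization convention, used here for
# illustration only.
import numpy as np

def plucker_ray_map(K: np.ndarray, R: np.ndarray, t: np.ndarray,
                    H: int, W: int) -> np.ndarray:
    """Return an (H, W, 6) map: unit ray direction d and moment m = t x d."""
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)
    # Back-project pixels to camera rays, then rotate into the world frame.
    d_cam = pix @ np.linalg.inv(K).T                            # (H, W, 3)
    d_world = d_cam @ R.T
    d_world /= np.linalg.norm(d_world, axis=-1, keepdims=True)  # normalize
    # Pluecker moment: cross product of the camera center with the direction.
    m = np.cross(np.broadcast_to(t, d_world.shape), d_world)
    return np.concatenate([d_world, m], axis=-1)                # (H, W, 6)
```

Because every pixel carries its own 6-channel ray encoding, a trajectory becomes a sequence of such maps that can be concatenated channel-wise with the video latents, which is one way to realize the "pixel-level guidance" the summary describes.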
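Likewise, the occupancy-derived reward mentioned in Group 2 can be sketched as follows. This is a hypothetical collision-penalty/progress reward over a binary ego-centric occupancy grid; the summary above does not specify the actual reward terms OmniNWM uses, so the grid layout, footprint size, and reward weights here are all assumptions.

```python
# Hypothetical sketch of a dense, occupancy-derived reward, assuming a
# binary ego-centric occupancy grid `occ` of shape (X, Y, Z) and a planned
# 2D trajectory in grid coordinates. Collision avoidance is shown as one
# plausible component, not as OmniNWM's actual reward.
import numpy as np

def occupancy_reward(occ: np.ndarray, traj_xy: np.ndarray,
                     footprint: int = 2,
                     z_slice: slice = slice(0, 4)) -> float:
    """Sum a per-step penalty for occupied voxels under the ego footprint."""
    # Collapse near-ground height slices into a 2D obstacle mask.
    ground = occ[:, :, z_slice].any(axis=-1)                    # (X, Y)
    reward = 0.0
    for x, y in traj_xy.astype(int):
        # Check the ego footprint around each waypoint for obstacles.
        patch = ground[max(x - footprint, 0): x + footprint + 1,
                       max(y - footprint, 0): y + footprint + 1]
        reward += -1.0 if patch.any() else 0.1  # collision penalty vs. progress
    return reward
```

The point of such a reward being "dense" is that every generated frame's occupancy contributes a signal, so a planner can be scored continuously along a rollout rather than only at sparse events.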