Latest World Model! WorldSplat: Gaussian-Centric Feedforward 4D Scene Generation for Autonomous Driving (Xiaomi & Nankai)
自动驾驶之心 · 2025-10-02 03:04
Core Insights
- The article introduces WorldSplat, a novel feedforward framework that integrates generative methods with explicit 3D reconstruction for 4D driving-scene synthesis, addressing the challenge of generating controllable, realistic driving-scene videos [5][36].

Background Review
- Generating controllable and realistic driving-scene videos is a core challenge in autonomous driving and computer vision, and is crucial for scalable training and closed-loop evaluation [5].
- Existing generative models have made progress toward high-fidelity, user-customized video generation, reducing reliance on expensive real data, while urban scene-reconstruction methods have improved 3D representations and cross-view consistency for novel-view synthesis [5][6].
- Despite these advances, both generative and reconstruction methods face challenges in creating unseen environments and synthesizing novel views, and existing video-generation models often lack 3D consistency and controllability [5][6].

WorldSplat Framework
- WorldSplat combines generative diffusion with explicit 3D reconstruction, constructing dynamic 4D Gaussian representations that can render novel views along any user-defined camera trajectory without per-scene optimization [6][10].
- The framework consists of three key modules: a 4D-aware latent diffusion model that generates multimodal latents, a latent Gaussian decoder that predicts 4D Gaussians in a feedforward pass and renders trajectories in real time, and an enhancement diffusion model that refines video quality [10][12].

Algorithm Details
- The 4D-aware latent diffusion model generates multimodal latents containing RGB, depth, and dynamic-object information, conditioned on user-defined controls [14][15].
- The latent Gaussian decoder predicts pixel-aligned 3D Gaussians, separating the static background from dynamic objects to form a unified 4D representation [20][21].
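The decoder's static/dynamic separation can be sketched as follows. This is a minimal illustration under assumed conventions, not the authors' implementation: the per-pixel parameter layout (`means`, `opacities`, `colors`), the `dynamic_mask` input, and the rigid per-frame translation of dynamic Gaussians are all simplifying assumptions.

```python
import numpy as np

def split_pixel_aligned_gaussians(means, opacities, colors, dynamic_mask):
    """Split per-pixel 3D Gaussians into static-background and dynamic-object sets.

    means:        (H, W, 3) predicted Gaussian centers, one per pixel
    opacities:    (H, W)    predicted opacity per Gaussian
    colors:       (H, W, 3) predicted RGB per Gaussian
    dynamic_mask: (H, W)    True where the pixel belongs to a dynamic object
    """
    flat_mask = dynamic_mask.reshape(-1)
    means = means.reshape(-1, 3)
    opacities = opacities.reshape(-1)
    colors = colors.reshape(-1, 3)
    static = {"means": means[~flat_mask],
              "opacities": opacities[~flat_mask],
              "colors": colors[~flat_mask]}
    dynamic = {"means": means[flat_mask],
               "opacities": opacities[flat_mask],
               "colors": colors[flat_mask]}
    return static, dynamic

def compose_4d(static, dynamic, motion_per_frame):
    """Build a unified per-frame Gaussian set: static Gaussians stay fixed,
    dynamic Gaussians are rigidly translated by each frame's motion offset
    (a toy stand-in for the paper's dynamic-object modeling)."""
    frames = []
    for offset in motion_per_frame:
        frames.append({
            "means": np.concatenate([static["means"], dynamic["means"] + offset]),
            "opacities": np.concatenate([static["opacities"], dynamic["opacities"]]),
            "colors": np.concatenate([static["colors"], dynamic["colors"]]),
        })
    return frames
```

Keeping the two sets separate until composition is what lets a single feedforward prediction serve every frame: only the (small) dynamic set changes over time.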
- The enhancement diffusion model refines the rendered RGB video conditioned on both the original input and the rendered video, enriching spatial detail and improving temporal coherence [24][27].

Experimental Results
- Extensive experiments show that WorldSplat achieves state-of-the-art performance in generating high-fidelity, temporally consistent free-view videos, and that these videos significantly benefit downstream driving tasks [12][36].
- Comparative results show that WorldSplat outperforms existing generative and reconstruction techniques in realism and novel-view quality [31][32].

Conclusion
- The proposed WorldSplat framework effectively integrates generative and reconstruction methods, producing explicit 4D Gaussians optimized for high-fidelity, temporally and spatially consistent multi-trajectory driving videos [36].
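Rendering along a user-defined trajectory reduces to evaluating the explicit Gaussians under arbitrary camera poses. The toy sketch below projects only the Gaussian centers through a pinhole camera; the actual method uses differentiable Gaussian splatting with full covariance and alpha blending, and the function name and pose convention here are assumptions.

```python
import numpy as np

def project_gaussians(means, K, cam_to_world):
    """Project 3D Gaussian centers into a camera placed along a user-defined trajectory.

    means:        (N, 3) Gaussian centers in world coordinates
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera pose for one trajectory sample
    Returns pixel coordinates of visible centers and a visibility mask.
    """
    world_to_cam = np.linalg.inv(cam_to_world)
    pts_h = np.concatenate([means, np.ones((means.shape[0], 1))], axis=1)
    cam_pts = (world_to_cam @ pts_h.T).T[:, :3]
    in_front = cam_pts[:, 2] > 1e-6          # keep only points in front of the camera
    uv = (K @ cam_pts[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]              # perspective divide
    return uv, in_front
```

Because the 4D Gaussians are explicit, this evaluation needs no network pass: a new trajectory only changes `cam_to_world` per frame, which is what enables real-time free-view rendering without per-scene optimization.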