Li Auto DrivingScene: Real-Time Reconstruction of Dynamic Driving Scenes from Two Images
理想TOP2 · 2025-11-02 09:08
Research Background and Challenges

- The safety and reliability of autonomous driving systems depend heavily on 4D dynamic scene reconstruction, i.e. real-time, high-fidelity perception of the 3D environment plus the time dimension.
- A core contradiction facing the industry is the limitation of static feedforward solutions, which assume "no dynamics in the scene" and therefore produce severe artifacts around moving targets such as vehicles and pedestrians, making them unsuitable for real driving scenarios [1].

Core Innovations

- Harbin Institute of Technology, in collaboration with Li Auto and other research teams, introduces three key design breakthroughs that unify real-time performance, high fidelity, and multi-task output [2].

Related Work Overview

- Static driving scene reconstruction methods such as DrivingForward, pixelSplat, MVSplat, and DepthSplat have shown limited ability to adapt to dynamic environments [3].

Key Technical Solutions

- A two-stage training paradigm: a robust static scene prior is learned from large-scale data before the dynamic module is trained, which addresses the instability of end-to-end training and reduces the complexity of dynamic modeling [4] (a schematic training loop is sketched below).
- A hybrid shared architecture with a residual flow network: a shared depth encoder and a single-camera decoder predict only the non-rigid motion residuals of dynamic objects, ensuring cross-view scale consistency and computational efficiency [4] (see the architecture sketch below).
- A purely visual online feedforward framework: two consecutive surround-view (panoramic) frames are taken as input, and 3D Gaussian point clouds, depth maps, and scene flow are output in real time, meeting the online perception needs of autonomous driving without offline optimization or multi-modal sensors [4] (see the interface sketch below).

Experimental Validation and Results Analysis

- The method significantly outperforms existing feedforward baselines: it reaches a PSNR of 28.76 dB, 2.66 dB higher than Driv3R and 2.7 dB higher than DrivingForward, and an SSIM of 0.895, indicating superior rendering fidelity [28] (what these dB margins imply for rendering error is worked out below).
- On efficiency, inference takes 0.21 seconds per frame, 38% faster than DrivingForward and 70% faster than Driv3R; training costs roughly 5 days with 27.3 GB of VRAM, significantly lower than Driv3R [30].
- Ablation studies confirm the necessity of the residual flow network, the two-stage training, and the flow distortion loss, highlighting their critical roles in dynamic modeling and rendering quality [32][34].
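Illustrative Code Sketches

To make the input/output contract of the online feedforward framework concrete, here is a minimal PyTorch-style sketch of the interface described above. It is not the paper's implementation: the class name DrivingSceneModel, the plain convolutional heads, the 11-channel Gaussian parameterization, and the six-camera surround-view layout are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class DrivingSceneModel(nn.Module):
    """Illustrative interface only: two consecutive surround-view frames in;
    per-pixel 3D Gaussian parameters, depth maps, and scene flow out."""

    def __init__(self, num_cams: int = 6, feat_dim: int = 64):
        super().__init__()
        self.num_cams = num_cams
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)   # stand-in for the shared depth encoder
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)           # per-pixel depth
        self.gauss_head = nn.Conv2d(feat_dim, 11, 1)          # scale(3) + rotation(4) + opacity(1) + color(3)
        self.flow_head = nn.Conv2d(2 * feat_dim, 3, 1)        # 3D scene flow

    def forward(self, frames_t0: torch.Tensor, frames_t1: torch.Tensor):
        # frames_*: (B, num_cams, 3, H, W) surround-view images at t and t+1
        B, N, _, H, W = frames_t1.shape
        f0 = self.encoder(frames_t0.flatten(0, 1))            # one shared encoder for both frames
        f1 = self.encoder(frames_t1.flatten(0, 1))
        depth = self.depth_head(f1).sigmoid()                 # (B*N, 1, H, W), normalized depth
        gaussians = self.gauss_head(f1)                       # (B*N, 11, H, W) per-pixel Gaussian parameters
        scene_flow = self.flow_head(torch.cat([f0, f1], 1))   # (B*N, 3, H, W) per-pixel 3D motion

        def unflatten(x):
            return x.view(B, N, -1, H, W)

        return unflatten(depth), unflatten(gaussians), unflatten(scene_flow)


# Usage: feed the two most recent surround-view frames each perception cycle.
model = DrivingSceneModel()
depth, gaussians, flow = model(torch.rand(1, 6, 3, 64, 96),
                               torch.rand(1, 6, 3, 64, 96))
```

The point of the contract is that every output comes from a single forward pass over the latest image pair, which is what permits online, per-frame operation without offline optimization.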
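The hybrid shared architecture can be read as: features from the shared depth encoder feed a lightweight head that predicts only the non-rigid residual, on top of the rigid 3D displacement already explained by ego motion. The decomposition below is an assumption based on that description; the helper rigid_flow_from_ego_motion and the layer sizes are hypothetical.

```python
import torch
import torch.nn as nn


def rigid_flow_from_ego_motion(points_cam: torch.Tensor,
                               T_rel: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: 3D displacement of static points induced purely
    by ego motion. points_cam: (B, 3, H, W) points back-projected from the
    predicted depth; T_rel: (B, 4, 4) relative camera pose between frames."""
    B, _, H, W = points_cam.shape
    pts = points_cam.flatten(2)                                   # (B, 3, H*W)
    pts_h = torch.cat([pts, torch.ones_like(pts[:, :1])], dim=1)  # homogeneous coordinates
    moved = (T_rel @ pts_h)[:, :3]                                # points after ego motion
    return (moved - pts).view(B, 3, H, W)


class ResidualFlowHead(nn.Module):
    """Predicts only the non-rigid residual on top of the rigid flow, so the
    static background contributes near-zero targets and dynamic objects
    dominate the learning signal."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * feat_dim + 3, feat_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(feat_dim, 3, 1),
        )

    def forward(self, feat_t0, feat_t1, points_cam, T_rel):
        rigid = rigid_flow_from_ego_motion(points_cam, T_rel)
        residual = self.net(torch.cat([feat_t0, feat_t1, rigid], dim=1))
        return rigid + residual, residual   # total scene flow, non-rigid part only
```

Restricting the learned output to the residual keeps the regression target small and sparse, which is one reading of why the article credits this design with computational efficiency.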
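A possible reading of the two-stage training paradigm, building on the hypothetical DrivingSceneModel above, is: first fit the static branch (shared encoder plus depth and Gaussian heads), then freeze that prior and train only the flow head. The loop below is a schematic under that assumption; static_loss and dynamic_loss are placeholders, not the paper's exact objectives.

```python
import torch


def train_two_stage(model, static_loader, dynamic_loader,
                    static_loss, dynamic_loss,
                    stage1_epochs=10, stage2_epochs=10, lr=1e-4):
    """Schematic two-stage schedule: learn a static scene prior first,
    then train the dynamic (residual flow) module on top of it."""
    # Stage 1: fit the static branch on large-scale data.
    static_params = (list(model.encoder.parameters())
                     + list(model.depth_head.parameters())
                     + list(model.gauss_head.parameters()))
    opt = torch.optim.Adam(static_params, lr=lr)
    for _ in range(stage1_epochs):
        for frames_t0, frames_t1, target in static_loader:
            depth, gaussians, _ = model(frames_t0, frames_t1)
            loss = static_loss(gaussians, depth, target)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: freeze the static prior so the dynamic module only has to
    # explain what the static reconstruction cannot.
    for p in static_params:
        p.requires_grad_(False)
    opt = torch.optim.Adam(model.flow_head.parameters(), lr=lr)
    for _ in range(stage2_epochs):
        for frames_t0, frames_t1, target in dynamic_loader:
            _, _, scene_flow = model(frames_t0, frames_t1)
            # dynamic_loss would bundle the photometric term and the flow
            # distortion regularizer mentioned in the ablations.
            loss = dynamic_loss(scene_flow, target)
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```

Decoupling the stages matches the article's rationale: the static prior is stable to learn on its own, and the dynamic module then faces a much smaller, residual problem.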
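To put the reported margins in perspective: assuming the standard definition PSNR = 10 * log10(MAX^2 / MSE), a fixed PSNR gain corresponds to a fixed fractional reduction in mean squared rendering error, independent of the baseline. The snippet below applies that identity to the gains quoted above; it is a unit conversion, not additional data from the paper.

```python
import math


def psnr_db(mse: float, max_val: float = 1.0) -> float:
    """Standard peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE)."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_reduction_for_gain(delta_db: float) -> float:
    """Fractional MSE reduction implied by a PSNR gain of delta_db dB."""
    return 1.0 - 10.0 ** (-delta_db / 10.0)


print(f"{mse_reduction_for_gain(2.66):.1%}")  # ~45.8% lower MSE than Driv3R
print(f"{mse_reduction_for_gain(2.70):.1%}")  # ~46.3% lower MSE than DrivingForward
```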