Feed-Forward 3D Survey: 3D Vision Enters the "One-Step" Era

Core Insights
- The article traces the evolution of 3D vision from traditional methods such as Structure-from-Motion (SfM) to Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), and presents Feed-Forward 3D as a new paradigm for the AI-driven era [2][6].

Summary by Categories

1. Technological Evolution
- The article outlines the historical progression of 3D vision, noting that earlier methods often required per-scene optimization, which was slow and generalized poorly to new scenes [2][6].
- Feed-Forward 3D is introduced as a paradigm that aims to overcome these limitations, enabling faster and more generalizable 3D understanding [2].

2. Classification of Feed-Forward 3D Methods
- The article groups Feed-Forward 3D methods into five main architecture families:
  1. NeRF-based Models: NeRF provides a differentiable framework for volume rendering, but scene-specific optimization makes it inefficient. Conditional NeRF approaches have emerged that predict radiance fields directly [8].
  2. PointMap Models: Led by DUSt3R, these models predict pixel-aligned 3D point maps directly within a Transformer framework, eliminating the need for camera pose input [10].
  3. 3D Gaussian Splatting (3DGS): This representation uses Gaussian point clouds to balance rendering quality and speed, and recent feed-forward work outputs the Gaussian parameters directly [11][13].
  4. Mesh / Occupancy / SDF Models: These methods combine traditional geometric modeling with modern techniques such as Transformers and Diffusion models [14].
  5. 3D-Free Models: These models learn a mapping from multi-view inputs to novel views without relying on an explicit 3D representation [15].

3. Applications and Tasks
- The article highlights diverse applications of Feed-Forward models [19]:
  - Pose-free reconstruction and view synthesis
  - Dynamic 4D reconstruction and video diffusion
  - SLAM and visual localization
  - 3D-aware image and video generation
  - Digital human modeling
  - Robotic manipulation and world modeling

4. Benchmarking and Evaluation Metrics
- The article covers more than 30 commonly used 3D datasets spanning various scene types and modalities, and summarizes standard evaluation metrics such as PSNR, SSIM, and Chamfer Distance for future model comparisons [20][21].

5. Future Challenges and Trends
- The article identifies four major open questions for future research: the need for multi-modal data, improvements in reconstruction accuracy, challenges in free-viewpoint rendering, and the limits of long-context reasoning when processing long frame sequences [25][26].
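
As an illustration of two of the evaluation metrics named under Benchmarking and Evaluation Metrics, the following is a minimal Python sketch of PSNR (image quality) and Chamfer Distance (point-cloud geometry). It is not taken from the survey. The function names, the flattened-list image format, and the choice of squared distances averaged per direction are assumptions made here for illustration, and published benchmarks differ on these conventions (some use unsquared distances or halve the sum).

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images.

    Assumes pixel values lie in [0, max_val]; higher is better.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def _mean_nearest_sq_dist(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean over src of the squared distance to the nearest point in dst."""
    total = 0.0
    for p in src:
        total += min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst)
    return total / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two point sets; lower is better.

    Assumed convention: sum of the two directional means of squared
    nearest-neighbour distances. Brute force, O(len(a) * len(b)).
    """
    if len(a) == 0 or len(b) == 0:
        raise ValueError("point sets must be non-empty")
    return _mean_nearest_sq_dist(a, b) + _mean_nearest_sq_dist(b, a)


if __name__ == "__main__":
    # Toy data, made up for this example.
    rendered = [0.0, 0.0, 0.0, 0.0]
    reference = [0.1, 0.1, 0.1, 0.1]
    print("PSNR (dB):", psnr(rendered, reference))

    predicted_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    ground_truth_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    print("Chamfer:", chamfer_distance(predicted_points, ground_truth_points))
```

With the toy data above, a uniform error of 0.1 on a [0, 1] image gives a PSNR of approximately 20 dB, and the two point sets have a Chamfer distance of 1.0 under this convention. The brute-force nearest-neighbour search is only practical for small point sets; real evaluations typically use a spatial index or a GPU implementation.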