Computer Vision

A Survey of 3D Reconstruction: The Evolution from Multi-View Geometry to NeRF and 3DGS
自动驾驶之心· 2025-09-22 23:34
Core Viewpoint
- 3D reconstruction sits at a critical intersection of computer vision and graphics, serving as the digital foundation for cutting-edge applications such as virtual reality, augmented reality, autonomous driving, and digital twins. Recent advances in novel view synthesis, represented by Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have significantly improved reconstruction quality, speed, and dynamic adaptability [5][6].

Group 1: Introduction and Demand
- The resurgence of interest in 3D reconstruction is driven by new application demands across fields: city-scale digital twins requiring kilometer-level coverage with centimeter-level accuracy, autonomous-driving simulation needing dynamic traffic flow and real-time semantics, and AR/VR social applications demanding over 90 FPS at photo-realistic quality [6].
- Traditional reconstruction pipelines cannot meet these requirements, prompting the integration of geometry, texture, and lighting through differentiable rendering [6].

Group 2: Traditional Multi-View Geometry Reconstruction
- The traditional multi-view geometry pipeline (SfM to MVS) has inherent limitations in quality, efficiency, and adaptability to dynamic scenes, which NeRF and 3DGS have addressed through iterative advances [7].
- A comprehensive comparison of methods highlights the field's evolution and its remaining challenges [7].

Group 3: NeRF and Its Innovations
- NeRF models a scene as a continuous 5D function, enabling rendering techniques that evolved rapidly from 2020 to 2024 and address data requirements, texture limitations, lighting sensitivity, and dynamic scenes [13][15].
- A range of follow-up methods improves quality and efficiency, including Mip-NeRF, NeRF-W, and InstantNGP, each contributing faster rendering or lower memory usage [17][18].

Group 4: 3DGS and Its Advancements
- 3DGS represents a scene as a collection of 3D Gaussians, allowing efficient rendering with high-quality output; recent methods focus on optimizing rendering quality and efficiency, with significant gains in memory usage and frame rate [22][26].
- Comparisons of 3DGS with other methods show its superiority in rendering speed and dynamic-scene reconstruction [31].

Group 5: Future Trends and Conclusion
- The next five years are expected to bring hybrid representations, real-time processing on mobile devices, generative reconstruction techniques, and multi-modal fusion for robust reconstruction [33].
- The ultimate goal is real-time 3D reconstruction accessible to everyone, marking a shift toward ubiquitous computing [34].
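As background for the NeRF and 3DGS discussion above, here is a minimal NumPy sketch of the volume-rendering quadrature NeRF uses to turn its 5D radiance function into pixels; 3DGS composites Gaussians with essentially the same alpha-blending weights. The function name and toy values are illustrative, not taken from the survey.

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """NeRF-style volume-rendering quadrature along a single ray.

    sigmas: (N,) non-negative densities at N samples along the ray
    colors: (N, 3) RGB predicted at each sample
    deltas: (N,) spacing between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)           # per-segment opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]  # transmittance T_i
    weights = trans * alphas                          # each sample's contribution
    return (weights[:, None] * colors).sum(axis=0)    # composited pixel color

# toy ray: opaque reddish matter near the middle, empty space elsewhere
sigmas = np.array([0.0, 5.0, 5.0, 0.1])
colors = np.array([[0, 0, 0], [1.0, 0.2, 0.2], [0.9, 0.1, 0.1], [0, 0, 0.0]])
deltas = np.full(4, 0.25)
print(volume_render(sigmas, colors, deltas))
```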
HKUST, Horizon Robotics & Zhejiang University jointly open-source SAIL-Recon: reconstructing a city in three minutes
自动驾驶之心· 2025-09-02 23:33
Source: 3D视觉之心 (3D vision, SLAM, and point-cloud content)

Anchor-frame neural maps upend traditional SfM

Structure-from-Motion (SfM) algorithms estimate camera poses and scene structure simultaneously from an unordered set of images, and sit at the core of many computer vision applications. Traditional SfM follows one of two routes, incremental or global, and both rely on feature extraction, matching, triangulation, and bundle adjustment; these modules fail easily in low-texture, blurry, or repetitive-texture scenes.

Recent work proposes end-to-end learnable SfM pipelines that regress scene structure and camera poses directly from images. DUSt3R pioneered using a Transformer to regress a scene coordinate map (SCM) from two unposed images, and follow-up work extended this to multiple images and even video, but GPU memory still constrains these methods, making large-scale scenes with thousands of images hard to handle.

Meanwhile, existing scene-regression methods overlook visual localization, a fundamental task in 3D vision that can provide key support for scaling SfM systems: SLAM systems build maps only at keyframes and merely localize non-keyframes against that map, saving compute and memory.

Paper title: SAIL-Recon: Large SfM by Augmenting Scene Regression with Local ...
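To make the localization idea concrete, here is a toy sketch of one classical way to localize a camera against a scene-coordinate map: PnP with RANSAC via OpenCV. All data and parameters below are synthetic inventions for illustration; SAIL-Recon's anchor-frame mechanism is learned and considerably more involved.

```python
import numpy as np
import cv2  # pip install opencv-python

# Toy illustration: localizing a camera against a scene-coordinate map with
# PnP + RANSAC, the classical step that lets a mapping system register frames
# it never reconstructed. Everything below is synthetic.
rng = np.random.default_rng(0)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # assumed intrinsics

rvec_gt = np.array([0.05, -0.02, 0.01])   # ground-truth rotation (Rodrigues)
tvec_gt = np.array([0.10, -0.05, 2.00])   # ground-truth translation

# "Scene-coordinate map": one known 3D point per sampled pixel.
pts3d = rng.uniform([-1, -1, 3], [1, 1, 6], size=(200, 3))
proj, _ = cv2.projectPoints(pts3d, rvec_gt, tvec_gt, K, None)
pts2d = proj.reshape(-1, 2) + rng.normal(0, 0.5, (200, 2))   # add pixel noise

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts3d.astype(np.float32), pts2d.astype(np.float32), K, None,
    reprojectionError=3.0)
print(ok, rvec.ravel(), tvec.ravel(), len(inliers))  # pose close to ground truth
```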
A diverse, large-scale dataset! SceneSplat++: the first comprehensive benchmark built on 3DGS
自动驾驶之心· 2025-06-20 14:06
Core Insights
- The article introduces SceneSplat-Bench, a comprehensive benchmark for evaluating vision-language scene understanding methods based on 3D Gaussian Splatting (3DGS) [11][30].
- It presents SceneSplat-49K, a large-scale dataset containing approximately 49,000 raw scenes and 46,000 filtered 3DGS scenes, the most extensive open-source dataset for complex, high-quality scene-level 3DGS reconstruction [9][30].
- The evaluation indicates that generalizable methods consistently outperform per-scene optimization methods, establishing a new paradigm for scalable scene understanding through pre-trained models [30].

Evaluation Protocols
- The benchmark evaluates methods on two key metrics in 3D space, foreground mean Intersection over Union (f-mIoU) and foreground mean accuracy (f-mAcc), addressing object-size imbalance and reducing viewpoint dependency compared to 2D evaluations [22][30].
- The evaluation data include ScanNet, ScanNet++, and Matterport3D for indoor scenes and HoliCity for outdoor scenes, stressing the methods' capabilities across object scales and complex environments [22][30].

Dataset Contributions
- SceneSplat-49K is compiled from multiple sources, including SceneSplat-7K, DL3DV-10K, HoliCity, and Aria Synthetic Environments, ensuring a diverse range of indoor and outdoor environments [9][10].
- Dataset preparation took approximately 891 GPU-days plus extensive human effort, underscoring the resources invested in building a high-quality dataset [7][9].

Methodological Insights
- The article categorizes methods into three types: per-scene optimization methods, per-scene optimization-free methods, and generalizable methods, with SceneSplat representing the latter [23][30].
- Generalizable methods eliminate the need for extensive single-scene computation during inference, processing 3D scenes efficiently in a single forward pass [24][30].

Performance Results
- Results on SceneSplat-Bench show that SceneSplat excels in both performance and efficiency, often surpassing the pseudo-label methods used for its own pre-training [24][30].
- Performance varies significantly with dataset complexity, highlighting the importance of challenging benchmarks for exposing the limitations of competing methods [28][30].
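As a rough sketch of the foreground-aware metric, here is one plausible way to compute f-mIoU over per-point labels while excluding background ids. The function and ignore-set are assumptions for illustration; the benchmark's exact protocol may differ.

```python
import numpy as np

def foreground_miou(pred, gt, num_classes, ignore_ids=(0,)):
    """Foreground mean IoU over per-point semantic labels (hypothetical helper).

    pred, gt: (N,) integer class labels, one per 3D point/Gaussian
    ignore_ids: label ids treated as background, excluded from the mean
    """
    ious = []
    for c in range(num_classes):
        if c in ignore_ids:
            continue
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:               # skip classes absent from both label maps
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

pred = np.array([1, 1, 2, 2, 3, 0])
gt   = np.array([1, 2, 2, 2, 3, 0])
print(foreground_miou(pred, gt, num_classes=4))  # ~0.72
```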
No expensive equipment required: a monocular pipeline generates ultra-realistic 3D avatars; new research from Tsinghua & IDEA accepted to CVPR 2025
量子位· 2025-05-22 14:29
Core Viewpoint
- The article discusses HRAvatar, a method for creating high-quality, relightable 3D avatars from monocular video, addressing challenges in animation, real-time rendering, and visual realism [1][4][6].

Group 1: Methodology and Innovations
- HRAvatar uses a learnable deformation basis with linear blend skinning to achieve flexible, precise geometric deformation [1][6].
- An end-to-end expression encoder improves the accuracy of expression-parameter extraction, reducing tracking error while preserving generalization [6][10].
- The method decomposes the avatar's appearance into material properties such as albedo, roughness, and Fresnel reflectance, employing a simplified BRDF model for shading [6][16].

Group 2: Performance and Results
- HRAvatar leads across metrics, achieving a PSNR of 30.36, MAE of 0.845, SSIM of 0.9482, and LPIPS of 0.0569, outperforming existing methods [24][26].
- It renders in real time at roughly 155 FPS under driving and relighting conditions [25].
- Experiments indicate that HRAvatar excels in detail richness and quality, particularly on LPIPS, suggesting enhanced avatar detail [24][34].

Group 3: Applications and Future Directions
- Reconstructed avatars can be animated and relit under new environmental lighting, and support simple material editing [28].
- HRAvatar broadens the application scenarios of monocular Gaussian avatar modeling, and the code is open-sourced [35][36].
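For intuition about the geometry pipeline, here is a minimal NumPy sketch of the generic pattern HRAvatar builds on: blendshape-style offsets from a deformation basis, followed by linear blend skinning. All shapes, names, and values are illustrative; the paper's learnable basis and optimization details are not reproduced here.

```python
import numpy as np

def deform_and_skin(verts, basis, coeffs, bone_T, skin_w):
    """Blendshape offsets from a deformation basis, then linear blend skinning.

    verts:  (V, 3) rest-pose vertices
    basis:  (V, 3, K) deformation basis (learnable in a real pipeline)
    coeffs: (K,) expression coefficients
    bone_T: (J, 4, 4) per-joint rigid transforms
    skin_w: (V, J) skinning weights, each row summing to 1
    """
    shaped = verts + basis @ coeffs                        # add expression offsets
    homo = np.concatenate([shaped, np.ones((len(verts), 1))], axis=1)
    per_vert_T = np.einsum("vj,jab->vab", skin_w, bone_T)  # blend joint transforms
    return np.einsum("vab,vb->va", per_vert_T, homo)[:, :3]

rng = np.random.default_rng(1)
verts = rng.normal(size=(4, 3))
basis = rng.normal(scale=0.01, size=(4, 3, 2))
bone_T = np.stack([np.eye(4), np.eye(4)])
bone_T[1, :3, 3] = [0.1, 0.0, 0.0]                         # translate second joint
skin_w = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.8, 0.2]])
print(deform_and_skin(verts, basis, np.array([0.5, -0.3]), bone_T, skin_w))
```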
ICML 2025 Spotlight | Probing adversarial image perturbations with Fourier decomposition; code now open-sourced
机器之心· 2025-05-18 04:25
Core Viewpoint
- The article discusses a novel approach to adversarial purification in computer vision that works in the frequency domain, separating adversarial perturbations from clean images while preserving semantic information [5][21].

Research Background
- Adversarial samples pose significant challenges to the safety and robustness of computer-vision models, making purification techniques that recover the original clean image essential [5].
- Existing purification methods fall into training-based and diffusion-model-based approaches; the latter generalize better and require no extensive training data [5][6].

Motivation and Theoretical Analysis
- The key to successful purification is eliminating adversarial perturbations while retaining the semantic information of the original image [9].
- Current strategies that add noise to mask adversarial perturbations often excessively damage the semantic content of the original image [9].
- Using Fourier decomposition, the study analyzes how perturbations are distributed and finds they predominantly affect high-frequency components, while low-frequency components are more robust [9][12].

Methodology
- A filter is constructed to retain low-frequency amplitude-spectrum components, which are less affected by adversarial perturbations, allowing them to substitute for those of the original clean image [14][15].
- Because the phase spectrum is influenced by adversarial perturbations across all frequencies, a projection method preserves the integrity of the phase information [16][17].

Experimental Results
- The proposed method improves both standard and robust accuracy over state-of-the-art (SOTA) methods on datasets such as CIFAR10 and ImageNet [18][19].
- Visualizations show purified images closely resembling the original clean images, confirming the approach's effectiveness [20].

Conclusion
- While significant progress has been made in preserving semantic information and removing adversarial perturbations, more effective image-decomposition methods and deeper theoretical explanations remain future research directions [21].
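A small NumPy sketch of the amplitude/phase machinery involved: swapping the low-frequency amplitude spectrum of one image into another while keeping the phase. The disk-shaped mask, radius, and function name are assumptions for illustration; the paper's filter design and phase-projection step differ in detail.

```python
import numpy as np

def swap_lowfreq_amplitude(src, ref, radius):
    """Keep src's phase; replace its low-frequency amplitude with ref's.

    Illustrates the amplitude/phase decomposition the method builds on:
    low-frequency amplitude is robust to perturbations, so it can be
    preserved or restored while the rest is regenerated.
    """
    Fs = np.fft.fftshift(np.fft.fft2(src))
    Fr = np.fft.fftshift(np.fft.fft2(ref))
    amp, phase = np.abs(Fs), np.angle(Fs)
    h, w = src.shape
    yy, xx = np.mgrid[0:h, 0:w]
    low = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2  # centered disk
    amp[low] = np.abs(Fr)[low]                  # swap low-frequency amplitude
    out = np.fft.ifft2(np.fft.ifftshift(amp * np.exp(1j * phase)))
    return np.real(out)

rng = np.random.default_rng(0)
clean = rng.random((64, 64))
perturbed = clean + 0.05 * rng.standard_normal((64, 64))
purified = swap_lowfreq_amplitude(perturbed, clean, radius=8)
print(np.abs(purified - clean).mean(), np.abs(perturbed - clean).mean())
```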
CVPR 2025 Oral | DiffFNO: Fourier neural operators boost diffusion, opening a new chapter for arbitrary-scale super-resolution
机器之心· 2025-05-04 04:57
Core Viewpoint
- The article discusses the development of DiffFNO, a novel method that enhances diffusion models with neural operators to achieve high-quality, efficient super-resolution (SR) at any continuous scaling factor, addressing the challenges of traditional models [2][4].

Group 1: Methodology Overview
- DiffFNO consists of three main components, a Weighted Fourier Neural Operator (WFNO), a Gated Fusion Mechanism, and an Adaptive ODE Solver, which together improve the quality and efficiency of image reconstruction [2][5].
- The WFNO captures global information through frequency-domain convolution and amplifies high-frequency components with learnable frequency weights, yielding a PSNR improvement of roughly 0.3–0.5 dB on high-magnification tasks [10].
- The Gated Fusion Mechanism integrates a lightweight attention operator (AttnNO) that captures local spatial features, flexibly combining spectral and spatial information [12][13].

Group 2: Adaptive ODE Solver
- The Adaptive ODE Solver recasts the diffusion model's reverse process from a stochastic SDE into a deterministic ODE, cutting the number of denoising steps from over a thousand to about thirty and markedly speeding up inference [15].
- The method preserves image quality while halving inference time from 266 ms to roughly 141 ms, and performs even better at larger scaling factors [15].

Group 3: Experimental Validation
- DiffFNO outperforms various state-of-the-art (SOTA) methods by 2–4 dB in PSNR across multiple benchmark datasets, excelling particularly at high magnifications such as ×8 and ×12 [17][20].
- The method retains the complete Fourier spectrum, balancing global image structure against local detail, and uses learnable frequency weights to dynamically adjust the influence of each frequency band [18].

Group 4: Conclusion
- DiffFNO offers a new way to reconcile high precision with low computational cost in super-resolution, suiting fields that demand high image quality such as medical imaging, exploration, and gaming [22].
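To illustrate the spectral-weighting idea behind WFNO, here is a NumPy forward pass of a Fourier-layer channel mixing over retained low-frequency modes, scaled by a per-frequency gain. The shapes, names, and single-sided mode truncation are simplifications of a generic FNO layer under stated assumptions, not DiffFNO's actual implementation.

```python
import numpy as np

def weighted_spectral_conv(x, W, freq_gain, modes):
    """One spectral-convolution forward pass with per-frequency gains.

    x:         (C, H, W) input feature map
    W:         (modes, modes, C, C) complex channel-mixing weights
    freq_gain: (modes, modes) gains that re-weight retained modes
               (learnable in a real model, e.g. to amplify high frequencies)
    modes:     number of low-frequency modes kept per axis
    """
    Fx = np.fft.rfft2(x)                          # (C, H, W//2 + 1)
    out = np.zeros_like(Fx)
    block = Fx[:, :modes, :modes]                 # retained low-frequency modes
    mixed = np.einsum("mnio,imn->omn", W, block)  # mix channels per mode
    out[:, :modes, :modes] = freq_gain[None] * mixed
    return np.fft.irfft2(out, s=x.shape[-2:])     # back to the spatial domain

rng = np.random.default_rng(0)
C, H, Wd, modes = 4, 32, 32, 8
x = rng.normal(size=(C, H, Wd))
W = (rng.normal(size=(modes, modes, C, C))
     + 1j * rng.normal(size=(modes, modes, C, C))) / C
print(weighted_spectral_conv(x, W, np.ones((modes, modes)), modes).shape)  # (4, 32, 32)
```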