3D Reconstruction
Segmenting everything is not enough; everything should also be reconstructed in 3D. SAM 3D is here
具身智能之心· 2025-11-21 00:04
Core Viewpoint
- Meta has launched significant updates with SAM 3D and SAM 3, extending the SAM family from 2D segmentation toward 3D understanding and providing advanced capabilities for object detection, segmentation, and tracking in images and videos [2][6][40].

Group 1: SAM 3D Overview
- SAM 3D is the latest addition to the SAM series, comprising two models, SAM 3D Objects and SAM 3D Body, both delivering state-of-the-art performance in converting 2D images into detailed 3D reconstructions [2][4].
- SAM 3D Objects generates 3D models from a single image, overcoming the limitations of traditional 3D modeling pipelines that often rely on isolated or synthetic data [11][15].
- Meta annotated nearly 1 million real-world images and generated approximately 3.14 million 3D meshes, using a scalable data engine to raise both the quality and the quantity of 3D training data [20][26].

Group 2: SAM 3D Body
- SAM 3D Body focuses on accurate 3D human pose and shape reconstruction from single images, maintaining high quality even in complex scenarios with occlusions and unusual poses [28][30].
- The model is interactive, allowing users to guide and control predictions, which improves accuracy and usability [29].
- A high-quality training dataset of around 8 million images was built to improve performance across various 3D benchmarks [33].

Group 3: SAM 3 Capabilities
- SAM 3 introduces promptable concept segmentation, enabling the model to detect and segment specific concepts from text or example-image prompts, significantly improving concept recognition (a hypothetical usage sketch follows this summary) [40][42].
- The SAM 3 architecture builds on previous advances, using components such as the Meta Perception Encoder and DETR for stronger image recognition and object detection [42][44].
- SAM 3 achieves a twofold increase in cgF1 scores for concept recognition and maintains near real-time performance for images with over 100 detection targets, completing inference in approximately 30 milliseconds on H200 GPUs [44].
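The workflow described above, text-prompted concept segmentation feeding single-image 3D reconstruction, can be pictured as a small pipeline. The sketch below is purely illustrative: the class and method names (`SAM3ConceptSegmenter`, `SAM3DObjectReconstructor`, `segment_by_text`, `reconstruct`) are hypothetical stand-ins, not Meta's published API.

```python
# Hypothetical interface names only; not Meta's actual SAM 3 / SAM 3D API.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class ConceptMask:
    label: str
    mask: np.ndarray   # HxW boolean mask for one detected instance
    score: float       # detection confidence


class SAM3ConceptSegmenter:
    """Stand-in for a promptable concept segmenter: text prompt -> instance masks."""

    def segment_by_text(self, image: np.ndarray, prompt: str) -> List[ConceptMask]:
        raise NotImplementedError("placeholder for a real model call")


class SAM3DObjectReconstructor:
    """Stand-in for a single-image object reconstructor: image + mask -> 3D mesh."""

    def reconstruct(self, image: np.ndarray, mask: np.ndarray) -> dict:
        raise NotImplementedError("placeholder for a real model call")


def concept_to_meshes(image: np.ndarray, prompt: str,
                      segmenter: SAM3ConceptSegmenter,
                      reconstructor: SAM3DObjectReconstructor,
                      min_score: float = 0.5) -> List[dict]:
    """Segment every instance matching a text prompt, then reconstruct each as a mesh."""
    meshes = []
    for inst in segmenter.segment_by_text(image, prompt):
        if inst.score >= min_score:
            meshes.append(reconstructor.reconstruct(image, inst.mask))
    return meshes
```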
Saining Xie praises ByteDance Seed's new research! A single Transformer handles any-view 3D reconstruction
量子位· 2025-11-18 05:02
A single Transformer handles any-view 3D reconstruction. This is the latest work from Bingyi Kang's team at ByteDance Seed, Depth Anything 3 (DA3), and it has drawn praise from Saining Xie.

The architecture is deliberately simple, yet the core capability holds up: from a single photo, a set of multi-view images, or even a casually shot video, DA3 can accurately estimate depth and recover camera poses. It can assemble a complete 3D scene and even synthesize novel-view images that were never captured.

On the team's newly built visual geometry benchmark, DA3 sweeps every task: camera localization accuracy improves by 35.7% on average, geometric reconstruction accuracy rises by 23.6%, and monocular depth estimation surpasses its predecessor, DA2.

Previous 3D vision models were narrowly specialized: single-image depth estimation required one dedicated model, multi-view 3D reconstruction another architecture, and even camera pose estimation needed its own module. This drove up development cost, prevented full use of large-scale pretrained models, and left the systems heavily data-dependent. So what does DA3's single minimalist recipe actually look like?

A minimalist design can still compete. The recipe has two core points: use only a plain vision Transformer as the backbone, and reduce the prediction targets to two essentials, depth and rays (a hedged sketch of such a two-head design follows below). From the architecture diagram, DA3's pipeline has four main stages. The first is input processing ...
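The "one plain vision Transformer, two prediction targets" recipe can be sketched as a shared backbone with a depth head and a ray head. The code below is a minimal illustration under stated assumptions (a vanilla PyTorch Transformer encoder standing in for the real backbone, guessed shapes and head designs); it is not the actual Depth Anything 3 implementation.

```python
# Minimal sketch: plain ViT-style encoder + two prediction targets (depth and rays).
# Shapes and head designs are assumptions, not the Depth Anything 3 code.
import torch
import torch.nn as nn


class DepthRayHead(nn.Module):
    """Predicts per-patch depth (1 channel) and a ray direction (3 channels) from tokens."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.depth = nn.Sequential(nn.Linear(embed_dim, 256), nn.GELU(), nn.Linear(256, 1))
        self.rays = nn.Sequential(nn.Linear(embed_dim, 256), nn.GELU(), nn.Linear(256, 3))

    def forward(self, tokens: torch.Tensor):
        depth = self.depth(tokens).squeeze(-1)                               # (B, N) per-patch depth
        rays = torch.nn.functional.normalize(self.rays(tokens), dim=-1)      # (B, N, 3) unit rays
        return depth, rays


class MinimalDA3Sketch(nn.Module):
    """Plain Transformer encoder shared across views, followed by the depth/ray head."""

    def __init__(self, embed_dim: int = 768, num_layers: int = 12, heads: int = 12):
        super().__init__()
        self.patch = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)      # 16x16 patch embedding
        layer = nn.TransformerEncoderLayer(embed_dim, heads, 4 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = DepthRayHead(embed_dim)

    def forward(self, images: torch.Tensor):
        # images: (B*V, 3, H, W), with views flattened into the batch dimension
        tokens = self.patch(images).flatten(2).transpose(1, 2)               # (B*V, N, C)
        tokens = self.encoder(tokens)
        return self.head(tokens)                                             # per-patch depth and rays
```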
The first instance-aware 3D reconstruction model: NTU & 阶越 propose an instance-decoupled 3D reconstruction model to aid scene understanding
36Ke· 2025-10-31 08:28
Core Insights
- The article discusses the challenge AI faces in jointly perceiving 3D geometry and semantic content, and the limitations of traditional methods that separate 3D reconstruction from spatial understanding. A new approach, IGGT (Instance-Grounded Geometry Transformer), integrates both into a unified model and improves performance across a range of tasks [1].

Group 1: IGGT Model Development
- IGGT is an end-to-end unified large Transformer that combines spatial reconstruction and instance-level contextual understanding in a single model [1].
- The model is built on a new large-scale dataset, InsScene-15K, which includes 15,000 scenes and 200 million images with high-quality, 3D-consistent instance-level masks [2].
- IGGT introduces the "Instance-Grounded Scene Understanding" paradigm, allowing it to operate independently of any specific vision language model (VLM) and to integrate seamlessly with various VLMs and large multimodal models (LMMs) [3].

Group 2: Applications and Capabilities
- The unified representation significantly expands downstream capabilities, supporting spatial tracking, open-vocabulary segmentation, and scene question answering (QA) [4].
- The architecture includes a Geometry Head that predicts camera parameters and depth maps and an Instance Head that decodes instance features, enhancing spatial perception (a hedged sketch of this dual-head layout follows below) [11][18].
- IGGT achieves strong performance on instance 3D tracking, with tracking IoU and success rates reaching 70% and 90%, respectively, and is the only model able to keep tracking objects that disappear and reappear [16].

Group 3: Data Collection and Processing
- The InsScene-15K dataset is constructed through a novel data curation process that integrates three data sources: synthetic data, real-world video capture, and RGBD capture [6][9][10].
- Synthetic data generated in simulated environments provides perfectly accurate segmentation masks, while real-world data is processed through a custom pipeline to ensure temporal consistency [8][9].

Group 4: Performance Comparison
- IGGT outperforms existing models in reconstruction, understanding, and tracking, with significant gains on understanding and tracking metrics [16].
- Its instance masks can serve as prompts for VLMs, enabling open-vocabulary semantic segmentation and complex object-centric question answering [19][24].
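The Geometry Head / Instance Head split can be illustrated with two small decoders over a shared token representation. The sketch below uses assumed dimensions and head designs for illustration only; it is not IGGT's actual code.

```python
# Minimal sketch of a dual-head layout: shared tokens -> Geometry Head (camera + depth)
# and Instance Head (per-pixel instance embeddings). Dimensions are assumptions.
import torch
import torch.nn as nn


class GeometryHead(nn.Module):
    def __init__(self, dim: int = 768, patch: int = 16):
        super().__init__()
        self.camera = nn.Linear(dim, 7)              # e.g. quaternion (4) + translation (3)
        self.depth = nn.Linear(dim, patch * patch)   # per-patch depth, unfolded to pixels

    def forward(self, tokens: torch.Tensor):
        cam = self.camera(tokens.mean(dim=1))        # one pose estimate from pooled tokens
        depth = self.depth(tokens)                   # (B, N, patch*patch)
        return cam, depth


class InstanceHead(nn.Module):
    def __init__(self, dim: int = 768, embed: int = 64, patch: int = 16):
        super().__init__()
        self.proj = nn.Linear(dim, patch * patch * embed)
        self.embed, self.patch = embed, patch

    def forward(self, tokens: torch.Tensor):
        B, N, _ = tokens.shape
        feats = self.proj(tokens).view(B, N, self.patch * self.patch, self.embed)
        return nn.functional.normalize(feats, dim=-1)   # unit-norm instance embeddings


def decode(tokens: torch.Tensor, geo: GeometryHead, inst: InstanceHead):
    """Run both heads on the same unified tokens; clustering the instance embeddings
    (e.g. by cosine similarity) would yield 3D-consistent instance masks."""
    return geo(tokens), inst(tokens)
```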
The first instance-aware 3D reconstruction model! NTU & 阶越 propose an instance-decoupled 3D reconstruction model to aid scene understanding
量子位· 2025-10-31 04:09
Core Insights
- The article discusses the challenge AI faces in simultaneously understanding the geometric structure and semantic content of the 3D world, something humans perceive naturally. Traditional methods separate 3D reconstruction from spatial understanding, which introduces errors and limits generalization. IGGT (Instance-Grounded Geometry Transformer) aims to unify the two processes in a single model [1][2].

Group 1: IGGT Framework
- IGGT is an end-to-end unified framework that integrates spatial reconstruction and instance-level contextual understanding within a single model [2].
- A new large-scale dataset, InsScene-15K, has been created, containing 15,000 scenes and 200 million images with high-quality, 3D-consistent instance-level masks [2][5].
- The model introduces the "Instance-Grounded Scene Understanding" paradigm, generating instance masks that integrate seamlessly with various vision language models (VLMs) and large multimodal models (LMMs) [2][18].

Group 2: Data Collection Process
- The InsScene-15K dataset is constructed through a novel, SAM2-driven data curation process that integrates three different data sources [5].
- Synthetic data generated in simulated environments provides perfectly accurate RGB images, depth maps, camera poses, and object-level segmentation masks [8].
- Real-world video collection uses a custom SAM2 pipeline that generates dense initial mask proposals and propagates them over time, ensuring high temporal consistency [9].
- Real-world RGBD collection applies a mask-optimization step that improves 2D mask quality while preserving 3D ID consistency [10].

Group 3: Model Architecture
- The IGGT architecture consists of a unified Transformer that processes image tokens through attention modules to build a powerful unified token representation [14].
- It features dual decoding heads for geometry and instance predictions, with a cross-modal fusion block that enhances spatial perception [17].
- The model uses a multi-view contrastive loss to learn 3D-consistent instance features from 2D inputs (a hedged sketch of such a loss follows below) [15].

Group 4: Performance and Applications
- IGGT is the first model to perform reconstruction, understanding, and tracking simultaneously, with significant improvements on understanding and tracking metrics [18].
- On instance 3D tracking, IGGT reaches a tracking IoU of 70% and a success rate of 90%, and is the only model able to keep tracking objects that disappear and reappear [19].
- The model supports multiple applications, including instance spatial tracking, open-vocabulary semantic segmentation, and QA scene grounding, allowing complex object-centric queries over 3D scenes [23][30].
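The multi-view contrastive loss mentioned in Group 3 can be sketched as a cross-view objective over sampled pixel embeddings: embeddings carrying the same instance ID in two views are treated as positives. The sampling scheme and loss form below are assumptions for illustration, not IGGT's actual training objective.

```python
# Sketch of a multi-view instance contrastive loss: same-instance embeddings across two
# views are pulled together, different instances pushed apart. Illustrative only.
import torch
import torch.nn.functional as F


def multiview_instance_contrastive(emb_a: torch.Tensor, ids_a: torch.Tensor,
                                   emb_b: torch.Tensor, ids_b: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """
    emb_a, emb_b: (Na, D) and (Nb, D) unit-normalized embeddings sampled from two views.
    ids_a, ids_b: (Na,) and (Nb,) integer instance IDs that are consistent across views.
    """
    logits = emb_a @ emb_b.t() / temperature                          # (Na, Nb) similarities
    positive = ids_a.unsqueeze(1).eq(ids_b.unsqueeze(0)).float()      # same-instance pairs
    log_prob = F.log_softmax(logits, dim=1)
    pos_count = positive.sum(dim=1)
    has_pos = pos_count > 0                                           # skip anchors with no positive
    loss_per_anchor = -(log_prob * positive).sum(dim=1)[has_pos] / pos_count[has_pos]
    return loss_per_anchor.mean()
```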
Tencent open-sources Hunyuan World Model 1.1: videos become 3D worlds in seconds, with single-GPU inference in just one second
量子位· 2025-10-22 09:12
Core Viewpoint
- Tencent has released and open-sourced Hunyuan World Model 1.1, a unified end-to-end 3D reconstruction model that generates 3D worlds from multiple views or videos with high precision and efficiency [1][3][16].

Group 1: Model Features
- Hunyuan World Model 1.1 is described as the industry's first unified feedforward 3D reconstruction model, handling various input modalities and producing multiple outputs simultaneously while achieving state-of-the-art (SOTA) performance [4][18][21].
- The model supports flexible input handling, allowing camera poses, intrinsic parameters, and depth maps to be supplied as optional conditioning to improve reconstruction quality (a hedged input-handling sketch follows below) [18][20].
- It deploys on a single GPU with roughly one-second inference, far faster than traditional methods that can take minutes or hours [22][24].

Group 2: Performance Comparison
- In comparisons with Meta's MapAnything and with AnySplat, Hunyuan World Model 1.1 produced smoother surfaces and more regular scene structure in 3D point cloud reconstruction tasks [11][12][14].
- The model excels in both geometric accuracy and detail restoration, yielding more stable and realistic scene reconstructions than its competitors [14][15].

Group 3: User Accessibility
- The model is fully open-sourced: developers can clone it from GitHub and deploy it locally, while ordinary users can generate 3D scenes online from uploaded images or videos [34][37].
- The technology aims to make 3D reconstruction widely accessible, letting anyone create professional-level 3D scenes in seconds [37].
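The flexible-input behavior described in Group 1 amounts to conditioning a feedforward reconstructor on whatever modalities happen to be available. The sketch below is illustrative only; the names and shapes are assumptions, not the Hunyuan World Model 1.1 API.

```python
# Sketch of optional conditioning for a feedforward reconstructor. Names/shapes assumed.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class ReconstructionInputs:
    images: np.ndarray                        # (V, H, W, 3) multi-view or video frames
    poses: Optional[np.ndarray] = None        # (V, 4, 4) camera-to-world matrices, if known
    intrinsics: Optional[np.ndarray] = None   # (V, 3, 3) calibration matrices, if known
    depths: Optional[np.ndarray] = None       # (V, H, W) depth maps, if available


def build_conditioning(inputs: ReconstructionInputs) -> dict:
    """Assemble only the modalities that were actually supplied; a flexible model would
    treat missing entries as 'predict this yourself' rather than as an error."""
    cond = {"images": inputs.images}
    for name in ("poses", "intrinsics", "depths"):
        value = getattr(inputs, name)
        if value is not None:
            cond[name] = value
    return cond
```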
HIT & Li Auto's PAGS: a new SOTA for closed-loop autonomous-driving simulation!
自动驾驶之心· 2025-10-17 16:04
Core Viewpoint
- The article discusses advances in 3D scene reconstruction for dynamic urban environments and introduces PAGS, which tackles inefficient resource allocation by prioritizing the semantic elements that matter most for driving safety [1][22].

Research Background and Core Issues
- Dynamic, large-scale urban 3D reconstruction is essential for autonomous driving systems, supporting simulation testing and digital-twin applications [1].
- Existing methods hit a resource-allocation bottleneck: they fail to distinguish critical elements (e.g., pedestrians, vehicles) from non-critical ones (e.g., distant buildings) [1].
- As a result, computation is wasted on non-critical details while the fidelity of critical objects suffers [1].

Core Method Design
- PAGS embeds task-aware semantic priorities into the reconstruction and rendering process through three main modules (a hedged pruning sketch follows this summary):
  1. A combined Gaussian scene representation [4].
  2. Semantic-guided pruning [5].
  3. A priority-driven rendering pipeline [6].

Experimental Validation and Results Analysis
- Experiments on the Waymo and KITTI datasets measure reconstruction fidelity and efficiency against mainstream methods [12].
- Quantitatively, PAGS reaches a PSNR of 34.63 at 353 FPS, clearly outperforming other methods in both fidelity and speed [17][22].
- The model occupies 530 MB and uses 6.1 GB of VRAM, making it suitable for in-vehicle hardware [17].

Conclusion
- Through semantic-guided resource allocation and priority-driven rendering acceleration, PAGS breaks the usual trade-off between fidelity and efficiency in dynamic driving-scene 3D reconstruction [22].
- The method concentrates computation on critical objects, boosting rendering speed while maintaining high fidelity [23].
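Semantic-guided pruning can be illustrated by letting each Gaussian's pruning threshold depend on the priority of its semantic class. The priority table and thresholds below are assumptions for illustration, not the actual PAGS configuration.

```python
# Sketch of semantic-guided pruning: low-priority classes are pruned more aggressively
# than safety-critical ones. Priority weights and thresholds are illustrative assumptions.
import numpy as np

# Hypothetical priority weights: higher means "keep more detail".
CLASS_PRIORITY = {"pedestrian": 1.0, "vehicle": 0.9, "traffic_sign": 0.8,
                  "road": 0.5, "vegetation": 0.3, "distant_building": 0.2}


def semantic_guided_prune(importance: np.ndarray, classes: np.ndarray,
                          base_threshold: float = 0.01) -> np.ndarray:
    """
    importance: (N,) per-Gaussian importance (e.g. accumulated opacity contribution).
    classes:    (N,) array of class-name strings, one per Gaussian.
    Returns a boolean keep-mask: the effective threshold is relaxed for high-priority
    classes and tightened for low-priority ones.
    """
    priority = np.array([CLASS_PRIORITY.get(c, 0.5) for c in classes])
    threshold = base_threshold * (1.5 - priority)   # high priority -> lower threshold
    return importance > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    imp = rng.random(6) * 0.05
    cls = np.array(["pedestrian", "vehicle", "road", "vegetation",
                    "distant_building", "pedestrian"])
    print(semantic_guided_prune(imp, cls))   # keep-mask over the toy Gaussians
```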
Foundation models for autonomous driving should be capability-oriented, not limited to the methods themselves
自动驾驶之心· 2025-09-16 23:33
Core Insights
- The article discusses how foundation models are transforming autonomous driving perception, shifting from task-specific deep learning models to versatile architectures trained on vast, diverse datasets [2][4].
- It introduces a classification framework built around four core capabilities needed for robust performance in dynamic driving environments: general knowledge, spatial understanding, multi-sensor robustness, and temporal reasoning [2][5].

Group 1: Introduction and Background
- Autonomous driving perception enables vehicles to interpret their surroundings in real time, covering key tasks such as object detection, semantic segmentation, and tracking [3].
- Traditional task-specific models scale poorly and generalize badly, particularly in "long-tail scenarios" where rare but critical events occur [3][4].

Group 2: Foundational Models
- Foundation models, trained with self-supervised or unsupervised strategies on large-scale datasets, learn general representations that transfer to many downstream tasks [4][5].
- They offer clear advantages for autonomous driving thanks to inherent generalization, efficient transfer learning, and reduced reliance on labeled data [4][5].

Group 3: Key Capabilities
- The four key dimensions for designing foundation models tailored to autonomous driving perception are:
  1. General knowledge: adapting to a wide range of driving scenarios, including rare situations [5][6].
  2. Spatial understanding: a deep grasp of 3D spatial structure and relationships [5][6].
  3. Multi-sensor robustness: maintaining high performance under varying environmental conditions and sensor failures [5][6].
  4. Temporal reasoning: capturing temporal dependencies and predicting future states of the environment [6].

Group 4: Integration and Challenges
- The article outlines three mechanisms for integrating foundation models into autonomous driving stacks: feature-level distillation, pseudo-label supervision, and direct integration (a hedged distillation sketch follows below) [37][40].
- It also highlights deployment challenges, including effective domain adaptation, hallucination risks, and real-time efficiency [58][61].

Group 5: Future Directions
- The article emphasizes advancing research on foundation models to make them safer and more effective in autonomous driving systems, addressing current limitations and exploring new methodologies [2][5][58].
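Feature-level distillation, the first of the three integration mechanisms, can be sketched as training a compact student backbone to reproduce a frozen foundation model's features. The architectures and the cosine loss below are assumptions for illustration, not a specific method from the survey.

```python
# Sketch of feature-level distillation from a frozen foundation-model teacher into a
# lightweight, deployable student backbone. Architectures and loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureDistiller(nn.Module):
    def __init__(self, teacher: nn.Module, student: nn.Module,
                 student_dim: int, teacher_dim: int):
        super().__init__()
        self.teacher = teacher.eval()                    # frozen foundation model
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.student = student                           # compact perception backbone
        self.proj = nn.Conv2d(student_dim, teacher_dim, kernel_size=1)  # align channels

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            t_feat = self.teacher(images)                # (B, Ct, H, W) teacher features
        s_feat = self.proj(self.student(images))         # (B, Ct, H', W') student features
        s_feat = F.interpolate(s_feat, size=t_feat.shape[-2:], mode="bilinear",
                               align_corners=False)      # match spatial resolution
        # Cosine distillation loss over the channel vector at each spatial location.
        return 1.0 - F.cosine_similarity(s_feat, t_feat, dim=1).mean()
```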
ICCV 2025 | RobustSplat: transient-robust 3DGS reconstruction that decouples densification from dynamics
机器之心· 2025-08-19 09:45
Core Viewpoint
- The article presents RobustSplat, which addresses the difficulty 3D Gaussian Splatting (3DGS) has with dynamic objects by introducing a delayed Gaussian growth strategy and a scale-cascade mask guidance method to reduce rendering artifacts caused by transient objects [2][21].

Research Motivation
- The motivation comes from the dual role of Gaussian densification in 3DGS: it enriches scene detail but also risks overfitting dynamic regions, producing artifacts and scene distortion. The goal is to balance static-structure representation against suppression of dynamic interference [6][8].

Methodology
- Transient mask estimation: a Mask MLP with two linear layers outputs pixel-wise transient masks that distinguish transient regions from static ones (a hedged sketch appears after this summary) [9].
- Feature selection: DINOv2 features are chosen for their balance of semantic consistency, noise resistance, and computational efficiency, outperforming alternatives such as Stable Diffusion and SAM features [10].
- Supervision design: an image residual loss is combined with a feature cosine-similarity loss to optimize the mask MLP and improve recognition of dynamic regions [12].
- Delayed Gaussian growth strategy: the core strategy postpones densification so that the static scene structure is optimized first, reducing the risk of misclassifying static areas as transient [13].
- Scale-cascade mask guidance: transient masks are first estimated from low-resolution features, then refined with high-resolution supervision for more accurate predictions [14].

Experimental Results
- On the NeRF On-the-go and RobustNeRF datasets, RobustSplat outperforms baselines such as 3DGS, SpotLessSplats, and WildGaussians across PSNR, SSIM, and LPIPS [16][21].

Summary
- RobustSplat's strategies effectively reduce rendering artifacts caused by transient objects, delivering superior performance in complex scenes with dynamic elements while preserving detail [19][21].
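The transient-mask component can be sketched as a two-layer MLP over per-pixel features, supervised jointly by a photometric residual cue and a feature cosine-similarity cue. The exact feature handling, loss form, and weights below are assumptions, not the RobustSplat implementation.

```python
# Sketch of a transient mask MLP plus a combined residual/feature-similarity supervision.
# Feature handling, pseudo-labels, and weighting are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransientMaskMLP(nn.Module):
    """Two linear layers: per-pixel feature vector -> transient probability in [0, 1]."""

    def __init__(self, feat_dim: int = 384, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (H, W, C) per-pixel features; returns (H, W) transient probability.
        return torch.sigmoid(self.net(feats)).squeeze(-1)


def mask_supervision_loss(mask: torch.Tensor, rendered: torch.Tensor, target: torch.Tensor,
                          feat_render: torch.Tensor, feat_target: torch.Tensor,
                          w_feat: float = 0.5) -> torch.Tensor:
    """Pixels with a large photometric residual or dissimilar features are likely transient,
    so both cues provide soft pseudo-labels for the predicted mask."""
    residual = (rendered - target).abs().mean(dim=-1)                              # (H, W)
    feat_dissim = (1 - F.cosine_similarity(feat_render, feat_target, dim=-1)) / 2  # (H, W) in [0, 1]
    photo_label = torch.clamp(residual / (residual.mean() + 1e-6), 0.0, 1.0)
    loss_photo = F.binary_cross_entropy(mask, photo_label.detach())
    loss_feat = F.binary_cross_entropy(mask, feat_dissim.detach())
    return loss_photo + w_feat * loss_feat
```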
Casual snapshots enable VR cloud tourism! Stable 3D reconstruction and novel view synthesis from pose-free, sparse images | HKUST (Guangzhou)
量子位· 2025-07-31 04:23
Core Viewpoint
- RegGS, a new algorithm from the Hong Kong University of Science and Technology (Guangzhou), reconstructs 3D models from sparse 2D images without precise camera poses, achieving centimeter-level accuracy suitable for VR applications [2][4].

Group 1: Methodology
- RegGS combines feed-forward Gaussian representations with structural registration to handle sparse, pose-free images, offering a new route to practical 3D reconstruction [6][8].
- The core mechanism registers local 3D Gaussian mixture models into a gradually assembled global 3D scene, avoiding reliance on traditional Structure from Motion (SfM) initialization and requiring fewer input images (a hedged sketch of a Gaussian W2 registration cost follows below) [8][12].

Group 2: Experimental Results
- On the RE10K and ACID datasets, RegGS outperformed existing mainstream methods across various input frame counts (2x/8x/16x/32x) on PSNR, SSIM, and LPIPS [9][12].

Group 3: Applications
- RegGS addresses the "sparse + pose-free" setting, which has significant real-world applications:
  - 3D reconstruction from user-generated content (UGC) videos, which often lack camera parameters [13].
  - Drone aerial mapping, where it is robust to large viewpoint changes and low frame rates [13].
  - Restoration of historical images and documents, enabling 3D reconstruction from a few photos taken at different angles [13].
- Compared with traditional SfM or bundle adjustment, RegGS needs less structured input and is more practical for unstructured data [13].

Group 4: Limitations and Future Directions
- Performance and efficiency are currently bounded by the quality of the upstream feed-forward model and the computational cost of the MW2 distance calculation, pointing to areas for future optimization [13].
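The registration of local Gaussian mixtures rests on a Wasserstein-type cost between Gaussian components. The sketch below uses the well-known closed-form W2 distance between two Gaussians as the pairwise cost; treating it as the ground cost for a mixture-level alignment is an illustrative assumption, and the exact MW2 formulation and solver used by RegGS may differ.

```python
# Closed-form squared Wasserstein-2 distance between two Gaussians:
#   W2^2 = ||m1 - m2||^2 + Tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})
# Using it as a pairwise cost between mixture components is an illustrative assumption.
import numpy as np
from scipy.linalg import sqrtm


def gaussian_w2_sq(m1: np.ndarray, S1: np.ndarray, m2: np.ndarray, S2: np.ndarray) -> float:
    """Squared W2 distance between N(m1, S1) and N(m2, S2) in closed form."""
    root_S2 = sqrtm(S2)
    cross = sqrtm(root_S2 @ S1 @ root_S2)
    bures = np.trace(S1 + S2 - 2 * np.real(cross))   # Bures term between covariances
    return float(np.sum((m1 - m2) ** 2) + bures)


def pairwise_w2_cost(means_a, covs_a, means_b, covs_b) -> np.ndarray:
    """Cost matrix between components of two 3D Gaussian mixtures; feeding it to an
    optimal-transport solver (with mixture weights as marginals) yields a mixture-level
    alignment cost between two local reconstructions."""
    cost = np.zeros((len(means_a), len(means_b)))
    for i, (ma, Sa) in enumerate(zip(means_a, covs_a)):
        for j, (mb, Sb) in enumerate(zip(means_b, covs_b)):
            cost[i, j] = gaussian_w2_sq(ma, Sa, mb, Sb)
    return cost
```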
Fei-Fei Li's spatial intelligence unicorn open-sources its underlying technology! AI-generated 3D worlds run smoothly on every device; the "shader" of spatial intelligence is here
量子位· 2025-06-03 04:26
Core Viewpoint
- World Labs, co-founded by Fei-Fei Li, has open-sourced a core technology called Forge, a real-time 3D Gaussian Splatting renderer that runs smoothly across devices, from desktops to low-power mobile hardware and XR [1][6].

Group 1: Technology Overview
- Forge is a web-based 3D Gaussian Splatting renderer that integrates with three.js, enabling fully dynamic and programmable Gaussian splatting [2].
- Its underlying design is optimized for the GPU, playing a role similar to the "shaders" of traditional 3D graphics [3].
- According to World Labs co-founder Ben Mildenhall, the technology lets developers handle AI-generated 3D worlds as easily as manipulating triangle meshes [5].

Group 2: Features and Capabilities
- Forge requires minimal code to get running and supports multiple splat objects, multiple cameras, and real-time animation and editing [4].
- It is designed as a programmable 3D Gaussian Splatting engine, giving unprecedented control over how 3D Gaussian splats are generated, animated, and rendered [8].
- The renderer sorts splats with a painter's algorithm, which is central to its design (a hedged back-to-front sorting sketch follows below) [13].

Group 3: Rendering Process
- The key component managing rendering is ForgeRenderer, which gathers the complete list of splats in a three.js scene and determines the drawing order with an efficient bucket sort [14].
- Forge supports multi-view rendering through additional ForgeViewpoint objects, allowing simultaneous rendering from different perspectives [15].

Group 4: Future Plans
- World Labs aims to lift multimodal AI from 2D pixel planes into full 3D worlds and plans to launch its first product in 2025 [17].
- The company intends to build tools for artists, designers, developers, filmmakers, and engineers, targeting customers ranging from video game developers to film studios [17].
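The back-to-front ordering described above can be illustrated with a depth bucket (counting) sort. The Python sketch below is purely conceptual; it is not Forge's implementation or API, which run on the GPU and are exposed through three.js.

```python
# Conceptual sketch of painter's-algorithm ordering via a depth bucket sort:
# splats are quantized by camera-space depth and drawn far-to-near. Illustrative only.
import numpy as np


def back_to_front_order(centers: np.ndarray, cam_pos: np.ndarray, cam_forward: np.ndarray,
                        num_buckets: int = 1024) -> np.ndarray:
    """
    centers:     (N, 3) splat center positions in world space.
    cam_pos:     (3,) camera position; cam_forward: (3,) unit view direction.
    Returns splat indices ordered far-to-near using a linear-time counting sort on
    quantized depth, instead of a full comparison sort.
    """
    depth = (centers - cam_pos) @ cam_forward                 # signed depth along the view axis
    lo, hi = depth.min(), depth.max()
    scale = (num_buckets - 1) / max(hi - lo, 1e-9)
    bucket = ((depth - lo) * scale).astype(np.int64)          # quantize depth into buckets

    counts = np.bincount(bucket, minlength=num_buckets)
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))    # start offset of each bucket
    order = np.empty(len(bucket), dtype=np.int64)
    offsets = starts.copy()
    for i, b in enumerate(bucket):                            # stable placement into buckets
        order[offsets[b]] = i
        offsets[b] += 1
    return order[::-1]                                        # reverse: far-to-near


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(5, 3))
    print(back_to_front_order(pts, np.zeros(3), np.array([0.0, 0.0, 1.0])))
```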