3D Reconstruction
Exclusive conversation with Remy founder Wang Zhengnan: how a hit app that surpassed WeChat in downloads was born
虎嗅APP· 2026-02-14 09:18
The following article is from AGI接口 (author: Chen Yifan); AGI接口 covers the wealth storm stirred up by AI. One million users in nine days: a 20-person team's "DeepSeek moment". Produced by Huxiu Tech Group | Authors: Chen Yifan, Li Yifei | Editor: Miao Zhengqing | Header image: provided by KIRI. "AI Native 100" is Huxiu Tech Group's column on AI-native innovation; this is the 44th article in the series.

In recent days, KIRI founder Wang Zhengnan has once again stood at the eye of a compute hurricane. On February 10, Remy, KIRI's app, launched a new feature and traffic surged: roughly 200 tasks are processed every 10 minutes, and each large-scene 3D reconstruction takes about 40 minutes, which means Remy needs around 800 GPUs running without pause to handle 3D reconstruction jobs arriving from across the country.

Rewind the clock to October 22, 2025, when Remy first launched in China. If one word could describe the following 96 hours, it would be "out of control". In just 9 days, Remy's user count broke through the 1 million mark; for reference, even the much-watched Sora 2 took 5 days to reach this milestone. In the first three days after release, the app's downloads even briefly left WeChat behind. Meanwhile, a striking 3D real-scene video spread virally across social networks, passing 1 billion plays on Douyin. This was Wang Zhengnan's team drawing on Insta360's ...
Segmenting everything is not enough; now reconstruct everything in 3D: SAM 3D is here
具身智能之心· 2025-11-21 00:04
Core Viewpoint
- Meta has launched significant updates with the introduction of SAM 3D and SAM 3, extending image understanding into 3D and providing advanced capabilities for object detection, segmentation, and tracking in images and videos [2][6][40].

Group 1: SAM 3D Overview
- SAM 3D is the latest addition to the SAM series, featuring two models, SAM 3D Objects and SAM 3D Body, both demonstrating state-of-the-art performance in converting 2D images into detailed 3D reconstructions [2][4].
- SAM 3D Objects allows users to generate 3D models from a single image, overcoming limitations of traditional 3D modeling, which often relies on isolated or synthetic data [11][15].
- Meta has annotated nearly 1 million real-world images, generating approximately 3.14 million 3D meshes, using a scalable data engine to improve both the quality and quantity of 3D data [20][26].

Group 2: SAM 3D Body
- SAM 3D Body focuses on accurate 3D human pose and shape reconstruction from single images, maintaining high quality even in complex scenarios with occlusions and unusual poses [28][30].
- The model is interactive, allowing users to guide and control predictions, which improves accuracy and usability [29].
- A high-quality training dataset of around 8 million images was created to improve the model's performance across various 3D benchmarks [33].

Group 3: SAM 3 Capabilities
- SAM 3 introduces promptable concept segmentation, enabling the model to detect and segment every instance of a concept specified by a text or example-image prompt (a toy illustration follows this entry), significantly improving concept recognition [40][42].
- The architecture builds on previous advances, using components such as the Meta Perception Encoder and DETR for stronger image recognition and object detection [42][44].
- SAM 3 achieves a twofold increase in cgF1 scores for concept recognition and maintains near real-time performance on images with over 100 detection targets, completing inference in approximately 30 milliseconds on an H200 GPU [44].
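To make the idea of promptable concept segmentation concrete, here is a minimal, runnable toy in Python. It is not Meta's SAM 3 API; the `segment_by_concept` helper and the toy detection list are illustrative assumptions showing only the output behaviour described above: a short text prompt returns a mask for every matching instance rather than a single prompted object.

```python
# A minimal, runnable illustration (not Meta's API) of what a "concept prompt"
# returns compared with a point/box prompt: one mask per matching instance.
from typing import List, Dict
import numpy as np

def segment_by_concept(detections: List[Dict], prompt: str) -> List[np.ndarray]:
    """Given pre-computed detections [{'label': str, 'mask': HxW bool array}, ...],
    return the masks of every instance whose label matches the text prompt."""
    return [d["mask"] for d in detections if d["label"] == prompt]

# Toy data: two cups and one plate in a tiny 2x2 image.
toy = [
    {"label": "cup",   "mask": np.array([[1, 1], [0, 0]], dtype=bool)},
    {"label": "cup",   "mask": np.array([[0, 0], [1, 1]], dtype=bool)},
    {"label": "plate", "mask": np.array([[0, 1], [1, 0]], dtype=bool)},
]
print(len(segment_by_concept(toy, "cup")))  # -> 2: a concept prompt returns *all* instances
```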
Xie Saining praises ByteDance Seed's new research! A single Transformer handles 3D reconstruction from arbitrary views
量子位· 2025-11-18 05:02
Core Insights
- The article covers the latest research from ByteDance's Seed team, Depth Anything 3 (DA3), which has drawn high praise from experts such as Xie Saining [1]
- DA3 simplifies 3D reconstruction by using a single visual transformer to estimate depth and recover camera poses from a variety of input formats, including single images, multi-view photos, and videos [2][7]

Performance Improvements
- DA3 shows significant gains, with an average 35.7% improvement in camera localization accuracy and a 23.6% improvement in geometric reconstruction accuracy over previous models [3]
- The model also surpasses its predecessor, Depth Anything 2 (DA2), in monocular depth estimation [3]

Architectural Design
- DA3's architecture is deliberately simple, using a single visual transformer and focusing on two core prediction targets: depth and camera rays [7]
- The workflow consists of four main stages, starting with input processing, where multi-view images are turned into feature tokens and camera parameters are integrated when available [9]
- The core of the model is a single plain transformer (vanilla DINO) that alternates within-view and cross-view self-attention, letting the same backbone handle different input formats [9]

Training Methodology
- DA3 uses a teacher-student distillation strategy: a more powerful teacher model generates high-quality pseudo-labels on large unlabelled datasets, which then guide the student model (DA3) during training (a minimal sketch follows this entry) [13]
- This allows diverse data to be used effectively while reducing reliance on high-precision annotations, so training covers a broader range of scenarios [14]

Evaluation and Applications
- DA3 is robust in practice: it can estimate camera parameters for each frame of a video and reconstruct the camera motion trajectory [16]
- Depth maps produced by DA3, combined with the recovered camera poses, yield denser, lower-noise 3D point clouds than traditional methods [17]
- The model can also synthesize images from unseen angles through view completion, suggesting applications in virtual tourism and digital twins [19]

Team Background
- The Depth Anything 3 project is led by Kang Bingyi, a post-95 researcher at ByteDance focusing on computer vision and multimodal models [20]
- Kang completed his undergraduate studies at Zhejiang University in 2016 and pursued a master's and PhD in artificial intelligence at UC Berkeley and the National University of Singapore [23]
- He previously interned at Facebook AI Research and has collaborated with notable figures in the field [24]
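The teacher-student pseudo-labelling described above can be sketched in a few lines of PyTorch. This is a minimal sketch under assumed components, not ByteDance's training code: the two `nn.Conv2d` layers stand in for the frozen teacher and the DA3-style student, and a plain L1 loss stands in for whatever depth losses the authors actually use.

```python
# Teacher-student pseudo-label distillation for depth (illustrative stand-ins only).
import torch
import torch.nn as nn

teacher = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for a large frozen depth model
student = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for a DA3-like student
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
images = torch.rand(4, 3, 64, 64)                    # unlabelled batch, no human annotation

with torch.no_grad():
    pseudo_depth = teacher(images)                   # teacher produces dense pseudo-labels

pred = student(images)
loss = nn.functional.l1_loss(pred, pseudo_depth)     # scale-invariant losses are common in practice
loss.backward()
optimizer.step()
```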
The first 3D reconstruction model with instance understanding: NTU & 阶越 propose an instance-decoupled 3D reconstruction model to aid scene understanding
36Ke· 2025-10-31 08:28
Core Insights
- The article discusses the challenge AI faces in perceiving 3D geometry and semantic content at the same time, highlighting the limitations of traditional methods that separate 3D reconstruction from spatial understanding. A new approach, IGGT (Instance-Grounded Geometry Transformer), integrates both aspects in a unified model for improved performance across tasks [1].

Group 1: IGGT Model Development
- IGGT is an end-to-end large unified Transformer that combines spatial reconstruction and instance-level contextual understanding in a single model [1].
- The model is built on a new large-scale dataset, InsScene-15K, which includes 15,000 scenes and 200 million images with high-quality, 3D-consistent instance-level masks [2].
- IGGT introduces the "Instance-Grounded Scene Understanding" paradigm, allowing it to operate independently of any specific vision language model (VLM) and to integrate seamlessly with various VLMs and large multimodal models (LMMs) [3].

Group 2: Applications and Capabilities
- The unified representation significantly expands downstream capabilities, supporting spatial tracking, open-vocabulary segmentation, and scene question answering (QA) [4].
- The architecture includes a Geometry Head for predicting camera parameters and depth maps and an Instance Head for decoding instance features, enhancing spatial perception [11][18].
- IGGT achieves strong results in instance 3D tracking, with tracking IoU and success rates reaching 70% and 90%, respectively, and is the only model able to keep tracking objects that disappear and reappear [16].

Group 3: Data Collection and Processing
- The InsScene-15K dataset is built through a novel data curation process that integrates three data sources: synthetic data, real-world video capture, and RGBD capture [6][9][10].
- Synthetic data is generated in simulated environments, providing perfectly accurate segmentation masks, while real-world data is processed through a custom pipeline that enforces temporal consistency [8][9].

Group 4: Performance Comparison
- IGGT outperforms existing models on reconstruction, understanding, and tracking tasks, with especially large gains on understanding and tracking metrics [16].
- The model's instance masks can serve as prompts for VLMs, enabling open-vocabulary semantic segmentation and complex object-centric question answering [19][24].
The first 3D reconstruction model with instance understanding! NTU & 阶越 propose an instance-decoupled 3D reconstruction model to aid scene understanding
量子位· 2025-10-31 04:09
Core Insights
- The article discusses the challenge AI faces in simultaneously understanding the geometric structure and semantic content of the 3D world, something humans perceive naturally. Traditional methods separate 3D reconstruction from spatial understanding, which leads to errors and limited generalization. IGGT (Instance-Grounded Geometry Transformer) aims to unify these processes in a single model [1][2].

Group 1: IGGT Framework
- IGGT is an end-to-end unified framework that integrates spatial reconstruction and instance-level contextual understanding in a single model [2].
- A new large-scale dataset, InsScene-15K, was created, containing 15,000 scenes and 200 million images with high-quality, 3D-consistent instance-level masks [2][5].
- The model introduces the "Instance-Grounded Scene Understanding" paradigm, generating instance masks that integrate seamlessly with various vision language models (VLMs) and large multimodal models (LMMs) [2][18].

Group 2: Data Collection Process
- The InsScene-15K dataset is built through a novel, SAM2-driven data curation process that integrates three different data sources [5].
- Synthetic data is generated in simulated environments, providing perfectly accurate RGB images, depth maps, camera poses, and object-level segmentation masks [8].
- Real-world video collection uses a custom SAM2 pipeline that generates dense initial mask proposals and propagates them over time, ensuring high temporal consistency [9].
- Real-world RGBD collection uses a mask optimization step to improve 2D mask quality while preserving 3D ID consistency [10].

Group 3: Model Architecture
- The IGGT architecture consists of a unified transformer that processes image tokens through attention modules to build a powerful unified token representation [14].
- It features dual decoding heads for geometry and instance predictions, with a cross-modal fusion block to strengthen spatial perception [17].
- The model uses a multi-view contrastive loss to learn 3D-consistent instance features from 2D inputs (a sketch of such an objective follows this entry) [15].

Group 4: Performance and Applications
- IGGT is the first model able to perform reconstruction, understanding, and tracking at the same time, with significant improvements on understanding and tracking metrics [18].
- In instance 3D tracking, IGGT reaches a tracking IoU of 70% and a success rate of 90%, and is the only model able to track objects that disappear and reappear [19].
- The model supports multiple applications, including instance spatial tracking, open-vocabulary semantic segmentation, and scene-grounded question answering, allowing complex object-centric queries in 3D scenes [23][30].
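A multi-view contrastive objective over instance IDs could look like the following PyTorch sketch. The formulation is an assumption in the spirit of supervised contrastive learning, not the authors' exact loss: features of pixels that share an instance ID across views are pulled together, and all other pairs are pushed apart.

```python
# Assumed multi-view instance-contrastive loss (supervised-contrastive style).
import torch
import torch.nn.functional as F

def multiview_instance_contrastive(feats: torch.Tensor, ids: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """feats: (N, D) features gathered from several views; ids: (N,) instance IDs
    that are consistent across those views (same object -> same ID)."""
    feats = F.normalize(feats, dim=-1)
    logits = feats @ feats.t() / temperature               # (N, N) pairwise similarity
    self_mask = torch.eye(len(ids), dtype=torch.bool)
    pos = (ids[:, None] == ids[None, :]) & ~self_mask      # positives: same ID, not self
    log_prob = F.log_softmax(logits.masked_fill(self_mask, float("-inf")), dim=-1)
    per_anchor = -(pos * log_prob).sum(dim=-1) / pos.sum(dim=-1).clamp(min=1)
    return per_anchor.mean()

# Two views of four instances -> eight feature vectors.
loss = multiview_instance_contrastive(torch.randn(8, 32),
                                      torch.tensor([0, 1, 2, 3, 0, 1, 2, 3]))
```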
Tencent open-sources Hunyuan World Model 1.1: videos become 3D worlds in seconds, with single-GPU inference in just 1 second
量子位· 2025-10-22 09:12
Core Viewpoint
- Tencent has released and open-sourced Hunyuan World Model 1.1, a unified end-to-end 3D reconstruction model that generates 3D worlds from multiple views or videos with high precision and efficiency [1][3][16].

Group 1: Model Features
- Hunyuan World Model 1.1 is the industry's first unified feed-forward 3D reconstruction model, handling various input modalities and producing multiple outputs simultaneously while achieving state-of-the-art (SOTA) performance [4][18][21].
- The model supports flexible input handling, optionally integrating camera poses, intrinsic parameters, and depth maps to improve reconstruction quality (an interface sketch follows this entry) [18][20].
- It runs on a single GPU with one-second inference, significantly faster than traditional methods that can take minutes or hours [22][24].

Group 2: Performance Comparison
- In comparisons with Meta's MapAnything and with AnySplat, Hunyuan World Model 1.1 showed superior surface smoothness and scene regularity in 3D point-cloud reconstruction tasks [11][12][14].
- The model excels in both geometric accuracy and detail recovery, producing more stable and realistic scene reconstructions than its competitors [14][15].

Group 3: User Accessibility
- The model is fully open-sourced: developers can clone it from GitHub and deploy it locally, while ordinary users can generate 3D scenes online from uploaded images or videos [34][37].
- The technology aims to democratize 3D reconstruction, letting anyone create professional-level 3D scenes in seconds [37].
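The "flexible input" idea, where each optional prior further constrains a single feed-forward pass, can be illustrated with a small interface sketch. This is purely an assumed interface; the field names, shapes, and the `reconstruct` stub are illustrative and are not Tencent's actual API.

```python
# Assumed interface sketch: optional geometric priors attached to one forward pass.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class ReconstructionInputs:
    images: np.ndarray                        # (V, H, W, 3) multi-view or video frames
    intrinsics: Optional[np.ndarray] = None   # (V, 3, 3) if calibration is known
    poses: Optional[np.ndarray] = None        # (V, 4, 4) camera-to-world, if available
    depth: Optional[np.ndarray] = None        # (V, H, W) depth priors, if available

def reconstruct(inputs: ReconstructionInputs) -> dict:
    """Single forward pass: each prior, when present, further constrains the output.
    A real model would return point clouds / depth / poses / Gaussians here."""
    return {"used_priors": [k for k in ("intrinsics", "poses", "depth")
                            if getattr(inputs, k) is not None]}

print(reconstruct(ReconstructionInputs(images=np.zeros((2, 4, 4, 3)))))
```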
Harbin Institute of Technology & Li Auto's PAGS: new SOTA for closed-loop autonomous-driving simulation!
自动驾驶之心· 2025-10-17 16:04
Core Viewpoint
- The article covers advances in 3D scene reconstruction for dynamic urban environments, introducing the PAGS method, which addresses inefficient resource allocation by prioritizing the semantic elements most critical to driving safety [1][22].

Research Background and Core Issues
- Dynamic, large-scale urban 3D reconstruction is essential for autonomous driving systems, supporting simulation testing and digital-twin applications [1].
- Existing methods hit a resource-allocation bottleneck: they fail to distinguish critical elements (e.g., pedestrians, vehicles) from non-critical ones (e.g., distant buildings) [1].
- This wastes computation on non-critical details while compromising the fidelity of critical objects [1].

Core Method Design
- PAGS embeds a task-aware semantic priority into the reconstruction and rendering process, with three main modules:
  1. A combined Gaussian scene representation [4].
  2. Semantic-guided pruning (a toy sketch follows this entry) [5].
  3. A priority-driven rendering pipeline [6].

Experimental Validation and Results Analysis
- Experiments on the Waymo and KITTI datasets measure reconstruction fidelity and efficiency against mainstream methods [12].
- Quantitatively, PAGS reaches a PSNR of 34.63 and 353 FPS, significantly outperforming other methods in both fidelity and speed [17][22].
- The model is 530 MB with 6.1 GB of VRAM usage, making it suitable for in-vehicle hardware [17].

Conclusion
- PAGS breaks the usual trade-off between fidelity and efficiency in dynamic driving-scene 3D reconstruction through semantic-guided resource allocation and priority-driven rendering acceleration [22].
- The method concentrates computation on critical objects, boosting rendering speed while maintaining high fidelity [23].
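Semantic-guided pruning can be illustrated with a toy NumPy sketch: Gaussians tagged with non-critical classes must clear a higher contribution threshold to be kept, so faint background Gaussians are pruned while safety-critical ones survive. The class list, thresholds, and `prune` helper are illustrative assumptions, not the PAGS implementation.

```python
# Toy semantic-guided pruning: class-dependent opacity thresholds (assumed values).
import numpy as np

CRITICAL = {"vehicle", "pedestrian", "cyclist"}

def prune(opacity: np.ndarray, classes: list,
          tau_critical: float = 0.005, tau_background: float = 0.05) -> np.ndarray:
    """Return a boolean keep-mask over Gaussians."""
    thresholds = np.array([tau_critical if c in CRITICAL else tau_background
                           for c in classes])
    return opacity > thresholds

keep = prune(np.array([0.01, 0.01, 0.2]), ["pedestrian", "building", "building"])
print(keep)  # [ True False  True ] -- the faint background Gaussian is dropped
```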
Foundation models for autonomous driving should be capability-oriented, not confined to the methods themselves
自动驾驶之心· 2025-09-16 23:33
Core Insights
- The article examines the transformative impact of foundation models on autonomous driving perception, shifting from task-specific deep learning models to versatile architectures trained on vast, diverse datasets [2][4]
- It introduces a new classification framework centered on four core capabilities required for robust performance in dynamic driving environments: general knowledge, spatial understanding, multi-sensor robustness, and temporal reasoning [2][5]

Group 1: Introduction and Background
- Autonomous driving perception must let vehicles interpret their surroundings in real time, covering key tasks such as object detection, semantic segmentation, and tracking [3]
- Traditional models, designed for specific tasks, scale poorly and generalize badly, particularly in "long-tail scenarios" where rare but critical events occur [3][4]

Group 2: Foundation Models
- Foundation models, trained with self-supervised or unsupervised strategies, leverage large-scale datasets to learn general representations applicable to many downstream tasks [4][5]
- These models offer clear advantages for autonomous driving thanks to their generalization ability, efficient transfer learning, and reduced reliance on labeled data [4][5]

Group 3: Key Capabilities
- The four key dimensions for designing foundation models tailored to autonomous driving perception are:
  1. General knowledge: adapting to a wide range of driving scenarios, including rare situations [5][6]
  2. Spatial understanding: deep comprehension of 3D spatial structure and relationships [5][6]
  3. Multi-sensor robustness: maintaining high performance under varying environmental conditions and sensor failures [5][6]
  4. Temporal reasoning: capturing temporal dependencies and predicting future states of the environment [6]

Group 4: Integration and Challenges
- The article outlines three mechanisms for integrating foundation models into autonomous driving stacks: feature-level distillation (sketched after this entry), pseudo-label supervision, and direct integration [37][40]
- It highlights deployment challenges, including effective domain adaptation, hallucination risks, and efficiency in real-time applications [58][61]

Group 5: Future Directions
- The article stresses the importance of advancing research on foundation models to improve their safety and effectiveness in autonomous driving systems, addressing current limitations and exploring new methodologies [2][5][58]
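Of the three integration routes, feature-level distillation is the most mechanical to write down. The following PyTorch sketch uses stand-in convolutional layers and assumed shapes; it is not any specific method from the survey, just the generic pattern of matching the perception backbone's features to those of a frozen foundation encoder, with no extra labels needed.

```python
# Generic feature-level distillation pattern (stand-in modules, assumed shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

foundation = nn.Conv2d(3, 64, 1)   # stand-in for a frozen foundation encoder
backbone = nn.Conv2d(3, 32, 1)     # stand-in for the on-vehicle perception backbone
proj = nn.Conv2d(32, 64, 1)        # projector aligning channel dimensions
foundation.eval()
for p in foundation.parameters():
    p.requires_grad_(False)

x = torch.rand(2, 3, 32, 32)
with torch.no_grad():
    target = foundation(x)          # teacher features, no labels involved
student_feat = proj(backbone(x))
distill_loss = 1 - F.cosine_similarity(student_feat, target, dim=1).mean()
distill_loss.backward()
```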
ICCV 2025 | RobustSplat: transient-resistant 3DGS reconstruction that decouples densification from dynamics
机器之心· 2025-08-19 09:45
Core Viewpoint
- The article presents the RobustSplat method, which addresses the difficulty 3D Gaussian Splatting (3DGS) has with dynamic objects by introducing a delayed Gaussian growth strategy and a scale-cascade mask guidance scheme to reduce rendering artifacts caused by transient objects [2][21].

Research Motivation
- The motivation stems from the dual role of Gaussian densification in 3DGS: it enhances scene detail but also risks overfitting to dynamic areas, producing artifacts and scene distortion. The goal is to balance static structure representation against suppression of dynamic interference [6][8].

Methodology
- Transient mask estimation: a mask MLP with two linear layers outputs pixel-wise transient masks, separating transient from static regions [9].
- Feature selection: DINOv2 features are chosen for their balance of semantic consistency, noise robustness, and computational efficiency, outperforming alternatives such as Stable Diffusion and SAM features [10].
- Supervision design: an image residual loss is combined with a feature cosine-similarity loss to optimize the mask MLP, improving recognition of dynamic areas (a sketch of such a loss follows this entry) [12].
- Delayed Gaussian growth strategy: this core strategy postpones densification so that static scene structure is optimized first, reducing the risk of misclassifying static areas as transient [13].
- Scale-cascade mask guidance: transient masks are first estimated from low-resolution features, then refined with high-resolution supervision for more accurate predictions [14].

Experimental Results
- Experiments on the NeRF On-the-go and RobustNeRF datasets show that RobustSplat outperforms baselines such as 3DGS, SpotLessSplats, and WildGaussians across metrics including PSNR, SSIM, and LPIPS [16][21].

Summary
- RobustSplat effectively reduces rendering artifacts caused by transient objects through these strategies, delivering superior performance in complex scenes with dynamic elements while preserving detail [19][21].
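The supervision design above, combining a rendering residual with a feature-similarity term, might be written as follows. This is an assumed formulation for illustration, not the paper's exact losses: the photometric term is down-weighted on pixels marked transient, while the feature term penalizes masking out pixels whose DINO-style features still agree with the reference.

```python
# Assumed transient-mask supervision: photometric residual + feature-similarity penalty.
import torch
import torch.nn.functional as F

def mask_supervision(render, target, feat_render, feat_target, mask, lam=0.5):
    """mask in [0,1], 1 = transient. Static pixels must reconstruct well; pixels
    whose features still agree with the reference should not be masked out."""
    residual = (render - target).abs().mean(dim=0)                    # (H, W) photometric error
    photo = ((1 - mask) * residual).mean()                            # applied to static pixels only
    feat_sim = F.cosine_similarity(feat_render, feat_target, dim=0)   # (H, W) feature agreement
    reg = (mask * feat_sim.clamp(min=0)).mean()                       # discourage masking consistent pixels
    return photo + lam * reg

H = W = 8
loss = mask_supervision(torch.rand(3, H, W), torch.rand(3, H, W),
                        torch.rand(16, H, W), torch.rand(16, H, W),
                        torch.rand(H, W))
```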
Snap a few casual photos and tour in VR! Stable 3D reconstruction and novel view synthesis from pose-free, sparse images | HKUST (Guangzhou)
量子位· 2025-07-31 04:23
Core Viewpoint
- A new algorithm, RegGS, developed at the Hong Kong University of Science and Technology (Guangzhou), reconstructs 3D models from sparse 2D images without precise camera poses, reaching centimeter-level accuracy suitable for VR applications [2][4].

Group 1: Methodology
- RegGS combines a feed-forward Gaussian representation with structural registration to handle sparse, pose-free images, offering a new path toward practical 3D reconstruction [6][8].
- The core mechanism registers local 3D Gaussian mixture models into a gradually assembled global 3D scene, avoiding reliance on traditional Structure-from-Motion (SfM) initialization and requiring fewer input images [8][12].

Group 2: Experimental Results
- On the RE10K and ACID datasets, RegGS outperformed existing mainstream methods across different input frame counts (2×/8×/16×/32×) on metrics including PSNR, SSIM, and LPIPS [9][12].

Group 3: Applications
- RegGS addresses the "sparse + pose-free" setting, which has significant real-world applications, including:
  - 3D reconstruction from user-generated content (UGC) videos, which often lack camera parameters [13].
  - Drone aerial mapping, where it is robust to large viewpoint changes and low frame rates [13].
  - Restoration of historical images and documents, enabling 3D reconstruction from a few photos taken at different angles [13].
- Compared with traditional SfM or bundle adjustment pipelines, RegGS needs less structural input, making it more practical for unstructured data [13].

Group 4: Limitations and Future Directions
- RegGS's performance and efficiency are currently limited by the quality of the upstream feed-forward model and by the computational cost of the MW2 distance (the closed-form Gaussian W2 cost it builds on is sketched after this entry), indicating areas for future optimization [13].
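The MW2 cost mentioned in the limitations builds on the closed-form squared 2-Wasserstein distance between individual Gaussian components; the NumPy/SciPy sketch below computes that per-pair cost. The discrete optimal-transport step over mixture weights, which completes the MW2 distance, is omitted, and the function name and example values are illustrative.

```python
# Closed-form squared 2-Wasserstein distance between two Gaussian components.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_sq(mu1, cov1, mu2, cov2):
    """W2^2( N(mu1, cov1), N(mu2, cov2) ) =
       ||mu1 - mu2||^2 + Tr( cov1 + cov2 - 2 (cov2^{1/2} cov1 cov2^{1/2})^{1/2} )."""
    sqrt_cov2 = sqrtm(cov2)
    cross = sqrtm(sqrt_cov2 @ cov1 @ sqrt_cov2)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * np.real(cross)))

mu1, mu2 = np.zeros(3), np.ones(3)
cov1, cov2 = np.eye(3), 2 * np.eye(3)
print(gaussian_w2_sq(mu1, cov1, mu2, cov2))  # 3 + 3*(1 + 2 - 2*sqrt(2)) ≈ 3.51
```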