Instance Decoupling
The first instance-understanding 3D reconstruction model: NTU & 阶越 propose an instance-decoupling-based 3D reconstruction model to aid scene understanding
36Kr · 2025-10-31 08:28
Core Insights
- The article discusses the challenges AI faces in perceiving 3D geometry and semantic content, highlighting the limitations of traditional methods that separate 3D reconstruction from spatial understanding. A new approach, IGGT (Instance-Grounded Geometry Transformer), integrates both in a unified model, improving performance across a range of tasks [1]

Group 1: IGGT Model Development
- IGGT is an end-to-end unified Transformer that combines spatial reconstruction and instance-level contextual understanding in a single model [1]
- The model is trained on a new large-scale dataset, InsScene-15K, which covers 15,000 scenes and 200 million images with high-quality, 3D-consistent instance-level masks [2]
- IGGT introduces the "Instance-Grounded Scene Understanding" paradigm, allowing it to operate independently of any specific vision-language model (VLM) and enabling seamless integration with various VLMs and large multimodal models (LMMs) [3]

Group 2: Applications and Capabilities
- The unified representation significantly expands IGGT's downstream capabilities, supporting spatial tracking, open-vocabulary segmentation, and scene question answering (QA) [4]
- The architecture includes a Geometry Head for predicting camera parameters and depth maps, and an Instance Head for decoding instance features, enhancing spatial perception (a hedged sketch of this dual-head pattern follows this summary) [11][18]
- IGGT achieves high performance on instance 3D tracking, with tracking IoU and success rates reaching 70% and 90%, respectively, making it the only model capable of successfully tracking objects that disappear and reappear [16]

Group 3: Data Collection and Processing
- The InsScene-15K dataset is built through a novel data curation process that integrates three data sources: synthetic data, real-world video capture, and RGBD capture [6][9][10]
- Synthetic data generated in simulated environments provides perfectly accurate segmentation masks, while real-world data is processed through a custom pipeline to ensure temporal consistency [8][9]

Group 4: Performance Comparison
- IGGT outperforms existing models on reconstruction, understanding, and tracking tasks, with significant gains on understanding and tracking metrics [16]
- The model's instance masks can serve as prompts for VLMs, enabling open-vocabulary semantic segmentation and complex object-centric question answering [19][24]
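Below is a minimal PyTorch sketch of the dual-head decoding pattern described in Group 2: a shared transformer backbone yields unified tokens, which a geometry head and an instance head decode separately. All module names, token layouts, and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Sketch of the dual-head pattern, NOT the authors' code: a shared
# transformer produces unified tokens; two lightweight heads decode them
# into geometry (camera + depth) and per-patch instance embeddings.
# Module names and sizes are illustrative assumptions.

class GeometryHead(nn.Module):
    def __init__(self, dim: int = 768, num_cam_params: int = 9):
        super().__init__()
        self.cam = nn.Linear(dim, num_cam_params)   # e.g. rotation + translation + intrinsics
        self.depth = nn.Linear(dim, 1)              # per-patch depth value

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, dim); token 0 is treated as a global "camera" token
        cam_params = self.cam(tokens[:, 0])            # (B, num_cam_params)
        depth = self.depth(tokens[:, 1:]).squeeze(-1)  # (B, N-1) patchwise depth
        return cam_params, depth

class InstanceHead(nn.Module):
    def __init__(self, dim: int = 768, embed_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, embed_dim)       # per-patch instance embedding

    def forward(self, tokens: torch.Tensor):
        feats = self.proj(tokens[:, 1:])            # (B, N-1, embed_dim)
        return nn.functional.normalize(feats, dim=-1)  # unit-norm, so masks cluster cleanly

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=4,
)
geo_head, ins_head = GeometryHead(), InstanceHead()

image_tokens = torch.randn(2, 1 + 196, 768)         # (B, 1 global + 14x14 patches, dim)
tokens = backbone(image_tokens)
cam, depth = geo_head(tokens)
inst = ins_head(tokens)
print(cam.shape, depth.shape, inst.shape)           # (2, 9) (2, 196) (2, 196, 64)
```

Note the factorization this buys: the instance embeddings carry no VLM-specific semantics, they only need to be clusterable into masks, which is what lets those masks later act as prompts for any external VLM or LMM.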
The first instance-understanding 3D reconstruction model! NTU & 阶越 propose an instance-decoupling-based 3D reconstruction model to aid scene understanding
量子位 · 2025-10-31 04:09
Core Insights
- The article discusses the challenges AI faces in simultaneously understanding the geometric structure and semantic content of the 3D world, which humans perceive naturally. Traditional methods separate 3D reconstruction from spatial understanding, leading to compounding errors and limited generalization. IGGT (Instance-Grounded Geometry Transformer) is introduced to unify these processes in a single model [1][2]

Group 1: IGGT Framework
- IGGT is an end-to-end unified framework that integrates spatial reconstruction and instance-level contextual understanding within a single model [2]
- A new large-scale dataset, InsScene-15K, has been created, containing 15,000 scenes and 200 million images with high-quality, 3D-consistent instance-level masks [2][5]
- The model introduces the "Instance-Grounded Scene Understanding" paradigm: it generates instance masks that can be seamlessly combined with various vision-language models (VLMs) and large multimodal models (LMMs) [2][18]

Group 2: Data Collection Process
- The InsScene-15K dataset is constructed through a novel SAM2-driven data curation process that integrates three different data sources [5]
- Synthetic data is generated in simulated environments, providing perfectly accurate RGB images, depth maps, camera poses, and object-level segmentation masks [8]
- Real-world video collection uses a custom SAM2 pipeline that generates dense initial mask proposals and propagates them over time, ensuring high temporal consistency [9]
- Real-world RGBD collection applies a mask-optimization step that improves 2D mask quality while maintaining 3D ID consistency [10]

Group 3: Model Architecture
- The IGGT architecture consists of a unified transformer that processes image tokens through attention modules into a single powerful unified token representation [14]
- It features dual decoding heads for geometry and instance predictions, with a cross-modal fusion block that enhances spatial perception [17]
- The model uses a multi-view contrastive loss to learn 3D-consistent instance features from 2D inputs (a toy version is sketched after this summary) [15]

Group 4: Performance and Applications
- IGGT is the first model to perform reconstruction, understanding, and tracking simultaneously, with significant improvements on understanding and tracking metrics [18]
- On instance 3D tracking, IGGT reaches a tracking IoU of 70% and a success rate of 90%, and is the only model able to track objects that disappear and reappear [19]
- The model supports multiple applications, including instance spatial tracking, open-vocabulary semantic segmentation, and grounded scene QA, allowing complex object-centric queries in 3D scenes (see the open-vocabulary labeling sketch below) [23][30]
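To make Group 3's "multi-view contrastive loss" concrete, here is a toy, self-contained version of how 3D-consistent instance features can be learned from 2D views: pixel embeddings that share an instance ID across two views attract, all other pairs repel. The sampling scheme, temperature, and function name are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def multiview_instance_contrastive(feats_a, feats_b, ids_a, ids_b, tau=0.07):
    """Toy multi-view contrastive objective (an assumption of how
    3D-consistent instance features are learned, not the paper's loss).

    feats_*: (N, D) unit-normalized pixel embeddings sampled from each view
    ids_*:   (N,) instance IDs, consistent across the two views
    """
    sim = feats_a @ feats_b.t() / tau               # (N, N) cross-view similarity
    pos = ids_a.unsqueeze(1) == ids_b.unsqueeze(0)  # positives: same instance ID
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-likelihood of positives per anchor (skip anchors with none)
    pos_counts = pos.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos).sum(dim=1) / pos_counts
    return loss[pos.any(dim=1)].mean()

# usage: random pixels sampled from two views of the same scene
feats_a = F.normalize(torch.randn(256, 64), dim=-1)
feats_b = F.normalize(torch.randn(256, 64), dim=-1)
ids_a = torch.randint(0, 10, (256,))
ids_b = torch.randint(0, 10, (256,))
print(multiview_instance_contrastive(feats_a, feats_b, ids_a, ids_b))
```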
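And a hedged sketch of the "masks as prompts" application from Group 4: pool a VLM's per-pixel image features inside each predicted instance mask, then assign whichever candidate label's text embedding is most similar. A real system would use CLIP-style encoders; the random tensors below are placeholders, and label_masks is a hypothetical helper, not an API from the paper.

```python
import torch
import torch.nn.functional as F

def label_masks(pixel_feats, masks, text_embeds, labels):
    # pixel_feats: (H*W, D) image features; masks: (K, H*W) boolean instance masks
    # text_embeds: (L, D) embeddings of candidate open-vocabulary label prompts
    out = []
    for m in masks:
        pooled = F.normalize(pixel_feats[m].mean(dim=0), dim=-1)  # mask-pooled feature
        scores = pooled @ F.normalize(text_embeds, dim=-1).t()    # cosine vs each label
        out.append(labels[scores.argmax().item()])
    return out

H, W, D = 16, 16, 64
pixel_feats = torch.randn(H * W, D)        # placeholder for VLM image features
masks = torch.zeros(2, H * W, dtype=torch.bool)
masks[0, :128], masks[1, 128:] = True, True  # two dummy instance masks
text_embeds = torch.randn(3, D)            # placeholder for VLM text features
print(label_masks(pixel_feats, masks, text_embeds, ["chair", "table", "lamp"]))
```

Because labeling happens per mask rather than per pixel, swapping in a different VLM only changes the embeddings, not the masks, which is the plug-and-play property both summaries emphasize.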