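The First Instance-Aware 3D Reconstruction Model: NTU & StepFun Propose an Instance-Decoupled 3D Reconstruction Model to Advance Scene Understanding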

Core Insights
The article discusses the difficulty AI has in understanding the geometric structure and the semantic content of the 3D world at the same time, something humans do naturally. Traditional methods treat 3D reconstruction and spatial understanding as separate steps, which leads to errors and limited generalization. IGGT (Instance-Grounded Geometry Transformer) aims to unify the two in a single model [1][2].

Group 1: IGGT Framework
- IGGT is an end-to-end framework that integrates spatial reconstruction and instance-level contextual understanding within a single model [2].
- A new large-scale dataset, InsScene-15K, contains 15,000 scenes and 200 million images with high-quality, 3D-consistent instance-level masks [2][5].
- The model introduces the "Instance-Grounded Scene Understanding" paradigm: it generates instance masks that can be plugged into various Vision Language Models (VLMs) and Large Multimodal Models (LMMs) [2][18].

Group 2: Data Collection Process
- InsScene-15K is built with a new SAM2-driven data curation process that integrates three different data sources [5].
- Synthetic data is generated in simulated environments, which provide exact ground truth for RGB images, depth maps, camera poses, and object-level segmentation masks [8].
- Real-world video is processed by a custom SAM2 pipeline that generates dense initial mask proposals and propagates them over time, ensuring high temporal consistency [9].
- Real-world RGBD data goes through a mask optimization process that improves 2D mask quality while maintaining 3D ID consistency [10].

Group 3: Model Architecture
- The IGGT architecture consists of a unified transformer that processes image tokens through attention modules to produce a unified token representation [14].
- The model has dual decoding heads for geometry and instance predictions, with a cross-modal fusion block that enhances spatial perception [17].
- The model uses a multi-view contrastive loss to learn 3D-consistent instance features from 2D inputs [15].

Group 4: Performance and Applications
- IGGT is the first model capable of performing reconstruction, understanding, and tracking simultaneously, with significant improvements on understanding and tracking metrics [18].
- On instance 3D tracking, IGGT reaches a tracking IoU of 70% and a success rate of 90%, and it is the only model able to track objects that disappear and reappear [19].
- The model supports multiple applications, including instance spatial tracking, open-vocabulary semantic segmentation, and QA scene grounding, enabling complex object-centric queries in 3D scenes [23][30].
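
To make the multi-view contrastive loss mentioned under the model architecture more concrete, here is a minimal sketch of one common pull/push formulation. It is not taken from the IGGT paper: the function names, the `margin` hyperparameter, and the squared-distance form are all assumptions chosen for illustration. The idea is that pixel features sharing a 3D-consistent instance ID are pulled together regardless of which view they come from, while features from different instances are pushed at least `margin` apart.

```python
import math
from itertools import combinations


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def multiview_contrastive_loss(features, instance_ids, margin=1.0):
    """Illustrative pull/push contrastive loss (hypothetical, not IGGT's exact loss).

    features     : list of feature vectors, one per sampled pixel (any view)
    instance_ids : 3D-consistent instance ID for each feature
    margin       : assumed hyperparameter; features of different instances
                   are penalized only when closer than this distance
    """
    pull, push = [], []
    for i, j in combinations(range(len(features)), 2):
        d = l2_distance(features[i], features[j])
        if instance_ids[i] == instance_ids[j]:
            # Same instance seen in (possibly) different views: pull together.
            pull.append(d ** 2)
        else:
            # Different instances: push apart until at least `margin` away.
            push.append(max(0.0, margin - d) ** 2)
    pull_term = sum(pull) / len(pull) if pull else 0.0
    push_term = sum(push) / len(push) if push else 0.0
    return pull_term + push_term


# Well-separated embedding: two pixels of instance 7 coincide, instance 9 is far away.
good = multiview_contrastive_loss([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]], [7, 7, 9])

# Poor embedding: the two pixels of instance 7 are far apart,
# and one of them coincides with a pixel of instance 9.
bad = multiview_contrastive_loss([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]], [7, 7, 9])
```

In this toy example the well-separated embedding incurs no loss, while the poor embedding is penalized by both the pull term and the push term. A real training setup would compute this over sampled pixels from many views in a differentiable framework rather than over Python lists.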