Core Insights
- The article discusses the challenges AI faces in perceiving 3D geometry and semantic content, highlighting the limitations of traditional methods that separate 3D reconstruction from spatial understanding. A new approach, IGGT (Instance-Grounded Geometry Transformer), integrates both aspects in a unified model and improves performance across a range of tasks [1].

Group 1: IGGT Model Development
- IGGT is an end-to-end, large, unified Transformer that combines spatial reconstruction and instance-level contextual understanding in a single model [1].
- The model is trained on a new large-scale dataset, InsScene-15K, which covers 15,000 scenes and 200 million images with high-quality, 3D-consistent instance-level masks [2].
- IGGT introduces the "Instance-Grounded Scene Understanding" paradigm, which decouples the model from any specific vision-language model (VLM) and enables seamless integration with various VLMs and large multimodal models (LMMs) [3].

Group 2: Applications and Capabilities
- The unified representation significantly expands downstream capabilities, supporting spatial tracking, open-vocabulary segmentation, and scene question answering (QA) [4].
- The architecture includes a Geometry Head for predicting camera parameters and depth maps and an Instance Head for decoding instance features, strengthening spatial perception [11][18]; a minimal illustrative sketch of this two-head layout is given after this summary.
- IGGT achieves strong results on instance 3D tracking, with tracking IoU and success rates reaching 70% and 90%, respectively, and is reported as the only model able to keep tracking objects that disappear and reappear [16].

Group 3: Data Collection and Processing
- The InsScene-15K dataset is built through a novel data curation process that integrates three data sources: synthetic data, real-world video capture, and RGB-D capture [6][9][10].
- Synthetic data generated in simulated environments provides perfectly accurate segmentation masks, while real-world data is processed through a custom pipeline to ensure temporal consistency [8][9].

Group 4: Performance Comparison
- IGGT outperforms existing models on reconstruction, understanding, and tracking tasks, with significant gains on understanding and tracking metrics compared to other models [16].
- The model's instance masks can serve as prompts for VLMs, enabling open-vocabulary semantic segmentation and complex object-centric question answering [19][24]; a sketch of such mask-prompted labeling is also included after this summary.
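To make the Geometry Head / Instance Head split concrete, below is a minimal PyTorch-style sketch assuming a ViT-like shared backbone. The module names, dimensions, camera parameterization (quaternion + translation), and per-patch unfolding are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not the authors' code): a shared transformer backbone with two
# decoding heads, mirroring the Geometry Head / Instance Head split described above.
import torch
import torch.nn as nn


class UnifiedSceneTransformer(nn.Module):
    def __init__(self, embed_dim=768, depth=12, num_heads=12, patch=16, inst_dim=64):
        super().__init__()
        # Shared patch embedding + transformer encoder over the input frame(s).
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # Geometry Head: camera parameters and a dense depth map.
        self.camera_head = nn.Linear(embed_dim, 7)              # assumed: quaternion + translation
        self.depth_head = nn.Linear(embed_dim, patch * patch)   # per-patch depth, unfolded below

        # Instance Head: per-pixel embeddings whose clusters give 3D-consistent instance masks.
        self.instance_head = nn.Linear(embed_dim, inst_dim * patch * patch)
        self.patch = patch
        self.inst_dim = inst_dim

    def forward(self, images):                       # images: (B, 3, H, W)
        B, _, H, W = images.shape
        tokens = self.patch_embed(images)            # (B, D, H/p, W/p)
        hp, wp = tokens.shape[-2:]
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, N, D)
        feats = self.encoder(tokens)

        cameras = self.camera_head(feats.mean(dim=1))            # (B, 7) pooled camera params
        depth = self.depth_head(feats)                           # (B, N, p*p)
        depth = depth.view(B, hp, wp, self.patch, self.patch)
        depth = depth.permute(0, 1, 3, 2, 4).reshape(B, 1, H, W)

        inst = self.instance_head(feats)                         # (B, N, inst_dim*p*p)
        inst = inst.view(B, hp, wp, self.inst_dim, self.patch, self.patch)
        inst = inst.permute(0, 3, 1, 4, 2, 5).reshape(B, self.inst_dim, H, W)
        return cameras, depth, inst


if __name__ == "__main__":
    model = UnifiedSceneTransformer()
    cams, depth, inst_feats = model(torch.randn(2, 3, 224, 224))
    print(cams.shape, depth.shape, inst_feats.shape)
    # The instance features could then be clustered into masks and each mask used as a
    # visual prompt for an off-the-shelf VLM for open-vocabulary labeling or scene QA.
```

The point of the sketch is the decoupling the article describes: geometry and per-pixel instance embeddings come from one shared encoder, and the instance embeddings can later be grouped into masks without committing to any particular VLM.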
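The claim that instance masks can serve as prompts for VLMs can be illustrated with a generic recipe: mask out one instance, crop it, and score it against an arbitrary label list with an off-the-shelf CLIP model. This is a hedged sketch of that general idea, not IGGT's actual pipeline; the image, mask, and label list are placeholders.

```python
# Hypothetical sketch (not IGGT code): label one predicted instance mask against an
# open-vocabulary label list using a generic CLIP model from Hugging Face.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


def label_instance(image: Image.Image, mask: np.ndarray, labels: list[str]) -> str:
    """Return the label whose text embedding best matches the masked region."""
    assert mask.any(), "mask must contain at least one instance pixel"

    # Black out everything outside the instance mask, then crop to its bounding box.
    arr = np.array(image)
    arr[~mask] = 0
    ys, xs = np.where(mask)
    crop = Image.fromarray(arr[ys.min():ys.max() + 1, xs.min():xs.max() + 1])

    # Score the crop against every candidate label with a standard CLIP model.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=labels, images=crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (1, num_labels)
    return labels[int(logits.argmax())]
```

The same mask-as-prompt idea extends naturally to object-centric QA: instead of scoring against a fixed label list, the cropped instance can be passed to a multimodal LLM together with a free-form question.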
The first 3D reconstruction model with instance understanding: NTU & 阶越 propose an instance-decoupling-based 3D reconstruction model to aid scene understanding
36Kr·2025-10-31 08:28