First 3D reconstruction model with instance understanding: NTU & StepFun propose an instance-decoupled 3D reconstruction model to advance scene understanding
36Kr · 2025-10-31 08:28
Core Insights
- The article discusses the challenges AI faces in perceiving 3D geometry and semantic content, highlighting the limitations of traditional methods that separate 3D reconstruction from spatial understanding. A new approach, IGGT (Instance-Grounded Geometry Transformer), integrates both aspects into a unified model for improved performance across tasks [1].

Group 1: IGGT Model Development
- IGGT is a large, end-to-end unified Transformer that combines spatial reconstruction and instance-level contextual understanding in a single model [1].
- The model is trained on a new large-scale dataset, InsScene-15K, which includes 15,000 scenes and 200 million images with high-quality, 3D-consistent instance-level masks [2].
- IGGT introduces the "Instance-Grounded Scene Understanding" paradigm, allowing it to operate independently of any specific vision-language model (VLM) and enabling seamless integration with various VLMs and large multimodal models (LMMs) [3].

Group 2: Applications and Capabilities
- The unified representation significantly expands IGGT's downstream capabilities, supporting spatial tracking, open-vocabulary segmentation, and scene question answering (QA) [4].
- The architecture includes a Geometry Head for predicting camera parameters and depth maps, and an Instance Head for decoding instance features, enhancing spatial perception (a minimal sketch of this dual-head design follows this summary) [11][18].
- IGGT achieves high performance in instance 3D tracking, with tracking IoU and success rates reaching 70% and 90%, respectively, and is the only model able to keep tracking objects that disappear and reappear [16].

Group 3: Data Collection and Processing
- The InsScene-15K dataset is constructed through a novel data-curation process that integrates three data sources: synthetic data, real-world video capture, and RGBD capture [6][9][10].
- Synthetic data is generated in simulated environments, providing perfectly accurate segmentation masks, while real-world data is processed through a custom pipeline to ensure temporal consistency [8][9].

Group 4: Performance Comparison
- IGGT outperforms existing models in reconstruction, understanding, and tracking tasks, with significant improvements in the understanding and tracking metrics [16].
- The model's instance masks can serve as prompts for VLMs, enabling open-vocabulary semantic segmentation and complex object-centric question answering (see the mask-prompting sketch after this summary) [19][24].
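The dual-head design described in Group 2 can be made concrete. Below is a minimal PyTorch sketch of one way such an architecture could look; the module names, dimensions, camera parameterization, and fusion step are illustrative assumptions, not the released IGGT implementation.

```python
# Minimal sketch of a unified-token model with geometry and instance heads.
# All names, sizes, and the fusion design are assumptions for illustration.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Instance tokens attend to geometry features (cross-modal fusion)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, inst_tokens, geo_tokens):
        fused, _ = self.attn(inst_tokens, geo_tokens, geo_tokens)
        return self.norm(inst_tokens + fused)

class DualHeadSceneModel(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 4, inst_dim: int = 64):
        super().__init__()
        # Shared backbone: multi-view image tokens -> unified representation.
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.geo_proj = nn.Linear(dim, dim)
        # Geometry head: per-token depth plus pooled per-view camera params.
        self.depth_head = nn.Linear(dim, 1)
        self.camera_head = nn.Linear(dim, 7)   # e.g. quaternion + translation
        # Instance head: per-token embeddings, grouped into masks downstream.
        self.fusion = CrossModalFusion(dim)
        self.instance_head = nn.Linear(dim, inst_dim)

    def forward(self, tokens):                 # tokens: (B, N, dim)
        unified = self.backbone(tokens)
        geo = self.geo_proj(unified)
        depth = self.depth_head(geo)                           # (B, N, 1)
        camera = self.camera_head(geo.mean(dim=1))             # (B, 7)
        inst = self.instance_head(self.fusion(unified, geo))   # (B, N, inst_dim)
        return depth, camera, inst

tokens = torch.randn(2, 196, 256)              # two views of 14x14 patch tokens
depth, camera, inst = DualHeadSceneModel()(tokens)
print(depth.shape, camera.shape, inst.shape)
```

Grouping the per-token instance embeddings into masks (for example by clustering) is what would let a single representation serve reconstruction, segmentation, and tracking alike.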
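Group 4's use of instance masks as prompts for a VLM can also be sketched: cut each instance out of the image and score it against text labels with a CLIP-style encoder pair. The `encode_image`/`encode_text` functions below are hypothetical placeholders, not a real VLM API.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for a CLIP-style VLM; a real encoder pair would
# replace both of these placeholder functions.
def encode_image(crops: torch.Tensor) -> torch.Tensor:      # (K,3,H,W) -> (K,D)
    return F.normalize(crops.flatten(1)[:, :128], dim=-1)   # placeholder

def encode_text(labels: list[str]) -> torch.Tensor:         # -> (L,D)
    return F.normalize(torch.randn(len(labels), 128), dim=-1)  # placeholder

def label_instances(image, masks, labels):
    """image: (3,H,W); masks: (K,H,W) binary instance masks from the model."""
    crops = image.unsqueeze(0) * masks.unsqueeze(1)          # isolate instances
    sims = encode_image(crops) @ encode_text(labels).t()     # (K, L) scores
    return sims.argmax(dim=1)                                # best label per mask

image = torch.rand(3, 32, 32)
masks = (torch.rand(4, 32, 32) > 0.5).float()
labels = ["chair", "table", "lamp", "sofa", "door"]
print(label_instances(image, masks, labels))  # one label index per mask
```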
First 3D reconstruction model with instance understanding! NTU & StepFun propose an instance-decoupled 3D reconstruction model to advance scene understanding
量子位 (QbitAI) · 2025-10-31 04:09
Core Insights
- The article discusses the challenge AI faces in simultaneously understanding the geometric structure and semantic content of the 3D world, something humans perceive naturally. Traditional methods separate 3D reconstruction from spatial understanding, leading to errors and limited generalization. IGGT (Instance-Grounded Geometry Transformer) aims to unify these processes in a single model [1][2].

Group 1: IGGT Framework
- IGGT is an end-to-end unified framework that integrates spatial reconstruction and instance-level contextual understanding within a single model [2].
- A new large-scale dataset, InsScene-15K, has been created, containing 15,000 scenes and 200 million images with high-quality, 3D-consistent instance-level masks [2][5].
- The model introduces the "Instance-Grounded Scene Understanding" paradigm, generating instance masks that integrate seamlessly with various vision-language models (VLMs) and large multimodal models (LMMs) [2][18].

Group 2: Data Collection Process
- The InsScene-15K dataset is constructed through a novel SAM2-driven data-curation process that integrates three different data sources [5].
- Synthetic data is generated in simulated environments, providing perfectly accurate RGB images, depth maps, camera poses, and object-level segmentation masks [8].
- Real-world video collection uses a custom SAM2 pipeline that generates dense initial mask proposals and propagates them over time, ensuring high temporal consistency (the first sketch after this summary mimics this bookkeeping) [9].
- Real-world RGBD collection applies a mask-optimization step to improve 2D mask quality while maintaining 3D ID consistency [10].

Group 3: Model Architecture
- The IGGT architecture consists of a unified transformer that processes image tokens through attention modules to build a powerful unified token representation [14].
- It features dual decoding heads for geometry and instance predictions, with a cross-modal fusion block to enhance spatial perception [17].
- The model uses a multi-view contrastive loss to learn 3D-consistent instance features from 2D inputs (the second sketch after this summary shows one such loss) [15].

Group 4: Performance and Applications
- IGGT is the first model capable of simultaneously performing reconstruction, understanding, and tracking, showing significant improvements in the understanding and tracking metrics [18].
- In instance 3D tracking, IGGT reaches a tracking IoU of 70% and a success rate of 90%, and is the only model able to keep tracking objects that disappear and reappear [19].
- The model supports multiple applications, including instance spatial tracking, open-vocabulary semantic segmentation, and QA scene grounding, allowing complex object-centric queries in 3D scenes [23][30].
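The SAM2-driven pipeline in Group 2 boils down to: propose dense masks on a key frame, then carry each mask's ID forward frame by frame so identities stay stable over time. The sketch below mimics that control flow with a trivial greedy IoU matcher; it is not the SAM2 API, just an illustration of the ID bookkeeping around a propagator.

```python
import torch

def iou(a: torch.Tensor, b: torch.Tensor) -> float:
    inter = (a & b).sum().item()
    union = (a | b).sum().item()
    return inter / union if union else 0.0

def propagate_ids(frames_masks, thresh=0.5):
    """frames_masks: list of (K, H, W) bool tensors, one per frame.
    Greedily match each mask to the previous frame by IoU so instance IDs
    stay stable over time (a stand-in for SAM2's learned propagation)."""
    next_id, prev, all_ids = 0, [], []
    for masks in frames_masks:
        cur, used = [], set()
        for m in masks:
            cand = [(pid, pm) for pid, pm in prev if pid not in used]
            best = max(cand, key=lambda p: iou(p[1], m), default=None)
            if best is not None and iou(best[1], m) >= thresh:
                cur.append((best[0], m))       # carry the existing ID forward
                used.add(best[0])
            else:
                cur.append((next_id, m))       # new (or re-appearing) object
                next_id += 1
        prev = cur
        all_ids.append([pid for pid, _ in cur])
    return all_ids

frames = [torch.rand(3, 16, 16) > 0.5 for _ in range(4)]   # toy binary masks
print(propagate_ids(frames))
```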
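The multi-view contrastive objective in Group 3 can also be made concrete: per-pixel features sampled from two views are pulled together when they carry the same instance ID and pushed apart otherwise. The function name, temperature, and sampling scheme below are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def multiview_instance_contrastive(feats_a, feats_b, ids_a, ids_b, tau=0.07):
    """InfoNCE-style loss over per-pixel features sampled from two views.

    feats_*: (N, D) features from each view; ids_*: (N,) instance IDs.
    Pixels sharing an instance ID across views are treated as positives.
    """
    fa = F.normalize(feats_a, dim=-1)
    fb = F.normalize(feats_b, dim=-1)
    logits = fa @ fb.t() / tau                           # (N, N) similarities
    pos = (ids_a.unsqueeze(1) == ids_b.unsqueeze(0)).float()
    log_prob = logits.log_softmax(dim=1)
    per_anchor = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    has_pos = pos.sum(1) > 0                             # skip unmatched anchors
    return per_anchor[has_pos].mean()

# Toy usage: 6 sampled pixels per view, 3 instances, 64-dim features.
feats_a, feats_b = torch.randn(6, 64), torch.randn(6, 64)
ids_a = torch.tensor([0, 0, 1, 1, 2, 2])
ids_b = torch.tensor([0, 1, 1, 2, 2, 0])
print(multiview_instance_contrastive(feats_a, feats_b, ids_a, ids_b))
```

Minimizing this loss makes features of the same object agree across viewpoints, which is what lets 2D predictions stay 3D-consistent.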
Breaking: UCLA's Bolei Zhou has also joined a robotics company
具身智能之心 · 2025-10-16 00:03
Core Insights
- Coco Robotics has appointed Bolei Zhou, a UCLA associate professor, as Chief AI Scientist to lead the newly established Physical AI Lab, which focuses on autonomous sidewalk-delivery solutions [2][3][5].
- The company aims to achieve full automation in last-mile delivery, leveraging the operational data collected over the past five years to improve its robotic systems [4][5][7].
- The Physical AI Lab is an independent research initiative, separate from Coco Robotics' collaboration with OpenAI, and will focus on improving the company's automation capabilities and operational efficiency [8][9].

Group 1: Company Overview
- Coco Robotics, founded in 2020, specializes in last-mile delivery robotics and initially relied on teleoperators to navigate around obstacles [4].
- The company has accumulated millions of miles of data in complex urban environments, which is crucial for training reliable AI systems [7].
- The goal is to reduce overall delivery costs while improving service quality for businesses and consumers [9].

Group 2: Leadership and Research Focus
- Bolei Zhou's expertise in machine perception and intelligent decision-making aligns with Coco Robotics' objectives, particularly in micromobility [7][8].
- Zhou has a strong academic record, with over 100 published papers and significant citation impact, particularly in explainable AI and scene understanding [12][14].
- The Physical AI Lab will use its research findings to improve Coco's local models and may share insights with the cities it operates in to improve infrastructure [9].

Group 3: Data Utilization and Future Plans
- Coco Robotics plans to use the collected data to raise its automation level and operational efficiency rather than selling it to competitors [9].
- The success of the Physical AI Lab will be measured by the company's ability to deliver high-quality service at lower cost, which could drive significant growth in its ecosystem [9].