3D Scene Understanding
SceneSplat: Scene Understanding and Vision-Language Pretraining on 3DGS, a Leap That Lets 3D Gaussians "Understand Natural Language"
机器之心· 2025-09-07 08:21
Core Insights
- The article introduces SceneSplat, the first end-to-end, large-scale 3D indoor scene understanding method that operates natively on 3D Gaussian Splatting (3DGS) representations [2][6]
- A self-supervised learning scheme unlocks rich 3D feature learning from unlabelled scenes, addressing the lack of models that can learn semantics directly from 3D data [2][6]
- The SceneSplat-7K dataset is created, consisting of 7,916 scenes sourced from seven existing datasets, enabling effective training and evaluation of the SceneSplat model [2][6]

Dataset Construction
- SceneSplat-7K contains 7,916 processed 3DGS scenes with a total of 11.27 billion Gaussian points, an average of roughly 1.42 million Gaussians per scene [6][7]
- Building the dataset consumed compute equivalent to 150 days of running on NVIDIA L4 GPUs, while maintaining high reconstruction quality: a PSNR of 29.64 dB and an average Depth-L1 of 0.035 m [6][7]

Semantic Annotation
- A stable, fast pipeline annotates semantic information on 3DGS, using SAMv2 for object-level segmentation and SigLIP2 for extracting vision-language features [8][10]
- The pre-trained encoder learns rich semantic representations solely from 3DGS parameters and neighborhood information, eliminating the need for 2D fusion at inference time [8][10]

Training Methodology
- Two training routes are provided: vision-language pre-training for labelled data and self-supervised training for unlabelled data, maximizing what can be learned from unlabelled scenes [12][14]
- The model uses a hierarchical Transformer architecture with Gaussian tokens and neighborhood attention to regress language-aligned semantic vectors per Gaussian; a zero-shot labeling sketch follows this summary [15]

Experimental Results
- SceneSplat achieves state-of-the-art (SOTA) zero-shot semantic segmentation results on datasets such as ScanNet200, ScanNet++, and Matterport3D [21][22]
- Quantitative experiments show significant gains in mean Intersection over Union (mIoU) and mean Accuracy (mAcc) across datasets, demonstrating the model's robustness [22][23]

Future Work
- The SceneSplat-7K dataset is being expanded to SceneSplat-49K, with ongoing benchmarking of 3DGS and semantic integration across multiple datasets [31]
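The sketch below illustrates the zero-shot labeling step implied above: once a pretrained encoder has regressed a language-aligned feature vector for each Gaussian, labels can be assigned by cosine similarity against text embeddings of the class names. This is a minimal illustration, not the authors' code; names such as `gaussian_feats` and the use of random tensors in the usage snippet are assumptions.

```python
import torch
import torch.nn.functional as F


def zero_shot_segment(gaussian_feats: torch.Tensor,
                      text_embeds: torch.Tensor) -> torch.Tensor:
    """Assign each Gaussian the label of the closest class-name embedding.

    gaussian_feats: (N, D) language-aligned features regressed per Gaussian
                    (hypothetical output of the pretrained 3DGS encoder).
    text_embeds:    (C, D) embeddings of C class-name prompts, e.g. from a
                    SigLIP2-style text encoder.
    Returns:        (N,) predicted class indices.
    """
    g = F.normalize(gaussian_feats, dim=-1)   # unit-length per-Gaussian features
    t = F.normalize(text_embeds, dim=-1)      # unit-length class-prompt embeddings
    sim = g @ t.T                             # (N, C) cosine similarities
    return sim.argmax(dim=-1)                 # label = most similar class name


# Toy usage: random tensors stand in for real encoder/text-encoder outputs.
if __name__ == "__main__":
    N, C, D = 1_000, 20, 512
    feats = torch.randn(N, D)
    prompts = torch.randn(C, D)
    labels = zero_shot_segment(feats, prompts)
    print(labels.shape)  # torch.Size([1000])
```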
New Research from Terminus (特斯联) Focuses on 3D Scene Understanding, Accepted by IEEE T-PAMI
IPO早知道· 2025-05-13 01:55
Core Insights
- The article discusses new research from Dr. Shao Ling and his team at Terminus (特斯联): a framework called Laser for efficient language-guided segmentation, strengthening 3D scene understanding in real-time semantic parsing applications [2]

Group 1: Applications in Autonomous Driving and Robotics
- The Laser framework is particularly useful for autonomous vehicles and mobile robots, which must quickly grasp the 3D structure and semantics of their surroundings to navigate and act safely. Laser trains in only 11 minutes, compared with 158 minutes for traditional methods, allowing 3D semantic maps to be built rapidly. Its low-rank attention mechanism also identifies fine-grained features such as road edges and lane markings, reducing misjudgments caused by ambiguous boundaries; a generic low-rank attention sketch follows this summary [2]

Group 2: Applications in Augmented Reality (AR) and Virtual Reality (VR)
- In AR and VR, precisely overlaying virtual objects onto real scenes requires a deep understanding of 3D spatial semantics. The framework aligns virtual objects with annotated real-scene elements (e.g., walls, tables) across viewpoints, preventing visual mismatches, and it can distinguish similarly colored objects so virtual items are placed more sensibly. Combined with 3D Gaussian rendering, it enables real-time semantic AR effects [4]

Group 3: Applications in Urban Planning and Architectural Modeling
- In urban digital modeling, the framework supports semantic labeling of buildings, vegetation, and public facilities to inform planning decisions. It enables open-vocabulary segmentation of rare objects (e.g., ornaments on historic architecture, special signage), broadening annotation coverage. Using Laser, semantically labeled 3D models can also be generated from multi-view images without manual annotation [5]
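The article only names Laser's "low-rank attention mechanism" without detailing it, so the block below is a generic, Linformer-style sketch of low-rank attention (keys and values compressed along the token axis to a small rank), offered as one plausible reading rather than Laser's actual formulation. All module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankAttention(nn.Module):
    """Attention whose key/value token axis is compressed to `rank` landmarks,
    reducing cost from O(N^2) to O(N * rank). Assumes a fixed token count."""

    def __init__(self, dim: int, num_tokens: int, rank: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Learned projections that compress N tokens down to `rank` landmarks.
        self.k_proj = nn.Parameter(torch.randn(rank, num_tokens) / num_tokens ** 0.5)
        self.v_proj = nn.Parameter(torch.randn(rank, num_tokens) / num_tokens ** 0.5)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) per-point / per-Gaussian features, N == num_tokens
        q, k, v = self.q(x), self.k(x), self.v(x)
        k = self.k_proj @ k                                   # (B, rank, dim)
        v = self.v_proj @ v                                   # (B, rank, dim)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, N, rank)
        return attn @ v                                       # (B, N, dim)


# Toy usage with random features.
if __name__ == "__main__":
    layer = LowRankAttention(dim=64, num_tokens=1024, rank=32)
    x = torch.randn(2, 1024, 64)
    print(layer(x).shape)  # torch.Size([2, 1024, 64])
```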
Understanding 3D Without Massive Annotation! New Research Selected as an ICLR 2025 Spotlight
量子位· 2025-03-07 07:12
Core Insights
- The article presents a novel multimodal few-shot 3D segmentation approach that allows models to accurately segment 3D scenes from minimal labeled samples [1][6][36]
- The method integrates text, 2D, and 3D information without incurring additional labeling costs, enabling rapid adaptation to new categories [2][14][36]

Group 1: Importance of 3D Scene Understanding
- Accurate understanding of 3D scenes is crucial for applications such as humanoid robots, VR/AR, and autonomous vehicles [3][7]
- Traditional supervised models require extensive labeled 3D data, leading to high time and resource costs [4][9]

Group 2: Few-shot Learning and Its Limitations
- Few-shot learning is a promising solution but has so far been limited to unimodal point cloud data, overlooking the potential of multimodal information [5][13]
- The new research fills this gap by proposing a multimodal few-shot 3D segmentation setting [6][36]

Group 3: The MM-FSS Model
- The proposed model, MultiModal Few-Shot SegNet (MM-FSS), leverages multimodal information to strengthen learning and generalization on new categories [15][16][36]
- MM-FSS uses an Intermodal Feature (IF) head and a Unimodal Feature (UF) head for feature extraction, aligning 3D point cloud features with 2D visual features [22][23]

Group 4: Methodology and Innovations
- The model first undergoes a cross-modal alignment pre-training phase, so that during few-shot learning it can exploit the learned intermodal features without requiring additional 2D inputs; a sketch of such an alignment objective follows this summary [23][24]
- Multimodal Correlation Fusion (MCF) and Multimodal Semantic Fusion (MSF) modules then aggregate visual and semantic information effectively [25][27]

Group 5: Performance and Results
- Experiments on standard FS-PCS datasets demonstrate that MM-FSS achieves superior performance across various few-shot tasks, outperforming existing methods [34][35]
- The model shows significant improvements in novel-class segmentation and generalization [35][36]

Group 6: Future Directions
- The research opens new avenues for improving performance, optimizing training and inference efficiency, and exploiting modal information more deeply [38][37]
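As a hedged sketch of the cross-modal pre-training idea described above, the block below aligns per-point features from a 3D branch's intermodal head with 2D vision-encoder features back-projected onto the same points, so that no 2D input is needed later. The loss choice (negative cosine similarity with a stop-gradient on the 2D targets) and all names are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F


def cross_modal_alignment_loss(point_feats_3d: torch.Tensor,
                               point_feats_2d: torch.Tensor) -> torch.Tensor:
    """Negative mean cosine similarity between per-point 3D and 2D features.

    point_feats_3d: (N, D) features from the 3D branch's intermodal head
                    (hypothetical IF-head output).
    point_feats_2d: (N, D) 2D vision-encoder features projected onto the same
                    N points, treated as fixed targets.
    """
    p3 = F.normalize(point_feats_3d, dim=-1)
    p2 = F.normalize(point_feats_2d.detach(), dim=-1)  # stop-grad on 2D targets
    return -(p3 * p2).sum(dim=-1).mean()


# Toy usage: random tensors stand in for the two branches' outputs.
if __name__ == "__main__":
    feats_3d = torch.randn(4096, 384, requires_grad=True)
    feats_2d = torch.randn(4096, 384)
    loss = cross_modal_alignment_loss(feats_3d, feats_2d)
    loss.backward()
    print(float(loss))
```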