SceneSplat: 3DGS-Based Scene Understanding and Vision-Language Pretraining, a Leap Toward 3D Gaussians That "Understand Human Language"
机器之心·2025-09-07 08:21

Core Insights
- The article introduces SceneSplat, the first end-to-end, large-scale 3D indoor scene understanding method that operates natively on 3D Gaussian Splatting (3DGS) scenes [2][6]
- A self-supervised learning scheme is proposed to unlock rich 3D feature learning from unlabelled scenes, addressing the lack of models that can handle 3D data on their own for semantic learning [2][6]
- The SceneSplat-7K dataset is created, consisting of 7,916 scenes sourced from seven existing datasets, enabling effective training and testing of the SceneSplat model [2][6]

Dataset Construction
- SceneSplat-7K includes 7,916 processed 3DGS scenes with a total of 11.27 billion Gaussians, an average of approximately 1.42 million per scene [6][7]
- Building the dataset required computational resources equivalent to 150 days of running on L4 GPUs; the resulting reconstructions reach a PSNR of 29.64 dB and an average Depth-L1 error of 0.035 m [6][7]

Semantic Annotation
- A stable and fast pipeline annotates semantic information in 3DGS, using SAMv2 for object-level segmentation and SigLIP2 for extracting vision-language features [8][10]
- The pre-trained encoder learns rich semantic representations solely from 3DGS parameters and neighborhood information, eliminating the need for 2D feature fusion during inference [8][10]

Training Methodology
- Two training routes are provided: vision-language pre-training for labelled data and self-supervised training for unlabelled data, so that unlabelled scenes also contribute to learning [12][14]
- The model employs a hierarchical Transformer architecture that treats Gaussians as tokens and uses neighborhood attention to regress a semantic feature vector for each Gaussian [15]

Experimental Results
- SceneSplat achieves state-of-the-art (SOTA) results in zero-shot semantic segmentation on ScanNet200, ScanNet++, and Matterport3D [21][22]
- Quantitative experiments show significant improvements in mean Intersection over Union (mIoU) and mean Accuracy (mAcc) across these datasets, demonstrating the model's robustness [22][23]

Future Work
- The SceneSplat-7K dataset is being expanded to SceneSplat-49K, with ongoing benchmarking of 3DGS and semantic integration across multiple datasets [31]
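
To make the zero-shot evaluation mentioned under Experimental Results more concrete, here is a minimal sketch of how such an evaluation is commonly set up. It is not SceneSplat's actual code. It assumes that each Gaussian already carries a predicted feature vector, that every class name has a text embedding in the same space, and that labels are assigned by highest cosine similarity. The function names, the three toy classes, and all numbers are hypothetical; mIoU and mAcc are computed with their standard definitions over classes present in the ground truth.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def zero_shot_labels(gaussian_feats, text_embeds):
    """Assign each Gaussian the index of its most similar class embedding."""
    labels = []
    for feat in gaussian_feats:
        sims = [cosine(feat, t) for t in text_embeds]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels

def miou_macc(pred, gt, num_classes):
    """Mean IoU and mean per-class accuracy over classes present in gt."""
    ious, accs = [], []
    for c in range(num_classes):
        tp = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        fp = sum(1 for p, g in zip(pred, gt) if p == c and g != c)
        fn = sum(1 for p, g in zip(pred, gt) if p != c and g == c)
        if tp + fn == 0:
            continue  # class absent from ground truth, skip it
        ious.append(tp / (tp + fp + fn))
        accs.append(tp / (tp + fn))
    if not ious:
        return 0.0, 0.0
    return sum(ious) / len(ious), sum(accs) / len(accs)

# Toy stand-ins (assumptions): three class embeddings and four Gaussians.
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gaussian_feats = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.1, 0.2, 0.9],
    [0.6, 0.5, 0.0],
]
gt = [0, 1, 2, 1]

pred = zero_shot_labels(gaussian_feats, text_embeds)
miou, macc = miou_macc(pred, gt, num_classes=3)
print(pred, round(miou, 4), round(macc, 4))
```

In this toy case the last Gaussian is misclassified, which lowers the IoU of two classes at once (one gains a false positive, the other a false negative). This is why mIoU is typically the stricter of the two metrics.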