Training 100 Million Gaussians on a Single GPU to Reconstruct a 25 km² City: 3DGS's Memory Wall Broken by a CPU "Plug-in"
量子位 (QbitAI) · 2025-12-23 04:16

Core Viewpoint
- The article discusses the introduction of CLM (CPU-offloaded Large-scale 3DGS training), a system that enables city-scale 3D reconstruction on a single consumer-grade GPU, specifically the RTX 4090, by offloading memory-intensive parameters to CPU memory, significantly lowering the hardware requirements for large-scale neural rendering [1][21].

Group 1: 3D Gaussian Splatting (3DGS) Challenges
- 3DGS has become a crucial technology in neural rendering thanks to its high-quality output and rendering speed, but it faces significant challenges on complex scenes such as urban blocks, primarily due to GPU memory limitations [2].
- A high-precision 3DGS model typically contains tens of millions to over a hundred million Gaussian points, each requiring substantial memory for parameters, gradients, and optimizer states. Even a high-end GPU like the RTX 4090, with 24GB of memory, can handle only about 15-20 million points, which is insufficient for city-scale scenes [2][3].

Group 2: CLM Design Principles
- CLM is based on the observation that only a small fraction of Gaussian points is actively used during each rendering pass; in large scenes, fewer than 1% of points are accessed [3].
- Accordingly, CLM dynamically loads Gaussian parameters from CPU memory as they are needed, rather than keeping all parameters resident in GPU memory [4].

Group 3: Key Mechanisms of CLM
- Attribute Segmentation: CLM retains only the "key attributes" (10 parameters) necessary for visibility checks in GPU memory, while the remaining roughly 80% of "non-key attributes" are stored in CPU memory and loaded on demand [6][7].
- Pre-rendering Visibility Culling: Unlike traditional methods, CLM computes the indices of visible Gaussian points before rendering, so only visible points are loaded from CPU memory, reducing unnecessary GPU computation and memory usage [9][10].
- Efficient CPU-GPU Collaboration: CLM employs a multi-layered design, including micro-batching, caching mechanisms, and intelligent scheduling, to hide data-transfer latency and minimize communication overhead [12][13][14][15].

Group 4: Performance Results
- CLM significantly increases trainable model size: on the "MatrixCity BigCity" dataset it trains 102.2 million Gaussian points, a 6.7-fold increase over traditional methods, which maxed out at 15.3 million points [16].
- Reconstruction quality improves with more parameters: the 102.2-million-point model achieves a PSNR of 25.15dB, compared to 23.93dB for the smaller model [18].
- Despite communication overhead, CLM maintains a training throughput of 55% to 90% of the enhanced baseline on the RTX 4090, and 86% to 97% on the slower RTX 2080 Ti [19].

Group 5: Broader Implications
- CLM represents a significant advance in addressing deployment bottlenecks in 3DGS training, integrating CPU resources into the training process without the need for multi-GPU setups, thus providing a cost-effective solution for large-scale scene reconstruction [21].
- The growing demand for efficient, low-cost 3D reconstruction tools in applications like digital twins and large-scale map reconstruction highlights the importance of CLM's approach of optimizing existing computational resources [21].
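The attribute-segmentation and pre-rendering culling mechanisms described above can be sketched in a few lines. The following is a minimal NumPy illustration, not CLM's actual implementation: the 10-float key-attribute layout, the axis-aligned view volume standing in for real frustum culling, and all names and shapes are assumptions for the sake of the example.

```python
import numpy as np

np.random.seed(0)
N = 1_000_000  # total Gaussian points in the scene

# "GPU-resident" key attributes (hypothetical layout): e.g. 3D position (3),
# scale (3), compressed rotation (3), opacity (1) -> 10 floats per point.
key_attrs = np.random.rand(N, 10).astype(np.float32)

# "CPU-resident" non-key attributes: e.g. spherical-harmonics color
# coefficients, the bulk (~80%) of per-point storage.
non_key_attrs = np.random.rand(N, 45).astype(np.float32)

def visible_indices(key_attrs, cam_min, cam_max):
    """Pre-rendering culling: test positions (the first 3 key attributes)
    against an axis-aligned view volume and return the visible indices."""
    pos = key_attrs[:, :3]
    mask = np.all((pos >= cam_min) & (pos <= cam_max), axis=1)
    return np.nonzero(mask)[0]

def gather_visible(non_key_attrs, idx):
    """On-demand load: copy only the visible points' non-key attributes
    (in a real system, a pinned-memory -> GPU asynchronous transfer)."""
    return non_key_attrs[idx]

# A narrow view volume touches only a small fraction of the scene.
idx = visible_indices(key_attrs,
                      cam_min=np.zeros(3),
                      cam_max=np.full(3, 0.2))
batch = gather_visible(non_key_attrs, idx)

frac = len(idx) / N
print(f"visible fraction: {frac:.4f}, transferred shape: {batch.shape}")
```

With points scattered uniformly and a view volume covering 0.2 of each axis, roughly 0.8% of points pass the cull, so only that sliver of the non-key table is ever copied per pass, mirroring the under-1% access pattern the article cites as CLM's core observation.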
