TriMap视频扩散模型

Search documents
两张图就能重构3D空间?清华&NTU利用生成模型解锁空间智能新范式
量子位· 2025-07-09 01:18
Core Viewpoint - LangScene-X introduces a generative framework that enables the construction of generalized 3D language-embedded scenes using only sparse views, significantly reducing the number of required input images compared to traditional methods like NeRF, which typically need over 20 views [2][5]. Group 1: Challenges in 3D Language Scene Generation - The current 3D language scene generation faces three core challenges: the contradiction between dense view dependency and sparse input absence, leading to severe 3D structure artifacts and semantic distortion when using only 2-3 images [5]. - There is a disconnection in cross-modal information and a lack of 3D consistency, as existing models process appearance, geometry, and semantics independently, resulting in semantic misalignment [6]. - High-dimensional compression of language features and the bottleneck in generalization capabilities hinder practical applications, with existing methods showing a significant drop in accuracy when switching scenes [7]. Group 2: Solutions Offered by LangScene-X - LangScene-X employs the TriMap video diffusion model, which allows for unified multimodal generation under sparse input conditions, achieving significant improvements in RGB and normal consistency errors and semantic mask boundary accuracy [8]. - The Language Quantization Compressor (LQC) revolutionizes high-dimensional feature compression, mapping high-dimensional CLIP features to 3D discrete indices with minimal reconstruction error, enhancing cross-scene transferability [9][10]. - The model integrates a progressive training strategy that ensures the seamless generation of RGB images, normal maps, and semantic segmentation maps, thus improving the efficiency of 3D reconstruction processes [14]. Group 3: Spatial Intelligence and Performance Metrics - LangScene-X enhances spatial intelligence by accurately aligning text prompts with 3D scene surfaces, allowing for natural language queries to identify objects within 3D environments [15]. - Empirical results demonstrate that LangScene-X achieves an overall mean accuracy (mAcc) of 80.85% and a mean intersection over union (mIoU) of 50.52% on the LERF-OVS dataset, significantly outperforming existing methods [16]. - The model's capabilities position it as a potential core driver for applications in VR scene construction, human-computer interaction, and foundational technologies for autonomous driving and embodied intelligence [18].