Core Insights
- The article discusses advances in video diffusion models, focusing on the One4D framework developed by a research team from the Hong Kong University of Science and Technology (HKUST), which unifies 4D generation and reconstruction in a single model [3][7].

Group 1: Background and Framework
- Video diffusion models have made significant progress in realism, dynamics, and controllability, but they often lack explicit modeling of 3D geometry, which limits their use in world-model-driven tasks [3].
- One4D is introduced as a unified framework for 4D generation and reconstruction, capable of synchronously outputting RGB videos and pointmaps (XYZ geometry videos) [3][7].
- The framework supports multiple input forms for 4D generation and reconstruction: single images, sparse frames, and complete videos [8].

Group 2: Key Features of One4D
- One4D produces multi-modal output (RGB and pointmap) and employs Decoupled LoRA Control (DLC) to keep RGB generation stable while learning geometric alignment [7][10].
- Unified Masked Conditioning (UMC) lets One4D handle different types of conditions in a single model, enabling smooth transitions between generation and reconstruction tasks [14][16].

Group 3: Training Data and Methodology
- Training One4D requires large-scale paired "appearance-geometry" data; a mix of synthetic and real data ensures both geometric accuracy and a realistic data distribution [16].
- Synthetic data is generated by game-engine rendering, providing stable supervision for pointmaps, while real data is sourced from publicly available videos and supplemented with geometric annotations produced by existing 4D reconstruction methods [17].

Group 4: Experimental Results
- One4D outperforms the 4DNeX model in user preference studies across consistency, dynamics, aesthetics, depth quality, and overall 4D coherence [19][20].
- In complete-video-to-4D reconstruction, One4D surpasses reconstruction-only methods such as MonST3R and CUT3R, demonstrating effective geometry reconstruction [22][24].
- The model also generates 4D structure from sparse video frames, indicating its potential for dynamic scene generation [29][30].

Group 5: Conclusion
- One4D extends video diffusion models to generate appearance and geometry simultaneously, addressing critical stability and alignment issues in multi-task training [31].
- The framework represents a significant step toward 4D worlds that can be understood and interacted with, providing foundational capabilities for next-generation world models and multi-modal content creation [31].
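The Unified Masked Conditioning idea summarized above can be illustrated with a toy sketch: a single model receives the full frame sequence plus a binary mask marking which frames are observed, so the same network handles image-to-4D, sparse-frames-to-4D, and video-to-4D purely by changing the mask. This is a minimal illustration of the conditioning scheme only, not the One4D implementation; the function name `build_umc_condition` and the mask convention are hypothetical, and the real model is a video diffusion network rather than this NumPy snippet.

```python
import numpy as np

def build_umc_condition(frames, known_idx):
    """Toy sketch of masked conditioning (hypothetical helper).

    frames:    (T, H, W, 3) RGB video array.
    known_idx: indices of frames the model may condition on.
    Returns (cond, mask): cond zeroes out unobserved frames;
    mask is (T, 1, 1, 1), 1.0 = observed, 0.0 = to be generated.
    """
    T = frames.shape[0]
    mask = np.zeros((T, 1, 1, 1), dtype=frames.dtype)
    mask[list(known_idx)] = 1.0
    cond = frames * mask  # broadcasting hides unobserved frames
    return cond, mask

# One model, several tasks, selected purely by the mask:
T, H, W = 8, 4, 4
video = np.random.rand(T, H, W, 3).astype(np.float32)

# Single image -> 4D generation: only frame 0 is observed.
cond_gen, mask_gen = build_umc_condition(video, known_idx=[0])

# Sparse frames -> 4D: a few keyframes are observed.
cond_sparse, mask_sparse = build_umc_condition(video, known_idx=[0, 3, 7])

# Complete video -> 4D reconstruction: every frame is observed.
cond_rec, mask_rec = build_umc_condition(video, known_idx=range(T))
```

In this sketch the generation/reconstruction distinction is not a model switch but a data-layout choice, which is the property the article attributes to UMC: one set of weights smoothly covers the whole spectrum from generation to reconstruction.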
One Model Unifies 4D World Generation and Reconstruction: HKUST's One4D Framework Arrives
具身智能之心 (Heart of Embodied Intelligence) · 2026-01-14 02:02