One Model Unifies 4D World Generation and Reconstruction: HKUST's One4D Framework Arrives
具身智能之心· 2026-01-14 02:02
Core Insights
- The article discusses advances in video diffusion models, focusing on the One4D framework developed by a research team at the Hong Kong University of Science and Technology (HKUST), which aims to unify 4D generation and reconstruction tasks [3][7].

Group 1: Background and Framework
- Video diffusion models have made significant progress in realism, dynamics, and controllability, but they often lack explicit modeling of 3D geometry, which limits their use in world-model-driven tasks [3].
- One4D is introduced as a unified framework for 4D generation and reconstruction, capable of synchronously outputting RGB videos and pointmaps (XYZ geometry videos) [3][7].
- The framework supports various input forms, including single images, sparse frames, and complete videos, for 4D generation and reconstruction [8].

Group 2: Key Features of One4D
- One4D produces multi-modal output (RGB and pointmaps) and employs Decoupled LoRA Control (DLC) to keep RGB quality stable while learning geometric alignment [7][10].
- Unified Masked Conditioning (UMC) allows One4D to handle different types of conditions in a single model, enabling smooth transitions between generation and reconstruction tasks [14][16].

Group 3: Training Data and Methodology
- Training One4D requires large-scale paired "appearance-geometry" data; a mix of synthetic and real data ensures both geometric accuracy and a realistic visual distribution [16].
- Synthetic data is generated through game-engine rendering, providing stable supervision for pointmaps, while real data is sourced from publicly available videos, supplemented with geometric annotations from existing 4D reconstruction methods [17].

Group 4: Experimental Results
- One4D outperforms the 4DNeX model in user preference studies across dimensions including consistency, dynamics, aesthetics, depth quality, and overall 4D coherence [19][20].
- In complete-video-to-4D reconstruction, One4D surpasses reconstruction-only methods such as MonST3R and CUT3R, demonstrating effective geometry reconstruction [22][24].
- The model also generates 4D structures from sparse video frames, indicating its potential for dynamic scene generation [29][30].

Group 5: Conclusion
- One4D extends video diffusion models to generate appearance and geometry simultaneously, addressing critical stability and alignment issues in multi-task training [31].
- The framework represents a significant step toward 4D worlds that can be understood and interacted with, providing foundational capabilities for next-generation world models and multi-modal content creation [31].
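The Decoupled LoRA Control idea described above — keeping the pretrained RGB pathway stable while a separate low-rank branch learns the geometry modality — can be sketched as two LoRA adapters attached to a frozen projection, routed by a token-level modality mask. This is a minimal illustrative sketch, not One4D's actual implementation: all class names, shapes, and the routing scheme are assumptions for exposition.

```python
import torch
import torch.nn as nn

class LoRA(nn.Module):
    """Low-rank adapter: output = (alpha / r) * up(down(x))."""
    def __init__(self, dim, rank=8, alpha=16.0):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.up(self.down(x)) * self.scale

class DecoupledLoRALinear(nn.Module):
    """Frozen base projection plus two modality-specific LoRA branches.

    RGB tokens go through their own adapter so appearance stays close to
    the pretrained backbone, while geometry (pointmap) tokens get a
    separate adapter that learns the new modality — decoupling the two
    update paths reduces cross-modal interference.
    """
    def __init__(self, dim, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():
            p.requires_grad = False  # backbone weights stay frozen
        self.rgb_lora = LoRA(dim, rank)
        self.geo_lora = LoRA(dim, rank)

    def forward(self, tokens, is_geo):
        # tokens: (B, N, dim); is_geo: (B, N) bool marking pointmap tokens
        out = self.base(tokens)
        mask = is_geo.unsqueeze(-1).float()
        return out + (1 - mask) * self.rgb_lora(tokens) + mask * self.geo_lora(tokens)

tokens = torch.randn(2, 6, 32)
is_geo = torch.tensor([[0, 0, 0, 1, 1, 1]] * 2, dtype=torch.bool)
layer = DecoupledLoRALinear(32)
out = layer(tokens, is_geo)
```

Because the up-projection is zero-initialized, the layer initially reproduces the frozen backbone exactly; training then only moves the adapter weights, which is what keeps the RGB branch from degrading while geometry is learned.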
One Model Unifies 4D World Generation and Reconstruction: HKUST's One4D Framework Arrives
机器之心· 2026-01-13 00:12
Group 1
- The core contribution is One4D, a unified framework for 4D generation and reconstruction that addresses the limitations of existing video diffusion models by outputting RGB videos and geometric pointmaps simultaneously [4][32].
- One4D integrates appearance (RGB) and geometry (pointmap/depth/camera trajectory) within a single framework, facilitating the transition toward a 4D world model [32][33].
- The framework relies on two key innovations: Decoupled LoRA Control (DLC), which reduces cross-modal interference, and Unified Masked Conditioning (UMC), which handles varied input types seamlessly [10][17].

Group 2
- One4D supports three input regimes: single image to 4D generation, sparse video frames to 4D generation and reconstruction, and complete video to 4D reconstruction [9].
- Training combines synthetic and real data to ensure both geometric accuracy and visual diversity, achieving effective results with 34,000 videos trained on 8 NVIDIA H800 GPUs over 5,500 steps [20].
- User studies indicate that One4D outperforms existing methods in consistency, dynamic quality, aesthetics, depth quality, and overall 4D coherence, with significant improvements across metrics [21][22].

Group 3
- Given sparse video frames, One4D generates the missing RGB frames and completes the geometric sequence even under extreme sparsity, showcasing its capability for dynamic 4D scene generation [30][31].
- One4D also excels at full-video 4D reconstruction, outperforming dedicated reconstruction methods on benchmark datasets such as Sintel and Bonn [25][26].
- Its camera trajectory estimation is validated on datasets such as Sintel and TUM, further demonstrating its effectiveness in unified generation and reconstruction tasks [28][29].
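The Unified Masked Conditioning described above — one model covering single-image generation, sparse-frame completion, and full-video reconstruction — can be sketched as a conditioning tensor where observed frames are kept, unobserved frames are zeroed, and a binary mask channel tells the model which is which. This is a minimal sketch under assumed tensor shapes; the function name and channel layout are illustrative, not One4D's actual interface.

```python
import torch

def unified_masked_conditioning(video, cond_idx):
    """Build one conditioning tensor that spans generation and reconstruction.

    Frames listed in cond_idx are provided as conditions; all other frames
    are zero-filled. A binary mask channel marks which frames are observed,
    so the same model sees: one conditioned frame  -> image-to-4D generation,
    a few conditioned frames -> sparse-frame completion,
    all frames conditioned   -> full-video 4D reconstruction.
    """
    B, T, C, H, W = video.shape
    cond = torch.zeros_like(video)
    mask = torch.zeros(B, T, 1, H, W)
    for t in cond_idx:
        cond[:, t] = video[:, t]
        mask[:, t] = 1.0
    # Concatenate along the channel axis: [condition frames | observed mask]
    return torch.cat([cond, mask], dim=2)

video = torch.randn(1, 8, 3, 16, 16)
gen_input = unified_masked_conditioning(video, [0])            # single image -> generation
recon_input = unified_masked_conditioning(video, range(8))     # full video  -> reconstruction
```

The appeal of this formulation is that task identity is carried entirely by the mask, so interpolating between "generate" and "reconstruct" (e.g. sparse frames) needs no architectural change.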