One Model Unifies 4D World Generation and Reconstruction: HKUST's One4D Framework Arrives
机器之心 · 2026-01-13 00:12
Group 1
- The article introduces One4D, a unified framework for 4D generation and reconstruction that addresses a limitation of existing video diffusion models by outputting RGB videos and geometric pointmaps simultaneously [4][32].
- One4D integrates appearance (RGB) and geometry (pointmap/depth/camera trajectory) within a single framework, a step toward a 4D world model [32][33].
- The framework rests on two key innovations: Decoupled LoRA Control (DLC), which reduces cross-modal interference, and Unified Masked Conditioning (UMC), which handles varied input types through one interface [10][17].

Group 2
- One4D supports three input regimes: a single image for 4D generation, sparse video frames for joint 4D generation and reconstruction, and a complete video for 4D reconstruction [9].
- Training combines a large-scale mix of synthetic and real data to balance geometric accuracy with visual diversity; the model was trained on roughly 34,000 videos for 5,500 steps on 8 NVIDIA H800 GPUs [20].
- User studies indicate that One4D outperforms existing methods in consistency, dynamic quality, aesthetics, depth quality, and overall 4D coherence, with significant improvements across metrics [21][22].

Group 3
- Given sparse video frames, even under extreme sparsity, One4D generates the missing RGB frames and completes the geometric sequence, demonstrating dynamic 4D scene generation [30][31].
- On full-video 4D reconstruction, One4D outperforms dedicated reconstruction methods on benchmark datasets such as Sintel and Bonn, indicating robust performance across tasks [25][26].
- The framework's camera-trajectory estimation is validated on datasets such as Sintel and TUM, further evidence of its effectiveness across unified generation and reconstruction tasks [28][29].
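The two mechanisms named above can be sketched in a few lines of PyTorch. This is a minimal illustration of the general ideas only, assuming a frozen backbone with per-modality low-rank adapters for DLC and mask-as-extra-channel input packing for UMC; all class names, shapes, and parameters here are hypothetical and are not One4D's actual code.

```python
import torch
import torch.nn as nn

class DecoupledLoRALinear(nn.Module):
    """One frozen shared linear layer plus a separate low-rank (LoRA)
    adapter per modality, so RGB and geometry updates flow through
    disjoint parameters and do not interfere. Hypothetical sketch of
    the Decoupled LoRA Control (DLC) idea described in the article."""

    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():          # freeze the shared backbone
            p.requires_grad_(False)
        self.lora = nn.ModuleDict()
        for m in ("rgb", "geom"):                 # one adapter per modality
            down = nn.Linear(dim, rank, bias=False)
            up = nn.Linear(rank, dim, bias=False)
            nn.init.zeros_(up.weight)             # adapters start as a no-op
            self.lora[m] = nn.Sequential(down, up)

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        return self.base(x) + self.lora[modality](x)


def unified_masked_conditioning(frames: torch.Tensor,
                                known: torch.Tensor) -> torch.Tensor:
    """Pack any of the three input regimes (single image, sparse frames,
    full video) into one conditioning tensor: zero out unobserved frames
    and append the observation mask as an extra channel (an assumed
    UMC-style packing, not the paper's exact formulation).

    frames: (T, C, H, W) video; known: (T,) boolean mask of observed frames.
    """
    mask = known.float().view(-1, 1, 1, 1)
    cond = frames * mask                          # unknown frames zeroed
    mask_ch = mask.expand(-1, 1, *frames.shape[2:])
    return torch.cat([cond, mask_ch], dim=1)      # (T, C + 1, H, W)
```

With this packing, a single image is a mask with one `True` entry, sparse frames have a few, and a full video has all of them, so the same network interface covers generation, joint generation/reconstruction, and pure reconstruction.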