3D Scene Generation
World models generated in seconds on a single GPU? Tencent open-sources FlashWorld: impressive results, free to try
机器之心 · 2025-10-30 08:52
Core Insights
- The collaboration between Xiamen University and Tencent has produced a highly regarded paper titled "FlashWorld: High-quality 3D Scene Generation within Seconds," which has drawn significant attention both domestically and internationally, ranking first on the Hugging Face Daily Papers list and receiving endorsements from prominent AI figures [2][4]

Group 1: FlashWorld's Performance
- FlashWorld generates a 3D scene in 5 to 10 seconds on a single GPU, a speedup of up to 100x over previous methods [4]
- The generated scenes can be rendered in real time in a web user interface, with quality surpassing that of other closed-source models [4]
- In comparative tests, FlashWorld produced stable, complete, and high-quality renderings, running five times faster than Marble's quick mode and, unlike RTFM, requiring no backend GPU connection [6][10]

Group 2: Technical Approach
- FlashWorld adopts a 3DGS-based route for scene output, enabling local rendering in the browser, a significant advantage over video models that impose heavy compute loads [8]
- The method combines a multi-view diffusion model with a 3D-oriented generation mode; a distillation process improves visual quality while enforcing multi-view consistency and cutting the number of denoising steps (a hedged sketch of such a distillation loop follows this summary) [10][12]
- Training consists of dual-mode pre-training followed by cross-mode post-training, which improves generalization across scenes, styles, and camera trajectories without requiring ground-truth data [13][16]

Group 3: Experimental Results
- FlashWorld generates structured scenes, such as fences, that were previously difficult to achieve [18]
- The model excels at fine details, such as hair, generated from text inputs, demonstrating its capability in dense-view reconstruction [21]
- In benchmark tests, FlashWorld outperformed other methods in both speed and quality, achieving the highest average scores across the qualitative metrics [23][24]
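The Group 2 bullets above attribute much of the speedup to a distillation process that reduces the number of denoising steps while preserving multi-view consistency. The article gives no implementation details, so the sketch below is only a generic few-step distillation loop in PyTorch, not FlashWorld's actual method; the names TinyDenoiser, teacher_sample, and distill_step, the toy Euler-style update, and all shapes are hypothetical placeholders.

```python
# Hypothetical sketch of few-step distillation for a multi-view denoiser.
# Names, shapes, and the update rule are illustrative, not FlashWorld's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in for a multi-view denoising network (views stacked on the batch dim)."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x, t):
        # Broadcast the timestep as an extra conditioning channel.
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x.shape[-2:])
        return self.net(torch.cat([x, t_map], dim=1))

def teacher_sample(teacher, x_t, steps=50):
    """Many-step denoising with the frozen teacher: slow but higher quality."""
    for i in reversed(range(1, steps + 1)):
        t = torch.full((x_t.shape[0],), i / steps, device=x_t.device)
        x_t = x_t - teacher(x_t, t) / steps   # toy Euler-style update
    return x_t

def distill_step(student, teacher, optimizer, batch_views, few_steps=4):
    """One distillation update: the student mimics the teacher in far fewer steps."""
    noise = torch.randn_like(batch_views)     # batch_views only fixes the shape here
    with torch.no_grad():
        target = teacher_sample(teacher, noise, steps=50)
    x = noise
    for i in reversed(range(1, few_steps + 1)):
        t = torch.full((x.shape[0],), i / few_steps, device=x.device)
        x = x - student(x, t) / few_steps
    loss = F.mse_loss(x, target)              # match the teacher's multi-view output
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Example usage on dummy multi-view latents (6 views x 4 x 32 x 32).
teacher, student = TinyDenoiser().eval(), TinyDenoiser()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
views = torch.randn(6, 4, 32, 32)
print(distill_step(student, teacher, opt, views))
```

The point of such a setup is that the student performs only a handful of denoising steps at inference time, which is consistent with the article's claim that fewer denoising steps are what make second-level generation possible.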
OccScene from Shanghai Jiao Tong University: A New Framework for 3D Occupancy Generation (TPAMI)
自动驾驶之心 · 2025-10-23 00:04
Core Insights
- The article discusses the integration of generative models with autonomous driving systems, emphasizing the need for high-quality, large-scale annotated data for training perception models, which is often costly and time-consuming to obtain [2]
- OccScene is introduced as a solution that couples 3D scene generation with semantic occupancy perception in a novel joint diffusion framework, achieving a synergistic effect in which the two tasks enhance each other [3]

Innovation and Contributions
- A unified perception-generation framework is proposed, in which the perception model provides detailed geometric and semantic priors to the generator, creating a beneficial feedback loop [5]
- A Mamba-based dual alignment module (MDA) is designed to efficiently align camera trajectories, semantic occupancy, and diffusion features, ensuring cross-view consistency and geometric accuracy in the generated content [5]
- OccScene achieves state-of-the-art (SOTA) performance, generating high-quality images/videos and the corresponding 3D semantic occupancy from text prompts alone, and significantly enhances existing SOTA perception models [5]
- The mutual learning mechanism encourages the model to find broader and more stable loss minima, avoiding the local-minima stagnation seen in independent learning [5]

Comparison with Traditional Methods
- OccScene employs a joint learning framework that promotes bidirectional enhancement, unlike traditional methods that treat generation and perception separately [7]
- It requires only text prompts for flexible scene generation, whereas traditional methods rely on real annotated data [7]
- OccScene provides fine-grained semantic occupancy guidance for more precise geometry, moving beyond the coarse geometric control of traditional approaches [7]
- The generation process is driven by perception tasks, ensuring the practical utility of the generated data [7]

Technical Framework
- The core of OccScene is the joint perception-generation diffusion framework, which integrates semantic occupancy prediction and text-driven generation into a single diffusion process [8]
- The training strategy consists of two phases: first, tuning the generator to respect occupancy constraints, and second, mutual learning to achieve bidirectional enhancement [9][10]
- A dynamically weighted loss function balances the two tasks during joint optimization, keeping training stable (a hedged sketch of such a loss follows this summary) [11][13]

Experimental Results
- OccScene achieves SOTA performance in 3D scene generation across various tasks, with significantly lower FID scores than traditional methods, indicating better quality [20][21]
- The generated scenes exhibit more reasonable geometry and clearer details while maintaining high logical consistency in cross-view videos [20][23]
- Using OccScene as a data augmentation strategy significantly improves the performance of existing SOTA perception models, demonstrating the quality and information richness of the synthetic data [24][25]

Applications and Value
- OccScene is positioned as a critical tool for autonomous driving simulation, generating high-fidelity, diverse driving scenarios, particularly corner cases, to enhance system robustness at low cost [32]
- It provides controllable and editable virtual environments for navigation and interaction in robotics and AR/VR applications [32]
- As a plug-and-play data generator, OccScene addresses data scarcity for a range of downstream 3D vision tasks [32]
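The Technical Framework bullets above mention a dynamically weighted loss that balances the generation and perception objectives during joint optimization. The article does not give OccScene's exact formulation, so the sketch below is a minimal illustration using learned uncertainty-based weighting, one standard way to adapt the balance between two task losses during training; the class name DynamicJointLoss and the placeholder loss values are assumptions.

```python
# Hypothetical sketch of a dynamically weighted joint loss for two tasks
# (diffusion-based generation + semantic occupancy perception). This uses
# learned homoscedastic-uncertainty weighting as one common balancing scheme;
# OccScene's actual weighting rule is not specified in the article and may differ.
import torch
import torch.nn as nn

class DynamicJointLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # Log-variances act as learnable task weights (larger variance -> smaller weight).
        self.log_var_gen = nn.Parameter(torch.zeros(()))
        self.log_var_occ = nn.Parameter(torch.zeros(()))

    def forward(self, loss_gen, loss_occ):
        w_gen = torch.exp(-self.log_var_gen)
        w_occ = torch.exp(-self.log_var_occ)
        # The additive log-variance terms keep the weights from collapsing to zero.
        return w_gen * loss_gen + w_occ * loss_occ + 0.5 * (self.log_var_gen + self.log_var_occ)

# Example usage inside a joint training step (losses here are placeholder scalars).
criterion = DynamicJointLoss()
opt = torch.optim.Adam(criterion.parameters(), lr=1e-4)  # plus model parameters in practice

loss_gen = torch.tensor(0.8, requires_grad=True)   # e.g. diffusion denoising loss
loss_occ = torch.tensor(1.3, requires_grad=True)   # e.g. occupancy cross-entropy
total = criterion(loss_gen, loss_occ)
opt.zero_grad(); total.backward(); opt.step()
print(float(total))
```

Uncertainty-based weighting is only one possible choice; the property the article attributes to OccScene is simply that the balance between the generation and perception terms adapts during training rather than being fixed by hand.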