SJTU's OccScene: A New Framework for 3D Occupancy Generation (TPAMI)
自动驾驶之心· 2025-10-23 00:04
Core Insights

- The article discusses the integration of generative models with autonomous driving systems, emphasizing that training perception models requires high-quality, large-scale annotated data, which is costly and time-consuming to collect [2]
- OccScene is introduced as a solution that couples 3D scene generation with semantic occupancy perception in a novel joint diffusion framework, producing a synergistic effect in which the two tasks enhance each other [3]

Innovation and Contributions

- A unified perception-generation framework is proposed in which the perception model supplies detailed geometric and semantic priors to the generator, forming a beneficial feedback loop (a minimal sketch follows this digest) [5]
- A Mamba-based dual alignment module (MDA) efficiently aligns camera trajectories, semantic occupancy, and diffusion features, ensuring cross-view consistency and geometric accuracy in the generated content (see the MDA sketch below) [5]
- OccScene achieves state-of-the-art (SOTA) performance, generating high-quality images/videos together with the corresponding 3D semantic occupancy from text prompts alone, and significantly improves existing SOTA perception models [5]
- The mutual-learning mechanism steers the model toward broader, more stable loss minima, avoiding the stagnation in local minima seen when the two tasks are trained independently [5]

Comparison with Traditional Methods

- OccScene employs a joint learning framework that promotes bidirectional enhancement, unlike traditional methods that treat generation and perception separately [7]
- It requires only text prompts for flexible scene generation, whereas traditional methods rely on real annotated data [7]
- Fine-grained semantic occupancy guidance yields more precise geometry than the coarse geometric control of traditional approaches [7]
- The generation process is driven by the perception task, which keeps the generated data practically useful [7]

Technical Framework

- The core of OccScene is the joint perception-generation diffusion framework, which integrates semantic occupancy prediction and text-driven generation into a single diffusion process [8]
- Training proceeds in two phases: first the generator is tuned to respect occupancy constraints, then mutual learning achieves bidirectional enhancement (see the training-loop sketch below) [9][10]
- A dynamically weighted loss balances the two tasks during joint optimization, keeping training stable (see the weighting sketch below) [11][13]

Experimental Results

- OccScene achieves SOTA 3D scene generation across a range of tasks, with markedly lower FID scores than prior methods, indicating higher visual fidelity [20][21]
- Generated scenes exhibit more plausible geometry and clearer detail, and cross-view videos remain logically consistent [20][23]
- Used as a data augmentation source, OccScene significantly improves existing SOTA perception models, demonstrating the quality and information richness of the synthetic data (see the data-mixing example below) [24][25]

Applications and Value

- OccScene is positioned as a key tool for autonomous driving simulation, generating high-fidelity, diverse driving scenarios, particularly corner cases, to strengthen system robustness at low cost [32]
- It provides controllable, editable virtual environments for navigation and interaction in robotics and AR/VR applications [32]
- As a plug-and-play data generator, OccScene addresses data scarcity in a range of downstream 3D vision tasks [32]
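
To make the joint perception-generation idea concrete, here is a minimal PyTorch sketch of a single training step in which the diffusion backbone and the occupancy head share intermediate features, so each task's loss shapes the other's representation. All module names, tensor shapes, and the class/depth counts are illustrative assumptions; the digest does not specify OccScene's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoiserUNet(nn.Module):
    """Stand-in for the text-conditioned latent-diffusion backbone."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc = nn.Conv2d(4, ch, 3, padding=1)
        self.dec = nn.Conv2d(ch, 4, 3, padding=1)

    def forward(self, z_t, t_emb, text_emb):
        # Fold timestep and text embeddings into the feature map (simplified).
        cond = (t_emb + text_emb)[:, :, None, None]
        feat = F.silu(self.enc(z_t) + cond)
        return self.dec(feat), feat  # predicted noise + shared features

class OccupancyHead(nn.Module):
    """Predicts per-voxel semantics from intermediate diffusion features."""
    def __init__(self, ch=64, n_classes=17, depth=16):
        super().__init__()
        self.proj = nn.Conv2d(ch, n_classes * depth, 1)
        self.n_classes, self.depth = n_classes, depth

    def forward(self, feat):
        b, _, h, w = feat.shape
        logits = self.proj(feat)
        return logits.view(b, self.n_classes, self.depth, h, w)

def joint_step(unet, occ_head, z_t, t_emb, text_emb, noise, occ_gt):
    """One denoising step in which both tasks share features and losses."""
    eps_pred, feat = unet(z_t, t_emb, text_emb)
    occ_logits = occ_head(feat)                     # perception branch
    loss_gen = F.mse_loss(eps_pred, noise)          # diffusion objective
    loss_occ = F.cross_entropy(occ_logits, occ_gt)  # occupancy objective
    return loss_gen, loss_occ
```

Because the occupancy head reads the generator's intermediate features, its gradients flow back into the backbone, which is one way to realize the "perception provides priors to generation" feedback loop described above.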
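The MDA module is only named in the digest, so the sketch below conveys the dual-alignment idea rather than the real design: per-view features are ordered along the camera trajectory, fused with pose and occupancy embeddings, and mixed by a linear-time sequence model. A GRU stands in for the Mamba selective scan here; the interface and shapes are assumptions.

```python
import torch
import torch.nn as nn

class MDASketch(nn.Module):
    """Illustrative dual alignment: mix trajectory-ordered view features
    with pose and occupancy embeddings for cross-view consistency."""
    def __init__(self, ch=64):
        super().__init__()
        self.seq = nn.GRU(ch, ch, batch_first=True)  # placeholder for Mamba
        self.occ_proj = nn.Linear(ch, ch)            # semantic-occupancy embedding
        self.pose_proj = nn.Linear(6, ch)            # per-view 6-DoF camera pose

    def forward(self, view_feats, occ_emb, poses):
        # view_feats: (B, V, ch) pooled features, V views in trajectory order
        # occ_emb:    (B, V, ch) occupancy embeddings per view
        # poses:      (B, V, 6)  camera poses along the trajectory
        x = view_feats + self.pose_proj(poses) + self.occ_proj(occ_emb)
        aligned, _ = self.seq(x)  # cross-view mixing along the trajectory
        return aligned
```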
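The digest does not give the exact form of the dynamic weighted loss, so the sketch below substitutes a common technique for balancing two task losses, homoscedastic-uncertainty weighting (Kendall et al., 2018), as a labeled stand-in.

```python
import torch
import torch.nn as nn

class DynamicTaskWeighting(nn.Module):
    """Learnable balancing of the generation and perception losses.
    Stand-in formulation, not OccScene's published loss."""
    def __init__(self):
        super().__init__()
        self.log_var_gen = nn.Parameter(torch.zeros(()))
        self.log_var_occ = nn.Parameter(torch.zeros(()))

    def forward(self, loss_gen, loss_occ):
        # exp(-log_var) down-weights the noisier task; +log_var regularizes
        # the weights so neither collapses to zero.
        return (torch.exp(-self.log_var_gen) * loss_gen + self.log_var_gen
                + torch.exp(-self.log_var_occ) * loss_occ + self.log_var_occ)
```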
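A condensed sketch of the two-phase schedule, reusing `joint_step` and `DynamicTaskWeighting` from the sketches above. The freezing policy (perception branch frozen in phase 1, everything trained in phase 2) is an assumption inferred from the digest's description, not a confirmed detail.

```python
def train_phase(unet, occ_head, weighting, loader, optimizer, phase):
    """Phase 1: tune the generator to respect occupancy constraints while
    the perception branch stays frozen. Phase 2: mutual learning, with both
    branches optimized under the dynamic task weighting."""
    for p in occ_head.parameters():
        p.requires_grad = (phase == 2)
    for z_t, t_emb, text_emb, noise, occ_gt in loader:
        loss_gen, loss_occ = joint_step(
            unet, occ_head, z_t, t_emb, text_emb, noise, occ_gt)
        loss = loss_gen if phase == 1 else weighting(loss_gen, loss_occ)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In practice the optimizer would cover the backbone, the occupancy head, and the weighting parameters, with phase 1 run to convergence before phase 2 begins.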
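Finally, the data-mixing example referenced under Experimental Results: using OccScene-generated (image, occupancy) pairs as plug-and-play augmentation simply means pooling them with real data when training a downstream perception model. Dataset names, shapes, and sizes below are illustrative stand-ins.

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

# Stand-ins for a real driving dataset and a batch of OccScene-generated
# (image, semantic-occupancy) pairs; shapes and sizes are illustrative.
real  = TensorDataset(torch.randn(100, 3, 64, 64),
                      torch.randint(0, 17, (100, 16, 64, 64)))
synth = TensorDataset(torch.randn(50, 3, 64, 64),
                      torch.randint(0, 17, (50, 16, 64, 64)))

# Downstream perception models then train on the mixed pool as usual.
loader = DataLoader(ConcatDataset([real, synth]), batch_size=8, shuffle=True)
```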