Accepted at NeurIPS: Genesis pioneers a new paradigm for multimodal generation without OCC guidance, reaching SOTA on video and LiDAR metrics
机器之心· 2025-09-28 04:50
Core Insights
- The article discusses the Genesis framework, a multimodal image-point cloud generation algorithm developed by Huazhong University of Science and Technology and Xiaomi Auto, which generates realistic driving-scene data without requiring occupancy (OCC) guidance [2][4]

Group 1: Genesis Framework Overview
- Genesis employs a two-stage architecture: the first stage uses a perspective-projection layout and scene descriptions to learn 3D features, while the second stage converts multi-view video sequences into a bird's-eye-view (BEV) feature space [4] (a minimal pipeline sketch follows this summary)
- The framework introduces DataCrafter, a data-annotation module built on vision-language models (VLMs), to supply structured semantic information that guides the generation process [10][13]

Group 2: Challenges in Current Driving Scene Generation
- Existing methods focus mainly on single-modal generation, producing either RGB video or LiDAR point clouds, which limits deep collaboration and consistent expression between the visual and geometric modalities [7][8]
- The high cost of obtaining OCC labels in real-world driving scenarios restricts industrial adoption of existing multimodal generation models [8]

Group 3: DataCrafter Module
- DataCrafter filters training data and extracts structured semantic information, ensuring that only high-quality segments are used for training while providing detailed semantic guidance for the generation tasks [13][18]
- The module scores video segments on visual attributes such as clarity, structural coherence, and aesthetic quality, retaining only segments whose scores clear a set threshold [15] (a hedged scoring sketch appears below)

Group 4: Video Generation Model
- The video generation model within Genesis integrates scene-layout information and language descriptions through attention mechanisms, enhancing the semantic expression of dynamic scenes [19]
- Innovations include using YOLOv8x-Pose to detect pedestrian poses, which are then projected across the camera views to improve the realism of generated driving scenes [19] (a generic projection sketch appears below)

Group 5: Performance Metrics
- On the nuScenes dataset, Genesis achieved a multi-frame FVD of 83.10 and a multi-frame FID of 14.90 without first-frame conditioning, outperforming previous methods [26] (both metrics reduce to a Fréchet distance, sketched below)
- For LiDAR generation, Genesis reached a Chamfer distance of 0.611 at the 1-second prediction horizon, surpassing the previous best by 21% [27] (a Chamfer-distance sketch closes this piece)

Group 6: Downstream Task Evaluation
- Data generated by Genesis was evaluated on downstream perception tasks, improving mean Average Precision (mAP) and the nuScenes Detection Score (NDS) across various settings [30]
- Jointly generating the camera and LiDAR modalities yielded the largest gains, demonstrating the complementary advantages of multimodal generation [30]
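To make Group 1's two-stage description concrete, here is a minimal PyTorch sketch: stage 1 fuses a perspective-projected layout with a scene-description embedding into conditioning features, and stage 2 pools multi-view features into a BEV grid. All module names, dimensions, and the mean-pool fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of a Genesis-style two-stage pipeline (names invented).
import torch
import torch.nn as nn

class Stage1ConditionEncoder(nn.Module):
    """Stage 1: fuse a perspective-projected layout with a scene-description
    embedding to produce 3D-aware conditioning features."""
    def __init__(self, layout_ch=16, text_dim=256, cond_dim=256):
        super().__init__()
        self.layout_net = nn.Sequential(
            nn.Conv2d(layout_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, cond_dim, 3, padding=1),
        )
        self.text_proj = nn.Linear(text_dim, cond_dim)

    def forward(self, layout, text_emb):
        # layout: (B, layout_ch, H, W); text_emb: (B, text_dim)
        feat = self.layout_net(layout)                    # (B, C, H, W)
        txt = self.text_proj(text_emb)[:, :, None, None]  # broadcast over H, W
        return feat + txt

class Stage2BEVLifter(nn.Module):
    """Stage 2: pool per-view features from a multi-view sequence into a shared
    BEV grid. A real lift would use camera geometry; mean pooling is a crude
    stand-in for illustration."""
    def __init__(self, bev_size=50):
        super().__init__()
        self.to_bev = nn.AdaptiveAvgPool2d(bev_size)

    def forward(self, view_feats):
        # view_feats: (B, n_views, C, H, W) -> (B, C, bev, bev)
        return self.to_bev(view_feats.mean(dim=1))

# Toy usage: 6 camera views, as in nuScenes.
enc, lifter = Stage1ConditionEncoder(), Stage2BEVLifter()
cond = enc(torch.randn(2, 16, 56, 100), torch.randn(2, 256))
bev = lifter(cond.unsqueeze(1).repeat(1, 6, 1, 1, 1))
print(cond.shape, bev.shape)  # (2, 256, 56, 100) and (2, 256, 50, 50)
```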
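For Group 3, a hedged sketch of DataCrafter-style clip filtering: each video segment receives per-attribute scores (stubbed below; the real module uses a VLM-based judge), and only segments whose aggregate score clears a threshold are kept. The attribute names, the unweighted mean, and the 0.7 threshold are illustrative assumptions, not values from the paper.

```python
# Illustrative DataCrafter-style quality filtering (thresholds are assumed).
from dataclasses import dataclass

@dataclass
class ClipScore:
    clarity: float
    coherence: float
    aesthetics: float

    def aggregate(self) -> float:
        # Simple mean; the paper may weight attributes differently.
        return (self.clarity + self.coherence + self.aesthetics) / 3.0

def filter_segments(segments, scorer, threshold=0.7):
    """Keep only segments whose aggregate quality score reaches `threshold`."""
    return [seg for seg in segments if scorer(seg).aggregate() >= threshold]

# Toy usage with a stub scorer standing in for the VLM judge.
stub = lambda seg: ClipScore(clarity=0.9, coherence=0.8, aesthetics=0.6)
print(filter_segments(["clip_a", "clip_b"], stub))  # both pass (mean ~0.77)
```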
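For Group 4's cross-view pose sharing, the detector (YOLOv8x-Pose) returns 2D keypoints; how Genesis lifts them to 3D and distributes them across views is not detailed in this summary, so the sketch below shows only the generic world-to-pixel projection step with a standard pinhole model.

```python
# Generic pinhole projection of 3D keypoints into one camera view.
import numpy as np

def project_to_view(pts_world, K, T_cam_from_world):
    """pts_world: (N, 3) points in the ego/world frame.
    K: (3, 3) camera intrinsics. T_cam_from_world: (4, 4) extrinsics.
    Returns (N, 2) pixel coordinates; points behind the camera get NaN."""
    pts_h = np.hstack([pts_world, np.ones((len(pts_world), 1))])  # (N, 4)
    cam = (T_cam_from_world @ pts_h.T).T[:, :3]                   # camera frame
    uv = (K @ cam.T).T
    z = uv[:, 2:3]
    return np.where(z > 0, uv[:, :2] / z, np.nan)

# Toy usage: one keypoint, identity extrinsics, simple intrinsics.
K = np.array([[800.0, 0, 640], [0, 800.0, 360], [0, 0, 1]])
print(project_to_view(np.array([[1.0, 0.5, 10.0]]), K, np.eye(4)))
# -> [[720. 400.]]
```

Repeating this call once per camera, with each camera's own K and extrinsics, yields the multi-view keypoint layouts the bullet describes.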
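On Group 5's video metrics: FID and FVD both reduce to the Fréchet distance between two Gaussians fitted to feature embeddings (Inception features for FID, a video network's features for FVD). A sketch of that final formula, given precomputed means and covariances:

```python
# Frechet distance between two Gaussians, the core of FID/FVD.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Toy usage: identical Gaussians give distance ~0.
mu, sigma = np.zeros(8), np.eye(8)
print(frechet_distance(mu, sigma, mu, sigma))  # ~0.0
```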
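And for the LiDAR metric in Group 5, a sketch of the symmetric Chamfer distance: the average nearest-neighbor distance from each point cloud to the other. Whether the paper uses squared distances or this exact symmetrization is an assumption.

```python
# Symmetric Chamfer distance between two point clouds.
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (N, 3), b: (M, 3) point clouds."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

rng = np.random.default_rng(0)
a, b = rng.normal(size=(100, 3)), rng.normal(size=(120, 3))
print(chamfer_distance(a, b))
```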