Li Auto's Next-Generation World Model Achieves Real-Time Scene Editing and VLA Collaborative Planning for the First Time
理想TOP2 · 2025-06-11 02:59

Core Viewpoint
- GeoDrive is a next-generation world-model system for autonomous driving, developed jointly by Peking University, Berkeley AI Research (BAIR), and Li Auto. It addresses the limitations of existing methods, which rely on 2D modeling and lack 3D spatial perception, and can therefore produce implausible trajectories and distorted dynamic interactions [11][14].

Group 1: Key Innovations
- Geometric Condition-Driven Generation: Uses 3D rendering in place of numerical control signals, effectively eliminating the action-drift problem [6].
- Dynamic Editing Mechanism: Injects controllable motion into static point clouds, balancing efficiency and flexibility [7] (a minimal point-cloud sketch follows the conclusion).
- Minimal Training Cost: Freezes the backbone model and trains only lightweight adapters for data-efficient training [8] (see the adapter sketch below).
- Pioneering Applications: Achieves real-time scene editing and VLA (Vision-Language-Action) collaborative planning within a driving world model for the first time [9][10].

Group 2: Technical Details
- 3D Geometry Integration: The system constructs a 3D representation from a single RGB image, ensuring spatial consistency and a coherent scene structure [12][18].
- Dynamic Editing Module: Improves the realism of multi-vehicle interaction scenarios during training by allowing flexible adjustment of movable objects [12].
- Video Diffusion Architecture: Combines rendered conditional sequences with noise features to strengthen 3D geometric fidelity while preserving photorealistic quality [12][33] (see the conditioning sketch below).

Group 3: Performance Metrics
- GeoDrive substantially improves the controllability of driving world models, reducing trajectory-tracking error by 42% relative to the Vista model, and performs better across a range of video-quality metrics [19][34] (a metric sketch follows the conclusion).
- The model generalizes effectively to novel-view synthesis tasks, outperforming existing models such as StreetGaussian in video quality [19][38].

Group 4: Conclusion
- GeoDrive sets a new benchmark in autonomous driving by improving action controllability and spatial accuracy through explicit trajectory control and direct visual conditioning, while also supporting applications such as non-ego-vehicle perspective generation and scene editing [41].
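
The article describes the dynamic editing mechanism only at a high level. A minimal sketch of the underlying idea, rigidly transforming a movable object's points inside an otherwise static point cloud, might look like the following; all function and variable names here are illustrative, not GeoDrive's actual API.

```python
# Hypothetical sketch: inject controllable motion into a static point cloud
# by rigidly transforming only the points that belong to a movable object.
import numpy as np

def rotation_z(yaw: float) -> np.ndarray:
    """Rotation matrix about the vertical (z) axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def edit_object_motion(scene_pts, obj_mask, yaw, translation):
    """Move one object inside an otherwise static scene.

    scene_pts:   (N, 3) point cloud of the full scene
    obj_mask:    (N,) boolean mask selecting the movable object's points
    yaw:         heading change (radians), applied about the object's centroid
    translation: (3,) displacement applied after the rotation
    """
    pts = scene_pts.copy()
    obj = pts[obj_mask]
    center = obj.mean(axis=0)  # rotate about the object's own centroid
    obj = (obj - center) @ rotation_z(yaw).T + center + np.asarray(translation)
    pts[obj_mask] = obj        # static background points are untouched
    return pts

# Example: nudge a vehicle 2 m forward with a slight heading change.
scene = np.random.rand(1000, 3) * 50.0
mask = np.zeros(1000, dtype=bool); mask[:80] = True
edited = edit_object_motion(scene, mask, yaw=0.1, translation=[2.0, 0.0, 0.0])
```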
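The frozen-backbone strategy in Group 1 is a standard parameter-efficient fine-tuning pattern. A hedged PyTorch sketch of that pattern follows; the backbone stand-in and adapter design are assumptions, since the paper's exact architecture is not given in this summary.

```python
# Sketch: freeze a pretrained backbone and optimize only a small residual
# adapter, so training cost scales with the adapter, not the backbone.
import torch
import torch.nn as nn

class LightweightAdapter(nn.Module):
    """Small bottleneck adapter added as a residual on top of frozen features."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity (zero residual)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Stand-in for the pretrained video backbone (hypothetical choice).
backbone = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False  # backbone stays frozen

adapter = LightweightAdapter(dim=256)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)  # adapter-only updates

x = torch.randn(4, 16, 256)        # (batch, tokens, dim) dummy features
out = adapter(backbone(x))         # frozen features, trainable refinement
```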
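For the video diffusion architecture in Group 2, one common way to feed a rendered conditional sequence into a denoiser is channel-wise concatenation with the noisy latents; whether GeoDrive fuses the two streams exactly this way is an assumption, and the module below is a toy stand-in for the real network.

```python
# Sketch: fuse 3D-rendered condition frames with noisy video latents by
# concatenating along the channel axis before a small 3D-conv denoiser.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, latent_ch: int = 4, cond_ch: int = 4, hidden: int = 32):
        super().__init__()
        # 3D convs over (frames, height, width) keep conditioning per-frame.
        self.net = nn.Sequential(
            nn.Conv3d(latent_ch + cond_ch, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, latent_ch, 3, padding=1),
        )

    def forward(self, noisy_latents, rendered_cond):
        # noisy_latents / rendered_cond: (B, C, T, H, W)
        fused = torch.cat([noisy_latents, rendered_cond], dim=1)
        return self.net(fused)  # predicted noise (or x0 / v target)

denoiser = ConditionedDenoiser()
noisy = torch.randn(2, 4, 8, 32, 32)  # 8-frame latent video clip
cond = torch.randn(2, 4, 8, 32, 32)   # latents of the 3D-rendered condition frames
pred = denoiser(noisy, cond)          # same shape as noisy
```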
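The 42% trajectory-tracking improvement in Group 3 implies a displacement-style error between the trajectory realized in the generated video and the commanded one. A plain average displacement error (ADE) is one plausible formulation; the paper's exact metric definition is not stated in this summary.

```python
# Sketch: average displacement error (ADE) between matched trajectory points.
import numpy as np

def average_displacement_error(pred, target):
    """Mean Euclidean distance between corresponding trajectory points.

    pred, target: (T, 2) arrays of ego positions in the ground plane.
    """
    return float(np.linalg.norm(pred - target, axis=1).mean())

# Example: a commanded straight 40 m run vs. a noisy realized trajectory.
commanded = np.stack([np.linspace(0, 40, 20), np.zeros(20)], axis=1)
realized = commanded + np.random.normal(scale=0.3, size=commanded.shape)
print(f"ADE: {average_displacement_error(realized, commanded):.3f} m")
```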