Multimodal Spatiotemporal Transformer

Horizon Robotics & Tsinghua's Epona: An Autoregressive End-to-End World Model
自动驾驶之心· 2025-08-12 23:33
Core Viewpoint
- The article presents a unified framework for autonomous-driving world models that generates long-horizon, high-resolution video while providing real-time trajectory planning, addressing limitations of existing approaches [5][12].

Group 1: Existing Methods and Limitations
- Diffusion-based models such as Vista can only generate fixed-length videos (≤15 seconds) and struggle with flexible long-horizon prediction (>2 minutes) and multi-modal trajectory control [7].
- GPT-style autoregressive models such as GAIA-1 can roll out indefinitely, but discretizing images into tokens degrades visual quality, and they cannot generate continuous action trajectories [7][13].

Group 2: Proposed Methodology
- The proposed world model conditions on a sequence of past forward-camera observations and the corresponding driving trajectories to predict future driving dynamics [10].
- The framework decouples spatiotemporal modeling: causal attention in a GPT-style transformer handles temporal reasoning, while a pair of diffusion transformers handles spatial rendering and trajectory generation [12] (see the first sketch after this summary).
- An asynchronous multimodal generation mechanism produces a 3-second trajectory and the next frame in parallel, achieving 20 Hz real-time planning with a 90% reduction in inference compute [12] (second sketch after this summary).

Group 3: Model Structure and Training
- The Multimodal Spatiotemporal Transformer (MST) encodes past driving scenes and action sequences, with enhanced temporal position encoding for the implicit representation [16].
- The Trajectory Planning Diffusion Transformer (TrajDiT) and the Next-frame Prediction Diffusion Transformer (VisDiT) handle trajectory and image prediction, respectively, with an emphasis on action control [21].
- A chain-of-forward training strategy mitigates the drift problem of autoregressive inference by simulating prediction noise during training [24] (third sketch after this summary).

Group 4: Performance Evaluation
- On video-generation metrics the model achieves an FID of 7.5 and an FVD of 82.8, outperforming several existing models [28].
- On trajectory-control metrics it reaches a 97.9% accuracy rate, the highest among the compared methods [34].

Group 5: Conclusion and Future Directions
- The framework integrates high-quality image generation with vehicle trajectory prediction and shows strong potential for closed-loop simulation and reinforcement learning [36].
- The current model is limited to single-camera input; multi-camera consistency and point-cloud generation remain open challenges for the autonomous-driving field [36].
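The decoupled design described in Groups 2 and 3 can be illustrated with a minimal sketch. The module names MST, TrajDiT, and VisDiT come from the article; all dimensions, layer counts, and the conditioning interface below are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of the decoupled architecture (all shapes/dims assumed).
# MST: GPT-style causal transformer over past frame latents and trajectory tokens.
# TrajDiT / VisDiT: diffusion heads that denoise a future trajectory and the
# next-frame latent, both conditioned on the history feature produced by MST.
import torch
import torch.nn as nn


class MST(nn.Module):
    """Multimodal Spatiotemporal Transformer: causal attention over past frames + actions."""

    def __init__(self, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.frame_proj = nn.Linear(512, d_model)   # assumed per-frame latent size
        self.traj_proj = nn.Linear(2, d_model)      # assumed (x, y) waypoint per past step
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.pos = nn.Parameter(torch.zeros(1, 1024, d_model))  # learned temporal positions

    def forward(self, frame_latents, past_traj):
        # frame_latents: (B, T, 512), past_traj: (B, T, 2)
        tokens = torch.cat([self.frame_proj(frame_latents),
                            self.traj_proj(past_traj)], dim=1)
        tokens = tokens + self.pos[:, : tokens.size(1)]
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.encoder(tokens, mask=causal)
        return h[:, -1]  # history condition vector (B, d_model)


class DiffusionHead(nn.Module):
    """Shared shape for TrajDiT / VisDiT: predict noise from (noisy sample, timestep, cond)."""

    def __init__(self, sample_dim, d_model=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sample_dim + 1 + d_model, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, sample_dim),
        )

    def forward(self, noisy, t, cond):
        t = t.float().unsqueeze(-1) / 1000.0  # crude timestep embedding
        return self.net(torch.cat([noisy, t, cond], dim=-1))


if __name__ == "__main__":
    B, T = 2, 8
    mst = MST()
    traj_dit = DiffusionHead(sample_dim=60 * 2)   # 3 s at 20 Hz -> 60 (x, y) waypoints
    vis_dit = DiffusionHead(sample_dim=512)       # next-frame latent, assumed size
    cond = mst(torch.randn(B, T, 512), torch.randn(B, T, 2))
    eps_traj = traj_dit(torch.randn(B, 120), torch.randint(0, 1000, (B,)), cond)
    eps_img = vis_dit(torch.randn(B, 512), torch.randint(0, 1000, (B,)), cond)
    print(eps_traj.shape, eps_img.shape)  # torch.Size([2, 120]) torch.Size([2, 512])
```

One way to read this split: the diffusion heads operate on continuous latents and waypoints, avoiding the tokenization quality loss of GPT-style image generators, while the causal transformer keeps the ability to extend rollouts frame by frame.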
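The asynchronous multimodal generation mechanism can be read as a scheduling choice: the lightweight trajectory branch is sampled at every planning tick, while the heavier next-frame branch runs only when a fresh predicted frame is needed. The sketch below captures that loop structure only; the stand-in samplers, sizes, and the frame interval are assumptions, apart from the 3-second / 20 Hz horizon stated in the article.

```python
# Sketch of the asynchronous generation loop (loop structure only; details assumed).
import torch

def sample_trajectory(cond, steps=10):
    """Stand-in for TrajDiT sampling: 60 (x, y) waypoints, i.e. 3 s at 20 Hz."""
    traj = torch.randn(60, 2)
    for _ in range(steps):          # few, cheap denoising iterations (placeholder)
        traj = 0.9 * traj
    return traj

def sample_next_frame(cond, steps=50):
    """Stand-in for VisDiT sampling: a next-frame latent, far more expensive."""
    latent = torch.randn(512)
    for _ in range(steps):          # many, heavy denoising iterations (placeholder)
        latent = 0.99 * latent
    return latent

cond = torch.randn(256)             # history feature from the MST branch (assumed size)
frame_every = 10                    # predict a new frame only every 10th tick (assumed)

for tick in range(40):              # 40 ticks at 20 Hz ~= 2 s of driving
    plan = sample_trajectory(cond)          # runs every tick: 20 Hz planning output
    if tick % frame_every == 0:
        frame = sample_next_frame(cond)     # runs rarely: long-horizon video rollout
        # in the full model the new frame and executed action would update `cond`
```

Under this reading, the claimed reduction in inference compute comes from skipping the expensive image branch on most planning ticks while the trajectory output is still refreshed at 20 Hz.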
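The chain-of-forward training strategy can be sketched as rolling the model forward for a few steps on its own (detached) predictions rather than on ground-truth history, so the conditioning distribution seen in training resembles the drifting inputs seen during autoregressive inference. The loop structure, loss, and the `world_model` callable below are assumptions; the article only states that prediction noise is simulated during training.

```python
# Sketch of a chain-of-forward training step (loop and loss are assumptions).
import torch
import torch.nn.functional as F

def chain_of_forward_step(world_model, gt_latents, k_forward=3):
    """world_model: hypothetical callable mapping a history (B, T, D) of frame latents
    to the predicted next latent (B, D). Loss is accumulated over k autoregressive
    steps, feeding the model's own predictions back in as history."""
    history = gt_latents[:, :4].clone()               # ground-truth context (B, 4, D)
    total_loss = 0.0
    for step in range(k_forward):
        target = gt_latents[:, 4 + step]              # ground-truth next-frame latent
        pred = world_model(history)                   # model sees its own drifting history
        total_loss = total_loss + F.mse_loss(pred, target)
        # append the *predicted* latent (detached), not the ground truth, so the
        # training-time input distribution matches inference-time rollouts
        history = torch.cat([history, pred.detach().unsqueeze(1)], dim=1)
    return total_loss / k_forward

if __name__ == "__main__":
    toy_model = lambda h: h[:, -1] + 0.01 * torch.randn_like(h[:, -1])  # toy stand-in
    print(chain_of_forward_step(toy_model, torch.randn(2, 8, 512)).item())
```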