New NAVSIM SOTA! Masked Diffusion, a New Framework for End-to-End Autonomous Driving

Core Viewpoint - The article introduces WAM-Diff, a framework from Fudan University and Yiwang Intelligent that integrates discrete masked diffusion models into Vision-Language-Action (VLA) models for autonomous driving. It targets the limitations of existing autoregressive models and improves planning capability [3][4][26].

Group 1: Framework and Innovations
- WAM-Diff uses a discrete masked diffusion model that supports non-sequential generation, removing the fixed left-to-right order imposed by traditional autoregressive models [3][6].
- A hybrid discrete action tokenization technique converts continuous 2D trajectory coordinates into high-precision discrete tokens, so that trajectories and driving commands share one vocabulary [6].
- The model combines a mixture of experts (MoE) architecture with online reinforcement learning (GSPO) to improve adaptability and robustness in dynamic driving scenarios [12][14].

Group 2: Performance Metrics
- On the NAVSIM-v1 benchmark, WAM-Diff achieved a state-of-the-art (SOTA) score of 91.0 PDMS, surpassing several leading baseline models [4][16].
- On NAVSIM-v2, which adds stricter metrics for traffic rule adherence and comfort, WAM-Diff reached an EPDMS of 89.7, a 5.2-point improvement over DiffusionDrive [18][19].

Group 3: Decoding Strategies
- The framework explores three decoding strategies: causal, reverse-causal, and random. Reverse-causal decoding gave the best closed-loop performance, which supports the "start with the end" planning intuition [9][20].
- The experiments showed that fixing long-term driving intentions first and filling in immediate actions afterwards makes the generated trajectories more consistent and safer [20][21].

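The three decoding orders in Group 3 can be illustrated with a small sketch. It only computes which trajectory token positions would be unmasked at each denoising step and does not model the network that predicts the tokens. The function name `unmask_schedule`, its parameters, and the even split of positions across steps are assumptions made for illustration, not details taken from WAM-Diff.

```python
import random
from typing import List


def unmask_schedule(num_tokens: int, num_steps: int,
                    strategy: str, seed: int = 0) -> List[List[int]]:
    """Return, for each denoising step, the token positions revealed at that step.

    Position 0 is the nearest-term waypoint token and position
    num_tokens - 1 the farthest one (an assumption of this sketch).
    """
    if num_tokens < 0 or num_steps < 1:
        raise ValueError("num_tokens must be >= 0 and num_steps >= 1")

    positions = list(range(num_tokens))
    if strategy == "causal":
        order = positions                  # near-term tokens first
    elif strategy == "reverse_causal":
        order = positions[::-1]            # far-horizon tokens first
    elif strategy == "random":
        order = positions[:]
        random.Random(seed).shuffle(order)  # seeded for reproducibility
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Split the order into num_steps nearly equal chunks;
    # earlier steps receive the remainder.
    base, extra = divmod(num_tokens, num_steps)
    schedule, start = [], 0
    for step in range(num_steps):
        size = base + (1 if step < extra else 0)
        schedule.append(order[start:start + size])
        start += size
    return schedule


if __name__ == "__main__":
    for name in ("causal", "reverse_causal", "random"):
        print(name, unmask_schedule(8, 4, name))
```

With 8 tokens and 4 steps, the reverse-causal schedule reveals positions 7 and 6 first and positions 1 and 0 last, so the end of the trajectory is committed before the immediate actions.
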
Group 4: Conclusion
- WAM-Diff advances end-to-end autonomous driving planning by showing that, in the VLA era, both "how to generate" and "what to generate" matter, potentially paving the way towards Level 4 autonomous driving [26].
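
The idea behind the action tokenization in Group 1, turning continuous 2D coordinates into tokens that live in a shared vocabulary, can be sketched with plain uniform quantization. This is not the hybrid scheme the article refers to, whose details are not given here. The coordinate range, the 0.01 m resolution, the vocabulary offset, and all function names are hypothetical values chosen for illustration.

```python
from typing import List, Sequence, Tuple

# All constants below are hypothetical, chosen only for this sketch.
COORD_MIN = -100.0            # metres
COORD_MAX = 100.0             # metres
RESOLUTION = 0.01             # metres per bin
ACTION_TOKEN_OFFSET = 50_000  # first id after an assumed text vocabulary
NUM_BINS = int(round((COORD_MAX - COORD_MIN) / RESOLUTION)) + 1


def coord_to_token(value: float) -> int:
    """Clip a coordinate to the supported range and map it to a token id."""
    clipped = min(max(value, COORD_MIN), COORD_MAX)
    bin_index = int(round((clipped - COORD_MIN) / RESOLUTION))
    return ACTION_TOKEN_OFFSET + bin_index


def token_to_coord(token: int) -> float:
    """Map a token id back to the coordinate value its bin represents."""
    bin_index = token - ACTION_TOKEN_OFFSET
    if not 0 <= bin_index < NUM_BINS:
        raise ValueError(f"token {token} is not an action token")
    return COORD_MIN + bin_index * RESOLUTION


def encode_trajectory(waypoints: Sequence[Tuple[float, float]]) -> List[int]:
    """Flatten (x, y) waypoints into the token sequence x0, y0, x1, y1, ..."""
    tokens: List[int] = []
    for x, y in waypoints:
        tokens.append(coord_to_token(x))
        tokens.append(coord_to_token(y))
    return tokens


def decode_trajectory(tokens: Sequence[int]) -> List[Tuple[float, float]]:
    """Invert encode_trajectory, up to the quantization error."""
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens (x, y pairs)")
    return [(token_to_coord(tokens[i]), token_to_coord(tokens[i + 1]))
            for i in range(0, len(tokens), 2)]


if __name__ == "__main__":
    path = [(0.0, 0.0), (1.234, -0.456), (12.5, 3.0)]
    ids = encode_trajectory(path)
    print(ids)
    print(decode_trajectory(ids))
```

In this sketch the round-trip error per coordinate is at most half the bin width, so a finer `RESOLUTION` trades a larger vocabulary for higher precision.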