Setting a New NAVSIM SOTA: Fudan Proposes a New End-to-End Autonomous Driving Framework

Core Insights
- The article discusses the transition in end-to-end autonomous driving from a modular approach to a unified paradigm with the rise of Vision-Language-Action (VLA) models, highlighting the limitations of existing autoregressive models in mimicking human driving intuition [1][2].

Group 1: WAM-Diff Framework
- The WAM-Diff framework, developed by Fudan University and Yiwang Intelligence, introduces a Discrete Masked Diffusion model for VLA autonomous driving planning, integrating a sparse mixture of experts (MoE) architecture and online reinforcement learning (GSPO) [2][4].
- WAM-Diff achieved state-of-the-art (SOTA) performance on the NAVSIM benchmark, scoring 91.0 PDMS and 89.7 EPDMS, demonstrating the potential of non-autoregressive generation in complex driving scenarios [2][16][18].

Group 2: Technical Innovations
- WAM-Diff employs Hybrid Discrete Action Tokenization to convert continuous 2D trajectory coordinates into high-precision discrete tokens, allowing for a shared vocabulary with driving commands [5].
- The framework utilizes Masked Diffusion for generation, enabling parallel prediction of all token positions, which enhances inference efficiency and allows for global optimization [5][9].

Group 3: Decoding Strategies
- WAM-Diff explores three decoding strategies: causal, reverse-causal, and random, finding that the reverse-causal strategy yields the best performance in closed-loop metrics, aligning with the "end-to-begin" planning intuition [9][20].
- This approach confirms that establishing long-term driving intentions before detailing immediate actions significantly improves planning consistency and safety [9][20].

Group 4: MoE and GSPO Integration
- The MoE architecture within WAM-Diff includes 64 lightweight experts, dynamically activated based on the driving context, enhancing model capacity and adaptability while controlling computational costs [12].
- The GSPO algorithm bridges the gap between open-loop training and closed-loop execution, optimizing trajectory sequences based on safety, compliance, and comfort metrics [12][14].

Group 5: Experimental Results
- In extensive experiments on the NAVSIM benchmark, WAM-Diff outperformed several leading models, achieving a PDMS score of 91.0 and an EPDMS score of 89.7, indicating its robustness in balancing safety and compliance [16][18].
- The model's performance in NAVSIM-v2, which includes stricter metrics for traffic rule adherence and comfort, improved by 5.2 points over the previous best, showcasing its capability in real-world driving scenarios [18].

Group 6: Conclusion
- WAM-Diff represents a significant advancement in autonomous driving planning, moving towards a discrete, structured, and closed-loop approach, emphasizing the importance of both "how to generate" and "what to generate" in the VLA era [25].
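
To make the tokenization and decoding-order ideas from Groups 2 and 3 more concrete, here is a minimal Python sketch. It is not the WAM-Diff implementation. The coordinate range, the 0.01 resolution, the `MASK` id, and every function name below are assumptions chosen for illustration, and the uniform binning shown here covers only the trajectory part of the "hybrid" vocabulary, not the driving-command tokens. The `predict` argument is a stand-in for the model, and the schedule commits tokens by position rather than by any confidence rule the real system may use.

```python
import random

# Assumed settings: the article does not state the range or bin size.
COORD_MIN, COORD_MAX = -100.0, 100.0  # metres (hypothetical)
STEP = 0.01                            # quantization resolution (hypothetical)
NUM_BINS = int(round((COORD_MAX - COORD_MIN) / STEP)) + 1
MASK = -1                              # placeholder id for a masked position


def encode(value: float) -> int:
    """Map a continuous coordinate to the index of its nearest bin."""
    clipped = min(max(value, COORD_MIN), COORD_MAX)
    return int(round((clipped - COORD_MIN) / STEP))


def decode(token: int) -> float:
    """Map a bin index back to the coordinate at the bin centre."""
    return COORD_MIN + token * STEP


def tokenize_trajectory(waypoints):
    """Flatten [(x, y), ...] into [x0, y0, x1, y1, ...] token ids."""
    tokens = []
    for x, y in waypoints:
        tokens.extend([encode(x), encode(y)])
    return tokens


def unmask_order(num_tokens: int, strategy: str, seed: int = 0):
    """Return the order in which positions are committed during decoding."""
    positions = list(range(num_tokens))
    if strategy == "causal":
        return positions            # nearest waypoint first
    if strategy == "reverse-causal":
        return positions[::-1]      # farthest waypoint first ("end-to-begin")
    if strategy == "random":
        rng = random.Random(seed)
        rng.shuffle(positions)
        return positions
    raise ValueError(f"unknown strategy: {strategy}")


def decode_with_schedule(predict, num_tokens, strategy, tokens_per_step=2):
    """Fill a fully masked sequence, committing a few positions per step.

    `predict(sequence)` is a stand-in for the model: given the partially
    masked sequence, it returns a token guess for every position at once.
    """
    sequence = [MASK] * num_tokens
    order = unmask_order(num_tokens, strategy)
    for start in range(0, num_tokens, tokens_per_step):
        guesses = predict(sequence)  # parallel prediction over all positions
        for pos in order[start:start + tokens_per_step]:
            sequence[pos] = guesses[pos]
    return sequence
```

With `tokens_per_step=2` and the `"reverse-causal"` strategy, the first step commits the final waypoint's two coordinates, so later steps fill in nearer waypoints conditioned on where the trajectory ends. That mirrors the intuition reported in the article of fixing the long-term intention before the immediate actions.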