CVPR 2025 WAD Vision-Only End-to-End | Champion Solution Technical Report
自动驾驶之心 · 2025-06-29 11:33

Core Viewpoint

The article discusses recent advances in end-to-end autonomous driving, highlighting Poutine, the top entry in the vision-only end-to-end driving competition, and emphasizing its robust two-phase training methodology and leading results [1][13].

Group 1: Technical Overview

- The winning solution, Poutine, builds on a 3B-parameter Vision-Language Model (VLM) to address long-tail scenarios in vision-only end-to-end autonomous driving [1].
- Training proceeds in two phases (a hedged sketch of this recipe appears after this summary):
  - Phase one performs self-supervised pre-training on combined vision, language, and trajectory data, drawing on 83 hours of CoVLA data and 11 hours of the Waymo long-tail dataset [2].
  - Phase two fine-tunes the model with reinforcement learning (RL) on 500 rater-annotated segments from the Waymo validation set to improve robustness [2][8].
- Poutine achieved a Rater-Feedback Score (RFS) of 7.99 on the Waymo test set, placing first in the competition [2][13].

Group 2: Data and Methodology

- The datasets comprise CoVLA, which contains 10,000 thirty-second front-view driving videos, and WOD-E2E, which provides 4,021 long-tail driving scenarios with trajectory annotations [11].
- The evaluation metric, RFS, scores a predicted trajectory by its proximity to expert-rated reference trajectories on a 0-to-10 scale [11]; a toy illustration of such a metric is sketched at the end of this summary.
- Training used a batch size of 64 and a learning rate of 1e-5 on the CoVLA dataset, and a batch size of 16 with otherwise similar settings on WOD-E2E [11].

Group 3: Results and Analysis

- Poutine topped the leaderboard at 7.99, ahead of the runner-up's 7.91 [13].
- The article notes that while RL fine-tuning did not dramatically raise the aggregate score, it was effective on the most challenging scenarios [13].
- The results suggest that combining VLM pre-training with RL fine-tuning strengthens the model's ability to handle complex driving environments [18].

Group 4: Future Considerations

- The article questions whether VLMs and LLMs can become mainstream for trajectory prediction, particularly given open issues in their understanding of the physical world and of 3D trajectory information [19].
- It suggests that on conventional evaluation datasets the advantages of such models may be less pronounced, indicating a need for further exploration [19].
- Integrating action models with VLMs for trajectory prediction is proposed as a more comprehensive approach [19].
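To make the two-phase recipe from Group 1 concrete, here is a minimal sketch in PyTorch. It is illustrative only: the article specifies neither Poutine's architecture beyond "3B VLM" nor the particular RL algorithm, so a tiny stand-in model (ToyTrajectoryLM, a hypothetical name) and a plain REINFORCE-style update are used purely to show the shape of the pipeline. Only the batch sizes (64 and 16) and the learning rate (1e-5) come from the article; everything else is assumed.

```python
# Hedged sketch of the two-phase training recipe described above.
# The model, tokenization, and reward are stand-ins, not Poutine's actual method.
import torch
import torch.nn as nn

VOCAB = 256   # toy vocabulary covering text and discretized trajectory waypoints
HIDDEN = 64

class ToyTrajectoryLM(nn.Module):
    """Hypothetical stand-in for a VLM that emits trajectories as token sequences."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

model = ToyTrajectoryLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # learning rate per the article

# ---- Phase 1: self-supervised next-token pre-training on ----
# ---- vision-language-trajectory sequences (CoVLA + Waymo long-tail). ----
loss_fn = nn.CrossEntropyLoss()
batch = torch.randint(0, VOCAB, (64, 32))             # batch size 64 per the article
logits = model(batch[:, :-1])
loss = loss_fn(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
loss.backward()
opt.step()
opt.zero_grad()

# ---- Phase 2: RL fine-tuning on rater-annotated segments. ----
# The reward would be the rater feedback on a sampled trajectory;
# here it is faked with a random scalar in [0, 10].
prompt = torch.randint(0, VOCAB, (16, 8))             # batch size 16 per the article
logits = model(prompt)
dist = torch.distributions.Categorical(logits=logits)
sample = dist.sample()                                # sampled trajectory tokens
reward = torch.rand(16) * 10.0                        # stand-in for the rater score
logp = dist.log_prob(sample).sum(dim=1)
rl_loss = -((reward - reward.mean()) * logp).mean()   # REINFORCE with a mean baseline
rl_loss.backward()
opt.step()
opt.zero_grad()
```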
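On the metric from Group 2: the article states only that RFS rewards proximity of the predicted trajectory to expert-rated trajectories on a 0-to-10 scale; the exact formula is not given. The toy function below (toy_rater_feedback_score, a hypothetical helper) captures that stated idea under assumed details: the prediction inherits the rating of the nearest rated reference, with full credit inside a distance tolerance and exponential decay beyond it.

```python
# Toy illustration of a rater-feedback-style metric. The actual WOD-E2E
# RFS formula is not given in the article; only the 0-10 scale and the
# proximity-to-rated-trajectories idea come from the text.
import numpy as np

def toy_rater_feedback_score(pred, refs, ref_scores, tol=2.0):
    """pred: (T, 2) predicted waypoints; refs: list of (T, 2) rated
    trajectories; ref_scores: rater scores in [0, 10] per reference.
    tol: distance (meters, assumed) at which credit starts to decay."""
    best = 0.0
    for ref, score in zip(refs, ref_scores):
        # Mean Euclidean distance between time-aligned waypoints.
        dist = np.linalg.norm(pred - ref, axis=1).mean()
        # Full credit within tol; exponential decay beyond it (assumed shape).
        credit = score * (1.0 if dist <= tol else np.exp(-(dist - tol)))
        best = max(best, credit)
    return best

# Usage: one prediction scored against two rated reference trajectories.
t = np.linspace(0, 5, 11)
refs = [np.stack([t, 0.0 * t], axis=1), np.stack([t, 0.5 * t], axis=1)]
pred = np.stack([t, 0.1 * t], axis=1)
print(toy_rater_feedback_score(pred, refs, ref_scores=[10.0, 6.0]))
```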