Core Insights
- The article discusses the limitations of traditional modular autonomous driving systems and introduces the UniMM-V2X framework, which enhances multi-agent end-to-end systems through multi-level collaboration in perception and prediction [1][3][25]
- UniMM-V2X utilizes a mixture-of-experts (MoE) architecture to improve the adaptability and specialization of the perception, prediction, and planning tasks, achieving state-of-the-art (SOTA) performance [1][7][25]

Group 1: UniMM-V2X Framework
- UniMM-V2X consists of three main components: an image encoder, a collaborative perception module, and a collaborative prediction and planning module, all integrated with the MoE architecture [8][24]
- The framework enhances planning by integrating information from multiple agents at both the perception and prediction levels, significantly improving decision-making reliability in complex scenarios [6][7][8]

Group 2: Performance Metrics
- The framework demonstrated a 39.7% improvement in perception accuracy, a 7.2% reduction in prediction error, and a 33.2% enhancement in planning performance, showcasing the effectiveness of the MoE-enhanced multi-level collaboration paradigm [7][25]
- On the DAIR-V2X benchmark, UniMM-V2X achieved the lowest average planning error, 1.49 meters, and a collision rate of only 0.12% over a 3-second horizon, outperforming all baseline models [15][16][25]

Group 3: Comparative Analysis
- Compared to SparseDrive, a leading single-agent driving solution, UniMM-V2X improved mean Average Precision (mAP) by 39.7% and Average Multi-Object Tracking Accuracy (AMOTA) by 77.2% without incurring additional communication costs [17][25]
- In motion prediction, UniMM-V2X achieved a minimum Average Displacement Error (minADE) of 0.64 meters and a minimum Final Displacement Error (minFDE) of 0.69 meters, contributing significantly to overall planning performance [19][20][25]

Group 4: Multi-Level Fusion and MoE Impact
- The multi-level fusion approach ensures
high-quality intermediate features are propagated throughout the framework, leading to performance improvements across all modules [22][23]
- Integrating MoE in both the encoder and the decoder yields the best results, enhancing environmental understanding and capturing complex motion behaviors effectively [22][23]

Group 5: Practicality and Reliability
- UniMM-V2X reduced communication costs by a factor of 87.9 compared to traditional methods while maintaining planning quality, running at a frame rate of 5.4 FPS [24][25]
- The framework demonstrates reliability and scalability under various bandwidth conditions, making it suitable for real-world autonomous driving applications [24][25]
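The MoE design highlighted in Groups 1 and 4 can be illustrated with a minimal top-k gated mixture-of-experts layer over feature tokens. This is an illustrative sketch only, not the paper's implementation: the expert count, the top-2 routing, and the use of plain linear experts are assumptions made here; a real system would use learned expert networks over encoder or decoder query features.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    """Top-k gated mixture of experts over per-token features (illustrative)."""
    def __init__(self, dim, num_experts=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        # Gating network: scores each token against each expert.
        self.gate_w = rng.normal(0, 0.02, (dim, num_experts))
        # Experts: simple linear maps standing in for expert sub-networks.
        self.experts = [rng.normal(0, 0.02, (dim, dim)) for _ in range(num_experts)]
        self.top_k = top_k

    def __call__(self, x):
        # x: (tokens, dim), e.g. perception or motion query features
        scores = softmax(x @ self.gate_w)                # (tokens, num_experts)
        topk = np.argsort(scores, axis=-1)[:, -self.top_k:]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            sel = scores[t, topk[t]]
            sel = sel / sel.sum()                        # renormalize top-k weights
            for w, e in zip(sel, topk[t]):
                out[t] += w * (x[t] @ self.experts[e])   # weighted expert mixture
        return out

layer = MoELayer(dim=8)
tokens = np.random.default_rng(1).normal(size=(5, 8))
y = layer(tokens)
print(y.shape)  # (5, 8)
```

Each token is routed only to its top-2 experts, which is what lets different experts specialize (e.g. on perception versus motion patterns) without every expert processing every token.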
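The multi-level collaboration idea, fusing collaborator features into the ego stream at both the perception level and the prediction level, can be sketched as a confidence-weighted feature fusion. The function name, the weighting scheme, and the confidence values below are illustrative assumptions; the summary does not specify the paper's actual fusion operator.

```python
import numpy as np

def fuse(ego, others, conf):
    """Confidence-weighted fusion of ego features with collaborator features.

    ego:    (tokens, dim) ego-agent feature tokens
    others: list of (tokens, dim) arrays from collaborators (e.g. infrastructure)
    conf:   one scalar confidence per agent, ego first
    """
    stack = np.stack([ego] + others)          # (agents, tokens, dim)
    w = np.asarray(conf, dtype=float)
    w = w / w.sum()                           # normalize agent weights
    return np.tensordot(w, stack, axes=1)     # weighted sum -> (tokens, dim)

ego = np.ones((4, 8))                         # ego BEV/query features (toy values)
infra = [np.full((4, 8), 2.0)]                # one infrastructure collaborator
fused = fuse(ego, infra, conf=[0.7, 0.3])     # perception-level fusion
# The same operator can then be reused at the prediction level on motion-query
# features, which is the "multi-level" aspect described above.
print(fused.shape)  # (4, 8)
```

The key point the sketch captures is that the same cheap fusion step can be applied at more than one stage of the pipeline, so downstream planning sees collaborator information both in the perceived scene and in the predicted motions.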
Tsinghua's UniMM-V2X: An MoE-Based Multi-Level Fusion End-to-End V2X Framework
自动驾驶之心 (Autonomous Driving Heart) · 2025-12-19 00:05