Percept-WAM: An Autonomous-Driving "Brain" That Truly Understands the World, a Unified Model from Perception to Action
机器之心·2025-12-10 02:09

Core Viewpoint
- The article discusses the limitations of current large vision-language models (VLMs) in autonomous driving, emphasizing the need for stronger spatial perception and geometric understanding to support robust decision-making in real-world scenarios [2][3].

Group 1: Model Introduction
- A new model named Percept-WAM (Perception-Enhanced World-Awareness-Action Model) is proposed, integrating perception, world awareness, and vehicle action into a cohesive framework for autonomous driving [3][4].
- Percept-WAM is designed to build a complete link from perception to decision-making, addressing the shortcomings of existing models that struggle with real-world complexity [3][4].

Group 2: Model Architecture
- The architecture retains a general reasoning VLM backbone while introducing World-PV and World-BEV tokens to unify 2D/3D perception representations [5].
- The model employs a grid-conditioned prediction mechanism and IoU-aware confidence outputs to improve the accuracy and efficiency of its predictions, along with a lightweight action-decoding head for efficient trajectory prediction [5][6].

Group 3: Training Tasks
- Percept-WAM is trained on multi-view streaming video, optional LiDAR point clouds, and text queries, optimizing tasks such as 2D detection, instance segmentation, semantic segmentation, and 3D detection [6][9].
- Joint optimization across these tasks improves overall performance through shared geometric and semantic information [23].

Group 4: Performance Evaluation
- On public benchmarks, Percept-WAM is competitive with existing models in PV (perspective-view) perception, BEV (bird's-eye-view) perception, and end-to-end trajectory planning [21][30].
- In the PV setting, Percept-WAM reaches 49.9 mAP in 2D detection, surpassing specialized models such as Mask R-CNN [22][24].
- In the BEV setting, the model reaches 58.9 mAP in 3D detection, outperforming traditional BEV detection methods [27][28].

Group 5: Confidence Prediction
- IoU-based confidence prediction significantly improves the alignment between predicted confidence scores and actual localization quality, making dense detection more reliable [25].

Group 6: Decision-Making Integration
- Percept-WAM integrates World-Action tokens for action and trajectory prediction, allowing a seamless transition from world modeling to decision output and aligning perception and planning in a unified representation space [16][17].
- The model uses a query-based trajectory prediction method that draws on multiple feature groups, improving the efficiency and accuracy of trajectory planning [19].

Group 7: Future Implications
- Percept-WAM represents a forward-looking evolution in autonomous driving, emphasizing a unified model that can perceive, understand, and act in the world, moving beyond traditional models that merely process language [41].
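The IoU-aware confidence idea mentioned above can be illustrated with a minimal sketch: instead of training confidence against a binary objectness label, the training target for each predicted box is its IoU with the best-matching ground-truth box, so a high score implies good localization. The box format and function names below are illustrative assumptions, not details from the Percept-WAM paper.

```python
# Hypothetical sketch: IoU-aware confidence targets for dense detection.
# Boxes are axis-aligned (x1, y1, x2, y2) tuples; names are illustrative.

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence_targets(pred_boxes, gt_boxes):
    """For each predicted box, the confidence target is the IoU with its
    best-matching ground-truth box (0.0 when there is no ground truth)."""
    return [max((iou(p, g) for g in gt_boxes), default=0.0)
            for p in pred_boxes]

preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
gts = [(0, 0, 10, 10), (25, 25, 35, 35)]
targets = confidence_targets(preds, gts)  # perfect box -> 1.0, partial -> 1/7
```

A confidence head regressed toward such targets (rather than a 0/1 label) is what aligns predicted scores with actual localization quality.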

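Query-based trajectory prediction, as described in Group 6, can be sketched at a very high level: a small set of trajectory queries is scored against an encoded world state, and the best-scoring query is decoded into waypoints. Everything below (the dot-product scoring, the linear decode, the dimensions) is a simplified assumption; the actual model operates on learned World-Action tokens inside the network.

```python
# Hypothetical sketch of query-based trajectory decoding. Scoring and the
# decode rule are illustrative stand-ins for learned attention and MLP heads.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decode_trajectory(query, horizon=3):
    """Decode a query vector into `horizon` (x, y) waypoints with a fixed,
    made-up linear rule: step t scales the first two query dimensions."""
    return [(query[0] * (t + 1), query[1] * (t + 1)) for t in range(horizon)]

def plan(queries, scene_feature):
    """Score each trajectory query against the scene feature and decode
    the best-scoring one into waypoints."""
    best = max(queries, key=lambda q: dot(q, scene_feature))
    return decode_trajectory(best)

queries = [(1.0, 0.0, 0.2), (0.0, 1.0, 0.8)]  # e.g. "straight" vs "turn" modes
scene = (0.1, 0.9, 0.5)                        # encoded world state (made up)
waypoints = plan(queries, scene)
```

The appeal of this design is that planning reuses the same representation space the perception tokens live in: selecting and decoding a query is cheap compared with autoregressively generating a trajectory token by token.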