Baidu X-Driver: A VLA That Supports Closed-Loop Evaluation

Core Viewpoint
- The article discusses the development and evaluation of X-Driver, a unified multimodal large language model (MLLM) framework for closed-loop autonomous driving, and stresses that closed-loop metrics are needed to assess how an autonomous driving system actually performs [2][3][23].

Group 1: Methodology and Architecture
- X-Driver integrates a Chain-of-Thought (CoT) reasoning mechanism into the MLLM to improve driving decisions, taking camera data and navigation commands as inputs [6][11].
- The system operates in a closed loop: the vehicle's actions change the driving environment, which produces new sensor data that feeds back into subsequent decisions [7][24].
- The architecture incorporates LLaVA, a multimodal model that aligns image and text features so that the driving scene is understood from both modalities [9][10].

Group 2: Training and Reasoning Process
- CoT fusion training uses high-quality CoT prompt data to improve reasoning and decision-making in driving scenarios [11][12].
- The model breaks the driving task into sub-tasks such as object detection and traffic signal interpretation, then combines their results into a final driving decision [17][18].
- Training targets accurate perception of complex 3D driving environments and adherence to traffic regulations, both of which are required for safe navigation [15][22].

Group 3: Closed-loop Evaluation and Results
- Closed-loop evaluation is run in the CARLA simulation environment, with Driving Score and Success Rate as the key performance indicators [27][28].
- The Bench2Drive dataset, which contains over 2 million frames, is used to assess closed-loop driving performance under a variety of conditions [27].
- Results indicate that incorporating CoT reasoning significantly improves decision accuracy, although the closed-loop success rate remains low, at around 20% [30][31].
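
To make the closed-loop idea and the two metrics concrete, here is a minimal sketch in Python. It does not use CARLA or Bench2Drive: `ToyRoute`, `RouteResult`, `run_closed_loop`, `evaluate`, and the three policies are hypothetical names invented for illustration. The metric formulas are also assumptions modeled on common CARLA-style conventions (Driving Score as route completion scaled by an infraction penalty, success as a fully completed route with no infractions); the article does not spell them out, and the 0.7 penalty is only an illustrative value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Policy = Callable[[Dict[str, bool]], str]  # observation -> "go" or "stop"


@dataclass
class RouteResult:
    completion: float  # fraction of the route driven, in [0, 1]
    penalty: float     # infraction multiplier; 1.0 means no infractions

    @property
    def driving_score(self) -> float:
        # Assumed convention: completion percentage scaled by the penalty.
        return 100.0 * self.completion * self.penalty

    @property
    def success(self) -> bool:
        # Assumed convention: full route driven with no infractions.
        return self.completion >= 1.0 and self.penalty >= 1.0


class ToyRoute:
    """Hypothetical 1-D stand-in for a simulator route with one traffic light."""

    def __init__(self, length: int = 10, light_pos: int = 5,
                 red_until: int = 8, max_steps: int = 40) -> None:
        self.length, self.light_pos = length, light_pos
        self.red_until, self.max_steps = red_until, max_steps
        self.position, self.t, self.penalty = 0, 0, 1.0

    def observe(self) -> Dict[str, bool]:
        return {"at_light": self.position == self.light_pos,
                "light_is_red": self.t < self.red_until}

    def step(self, action: str) -> None:
        obs = self.observe()
        if action == "go":
            if obs["at_light"] and obs["light_is_red"]:
                self.penalty *= 0.7  # illustrative red-light penalty
            self.position += 1
        self.t += 1  # time advances whether or not the car moves

    def done(self) -> bool:
        return self.position >= self.length or self.t >= self.max_steps

    def result(self) -> RouteResult:
        return RouteResult(min(self.position / self.length, 1.0), self.penalty)


def run_closed_loop(policy: Policy, env: ToyRoute) -> RouteResult:
    # Closed loop: each action changes the state that the next observation reports.
    while not env.done():
        env.step(policy(env.observe()))
    return env.result()


def evaluate(policies: List[Policy]) -> Dict[str, float]:
    # Run each policy on a fresh route and average the two metrics.
    results = [run_closed_loop(p, ToyRoute()) for p in policies]
    return {
        "success_rate": sum(r.success for r in results) / len(results),
        "driving_score": sum(r.driving_score for r in results) / len(results),
    }


def cautious(obs: Dict[str, bool]) -> str:
    return "stop" if obs["at_light"] and obs["light_is_red"] else "go"


def reckless(obs: Dict[str, bool]) -> str:
    return "go"


def timid(obs: Dict[str, bool]) -> str:
    return "stop"


if __name__ == "__main__":
    print(evaluate([cautious, reckless, timid]))
```

In this toy setup the reckless policy still finishes the route but loses score for running the red light, while the timid policy never moves and times out. That gap between "reached the goal" and "drove correctly" is what a closed-loop success rate is meant to expose, and it cannot be read off per-frame prediction accuracy alone.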