Closed-Loop End-to-End Performance Jumps 20%! HUST & Xiaomi Build Open-Source Framework ORION

Core Viewpoint
- The article covers advances in end-to-end (E2E) autonomous driving, focusing on the ORION framework, which integrates a vision-language model (VLM) to improve decision-making in complex environments [3][30].

Summary by Sections

Introduction
- Recent E2E autonomous driving methods still struggle with complex closed-loop interactions because their causal reasoning capabilities are limited [3][12].
- VLMs offer new hope for E2E autonomous driving, but a significant gap remains between a VLM's semantic reasoning space and the numerical action space required for driving [3][17].

ORION Framework
- ORION is proposed as an E2E autonomous driving framework that uses vision-language instructions to generate trajectories [3][18].
- The framework combines QT-Former, which aggregates long-term historical context; a VLM, which handles scene understanding and reasoning; and a generative model, which aligns the reasoning space with the action space [3][16][18].

Performance Evaluation
- ORION achieved a driving score of 77.74 and a success rate of 54.62% on the challenging Bench2Drive dataset, outperforming previous state-of-the-art (SOTA) methods by 14.28 points in driving score and 19.61 percentage points in success rate [5][24].
- The framework also performed strongly in specific driving scenarios: overtaking (71.11%), emergency braking (78.33%), and traffic sign recognition (69.15%) [26].

Key Contributions
- The article highlights three key contributions of ORION:
  1. QT-Former strengthens the model's understanding of historical scenes by effectively aggregating long-term visual context [20].
  2. The VLM enables multi-dimensional analysis of driving scenes, integrating user instructions and historical information for action reasoning [21].
  3. The generative model aligns the VLM's reasoning space with the action space used for trajectory prediction, supporting reasonable driving decisions in complex scenarios [22].

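The data flow described by the three contributions (history aggregation, then reasoning, then trajectory generation) can be pictured with a toy sketch. This is not ORION's code: every class, function, and parameter name below is hypothetical, the "features" are plain lists of floats, and each stage is a deterministic placeholder for what the article describes as a learned component (QT-Former, the VLM, and the generative model, respectively).

```python
from collections import deque
from typing import Deque, List, Tuple

Vector = List[float]


class HistoryAggregator:
    """Toy stand-in for the long-term context role the article assigns to QT-Former."""

    def __init__(self, max_frames: int = 8) -> None:
        # deque(maxlen=...) drops the oldest frame automatically once full.
        self.frames: Deque[Vector] = deque(maxlen=max_frames)

    def update(self, frame_feature: Vector) -> Vector:
        """Store the newest frame feature and return the mean over stored frames."""
        self.frames.append(frame_feature)
        n = len(self.frames)
        return [sum(column) / n for column in zip(*self.frames)]


def reason(history: Vector, instruction: str) -> Vector:
    """Toy stand-in for the VLM: fold context and instruction into a 'planning token'."""
    # Hypothetical rule: the instruction only selects a lateral offset target.
    lateral = {"turn left": 1.0, "turn right": -1.0}.get(instruction, 0.0)
    # Hypothetical convention: feature 0 encodes a target speed, clamped at zero.
    speed = max(0.0, history[0])
    return [speed, lateral]


def generate_trajectory(token: Vector, steps: int = 4, dt: float = 0.5) -> List[Tuple[float, float]]:
    """Toy stand-in for the planner: map the token to numeric (x, y) waypoints."""
    speed, lateral = token
    return [(speed * dt * k, lateral * k / steps) for k in range(1, steps + 1)]


# Usage: feed three frames into a two-frame memory, then plan a left turn.
memory = HistoryAggregator(max_frames=2)
for feature in ([4.0, 0.0], [6.0, 0.0], [8.0, 0.0]):
    context = memory.update(feature)
waypoints = generate_trajectory(reason(context, "turn left"))
print(waypoints)
```

The point of the sketch is only the interface between stages: the reasoning step emits a compact token rather than text, and the final step turns that token into numbers, which is the semantic-to-action gap the article says ORION is designed to close.
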
Conclusion
- ORION offers a novel solution for E2E autonomous driving by aligning the semantic and action spaces, aggregating long-term context, and jointly optimizing visual understanding and path planning [30].
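
As a quick check on the Performance Evaluation figures, the short calculation below backs out the previous best scores implied by the reported margins and expresses the gains in relative terms. The baseline values are derived here by subtraction, not quoted from the article, and the variable names are illustrative.

```python
# Figures reported for ORION on Bench2Drive in the summary above.
orion_driving_score = 77.74
orion_success_rate = 54.62   # percent
margin_driving_score = 14.28  # points over the previous SOTA
margin_success_rate = 19.61   # percentage points over the previous SOTA

# Implied previous-SOTA values (derived by subtraction, an assumption of this sketch).
prev_driving_score = round(orion_driving_score - margin_driving_score, 2)
prev_success_rate = round(orion_success_rate - margin_success_rate, 2)

# Relative improvement over the implied baseline, in percent.
rel_driving_score = 100 * margin_driving_score / prev_driving_score
rel_success_rate = 100 * margin_success_rate / prev_success_rate

print(f"implied previous driving score: {prev_driving_score:.2f}")
print(f"implied previous success rate:  {prev_success_rate:.2f}%")
print(f"relative driving score gain:    {rel_driving_score:.1f}%")
print(f"relative success rate gain:     {rel_success_rate:.1f}%")
```

Separating absolute margins (points or percentage points) from relative gains matters when reading headline numbers: the same 19.61-point rise in success rate corresponds to a much larger relative improvement over the implied baseline.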