End-to-End Autonomous Driving
Led by industry veterans! Master end-to-end autonomous driving in three months
自动驾驶之心· 2025-09-29 08:45
Core Viewpoint
- 2023 is identified as the year of end-to-end production, with 2024 expected to be a significant year for this development in the automotive industry, particularly in autonomous driving technology [1][3].

Group 1: End-to-End Production
- Leading new-force automakers and established manufacturers have already put end-to-end systems into production [1].
- Two main paradigms dominate the industry: one-stage and two-stage approaches, with UniAD being a representative one-stage method [1].

Group 2: Development Trends
- Since last year, the one-stage end-to-end approach has evolved rapidly, spawning derivatives such as perception-based, world-model-based, diffusion-model-based, and VLA-based one-stage methods [3].
- Major autonomous driving companies are focusing on in-house development and mass production of end-to-end solutions [3].

Group 3: Course Offerings
- A course titled "End-to-End and VLA Autonomous Driving" has been launched, covering cutting-edge algorithms in both one-stage and two-stage end-to-end approaches [5].
- The course aims to provide insight into the latest technologies in the field, including BEV perception, vision-language models, diffusion models, and reinforcement learning [5].

Group 4: Course Structure
- The course consists of several chapters, starting with an introduction to end-to-end algorithms, followed by the background knowledge needed to understand the technology stack [9][10].
- The second chapter focuses on the technical keywords most frequently asked about in job interviews over the next two years [10].
- Subsequent chapters cover two-stage end-to-end methods, one-stage end-to-end methods, and practical assignments involving RLHF fine-tuning [12][13].

Group 5: Learning Outcomes
- Upon completion, participants are expected to reach a level equivalent to one year of experience as an end-to-end autonomous driving algorithm engineer [19].
- The course aims to deepen understanding of key technologies such as BEV perception, multimodal large models, and reinforcement learning, enabling participants to apply what they learn to real projects [19].
A VLA that checks its own work! ReflectDrive: an end-to-end framework for safer, more efficient scaling (Li Auto & Tsinghua)
自动驾驶之心· 2025-09-27 23:33
Core Viewpoint
- ReflectDrive is a novel learning framework that integrates a reflective mechanism to achieve safe trajectory generation through discrete diffusion, addressing key challenges in end-to-end autonomous driving systems [4][46].

Group 1: Introduction and Background
- Autonomous driving is leading the transportation industry towards a safer and more efficient future, with end-to-end (E2E) systems becoming a mainstream alternative to traditional modular designs [4].
- Vision-Language-Action (VLA) models draw on pre-trained knowledge from vision-language models (VLMs) to enhance adaptability in complex scenarios [4][5].
- Current learning-based methods have not resolved core challenges of imitation-learning driving systems, particularly encoding physical rules such as collision avoidance [4][5].

Group 2: ReflectDrive Framework
- ReflectDrive proposes a new learning framework that uses a discrete-diffusion reflective mechanism for safe trajectory generation [3][12].
- The framework first discretizes the two-dimensional driving space to construct an action codebook, allowing pre-trained diffusion language models to be fine-tuned for planning tasks [3][14].
- The reflective mechanism operates without gradient computation, enabling iterative self-correction inspired by spatiotemporal joint planning [3][8].

Group 3: Methodology and Mechanism
- The reflective inference process consists of two stages: goal-conditioned trajectory generation and safety-guided regeneration (a minimal sketch of this loop follows this summary) [20][25].
- The framework integrates safety metrics to evaluate the generated multimodal trajectories, identifying unsafe path points through local search [8][25].
- The iterative optimization loop continues until the trajectory is deemed safe or the computational budget is exhausted, preserving real-time efficiency [31][32].

Group 4: Experimental Results
- ReflectDrive was evaluated on the NAVSIM benchmark, demonstrating significant improvements in safety metrics such as collision rate and drivable-area compliance [32][38].
- The safety-guided regeneration mechanism yielded substantial gains over the baseline: DAC +3.9%, TTC +1.3%, NC +0.8%, and EP +7.9% [37][38].
- With ground-truth agent information, ReflectDrive's performance approached human driving levels, achieving an NC of 99.7% and a DAC of 99.5% [38][39].

Group 5: Conclusion
- ReflectDrive effectively combines a reflective mechanism with discrete diffusion for safe trajectory generation, as validated on the NAVSIM benchmark [46].
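To make the two-stage reflective loop concrete, here is a minimal Python sketch of how goal-conditioned generation and safety-guided regeneration could fit together. Everything here is an illustrative assumption (the `sample_fn` discrete-diffusion sampler, the clearance-based safety check, all names), not ReflectDrive's actual code.

```python
import numpy as np

def decode(tokens, codebook):
    """Map discrete action tokens back to 2D waypoints via the codebook (V, 2)."""
    return codebook[tokens]  # (T, 2)

def unsafe_indices(traj, obstacles, min_clearance=1.5):
    """Indices of waypoints closer to any obstacle than min_clearance (metres)."""
    d = np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=-1)  # (T, M)
    return np.where(d.min(axis=1) < min_clearance)[0]

def reflective_generate(sample_fn, codebook, obstacles, horizon=8, max_iters=5):
    # Stage 1: goal-conditioned generation over the full token sequence.
    mask = np.ones(horizon, dtype=bool)          # True = token still masked
    tokens = sample_fn(mask, prefix=None)        # assumed discrete-diffusion sampler
    for _ in range(max_iters):
        traj = decode(tokens, codebook)
        bad = unsafe_indices(traj, obstacles)
        if bad.size == 0:
            return traj                          # trajectory deemed safe
        # Stage 2: safety-guided regeneration -- re-mask only the unsafe
        # waypoints and resample them; no gradients are involved.
        mask = np.zeros(horizon, dtype=bool)
        mask[bad] = True
        tokens = sample_fn(mask, prefix=tokens)
    return decode(tokens, codebook)              # best effort at budget limit
```

The key property this preserves from the description above is that correction is gradient-free: the sampler is simply re-invoked on a smaller masked set, so the loop cost shrinks as the trajectory gets safer.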
After comparison, VLA is far more mature than world models...
自动驾驶之心· 2025-09-26 16:03
Core Insights
- The article compares VLA (Vision-Language Action) models and world models in end-to-end autonomous driving, noting that over 90% of current models are segmented end-to-end rather than purely VLA or world models [2][6].

Group 1: Model Comparison
- VLA models, represented by companies such as Gaode Map and Horizon Robotics, show superior performance compared to world models, with the latest VLA papers published as recently as September 2025 [6][43].
- Performance metrics indicate that VLA models significantly outperform world models; the best VLA model achieves an average L2 distance of 0.19 meters and a collision rate of 0.08% (these metrics are sketched below) [5][6].

Group 2: Data Utilization
- Shanghai AI Lab's GenAD model uses unlabelled data sourced from the internet, primarily YouTube, to improve generalization, in contrast to traditional supervised learning methods that rely on labeled data [7][19].
- The GenAD framework employs a two-tier training approach similar to Tesla's, integrating diffusion models and Transformers, but requires high-precision maps and traffic rules to operate effectively [26][32].

Group 3: Testing Methods
- Two primary testing methods for end-to-end autonomous driving are identified: closed-loop testing with synthetic data in simulators such as CARLA, and open-loop testing on real-world collected data [4][6].
- The article emphasizes the limitations of open-loop testing, which provides no feedback on how predicted actions would play out once executed, making closed-loop testing more reliable for evaluating model performance [4][6].

Group 4: Future Directions
- While world models have potential, current implementations often require additional labeled data, which erodes their generalization and cost advantages over VLA models [43].
- Ongoing research points toward better integration of diverse data sources and more robust models through advanced training techniques [19][32].
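For readers unfamiliar with the open-loop numbers quoted above (average L2 distance, collision rate), here is a minimal sketch of how such metrics are typically computed; the array shapes and the grid-based collision check are assumptions for illustration, not any benchmark's reference implementation.

```python
import numpy as np

def avg_l2(pred, gt):
    """pred, gt: (N, T, 2) planned vs. logged human trajectories in metres.
    Returns the mean Euclidean deviation across samples and timesteps."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def collision_rate(pred, occupancy, to_grid):
    """Fraction of samples whose plan enters an occupied BEV cell.
    occupancy: (N, H, W) boolean grids; to_grid maps (T, 2) xy -> (rows, cols)."""
    hits = 0
    for traj, grid in zip(pred, occupancy):
        rows, cols = to_grid(traj)
        if grid[rows, cols].any():   # any waypoint inside an occupied cell
            hits += 1
    return hits / len(pred)
```

This also illustrates why open-loop evaluation is limited, as the article argues: both metrics score a predicted plan against a fixed log, so nothing measures what would happen after the first action is actually executed.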
AnchDrive: a new diffusion policy for end-to-end autonomous driving (Shanghai University & Bosch)
自动驾驶之心· 2025-09-26 07:50
Core Insights
- The article introduces AnchDrive, an end-to-end framework for autonomous driving that effectively addresses the challenges of multimodal behavior and generalization in long-tail scenarios [1][10][38].
- AnchDrive uses a hybrid trajectory-anchor approach, combining dynamic and static anchors to improve trajectory quality and planning robustness [10][38].

Group 1: Introduction and Background
- End-to-end autonomous driving algorithms have gained significant attention for their superior scalability and adaptability compared to traditional rule-based motion planning [4][12].
- These methods learn control signals directly from raw sensor data, reducing the complexity of modular design and minimizing cumulative perception errors [4][12].

Group 2: Methodology
- AnchDrive employs a multi-head trajectory decoder that dynamically generates a set of trajectory anchors, capturing behavioral diversity under local environmental conditions [8][15].
- The framework also integrates a large-scale static anchor set derived from human driving data, providing cross-scenario behavioral priors [8][15].

Group 3: Experimental Results
- On the NAVSIM v2 simulation platform, AnchDrive achieved an Extended Predictive Driver Model Score (EPDMS) of 85.5, demonstrating robust and contextually appropriate behavior in complex driving scenarios [9][30][34].
- AnchDrive significantly outperformed existing methods, improving EPDMS by 8.9 points over VADv2 while reducing the number of trajectory anchors from 8192 to just 20 [34].

Group 4: Contributions
- The main contributions include the AnchDrive framework itself, which uses a truncated diffusion process initialized from a hybrid trajectory-anchor set, significantly improving initial trajectory quality and planning robustness (sketched below) [10][38].
- A mixed perception model with dense and sparse branches enhances the planner's understanding of obstacles and road geometry [11][18].
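The truncated-diffusion idea in Group 4 can be sketched in a few lines: rather than denoising from pure Gaussian noise over all K steps, start from the hybrid anchor set lightly noised to some step k << K. The `denoise_step` function and the noise schedule below are illustrative assumptions, not AnchDrive's implementation.

```python
import torch

def truncated_diffusion(denoise_step, anchors, alphas_cumprod, start_step=10):
    """anchors: (B, T, 2) hybrid trajectory anchors (static priors + dynamic
    decoder outputs); alphas_cumprod: (K,) cumulative noise schedule."""
    a = alphas_cumprod[start_step]
    # Forward-noise the anchors only up to start_step (the truncation):
    # the initial state is already near plausible driving behavior.
    x = a.sqrt() * anchors + (1 - a).sqrt() * torch.randn_like(anchors)
    # Reverse process over the remaining few steps only.
    for t in reversed(range(start_step)):
        x = denoise_step(x, t)
    return x  # refined trajectories
```

The design intuition matches the summary: because the process starts near data-like anchors instead of noise, far fewer anchors and denoising steps are needed, which is consistent with the reported drop from 8192 anchors to 20.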
How can human-like reasoning be injected into one-stage end-to-end driving? HKUST's OmniScene proposes a new paradigm...
自动驾驶之心· 2025-09-25 23:33
Core Insights
- The article discusses the limits of current autonomous driving systems in achieving true scene understanding and proposes OmniScene, a new framework that brings human-like cognitive abilities into the driving process [11][13][14].

Group 1: OmniScene Framework
- OmniScene introduces a vision-language model (OmniVLM) that combines panoramic perception with temporal fusion for comprehensive 4D scene understanding [2][14].
- The framework uses a teacher-student architecture for knowledge distillation, embedding textual representations into 3D instance features to strengthen semantic supervision (sketched below) [2][15].
- A hierarchical fusion strategy (HFS) addresses the imbalanced contributions of different modalities during multi-modal fusion, adaptively calibrating geometric and semantic features [2][16].

Group 2: Performance Evaluation
- Evaluated on the nuScenes dataset, OmniScene outperformed more than ten mainstream models across tasks, setting new benchmarks for perception, prediction, planning, and visual question answering (VQA) [3][16].
- Notably, OmniScene achieved a 21.40% improvement in VQA performance, demonstrating robust multi-modal reasoning [3][16].

Group 3: Human-like Scene Understanding
- The framework aims to replicate human visual processing by continuously converting sensory input into scene understanding and adjusting attention to the dynamics of the driving environment [11][14].
- OmniVLM processes multi-view, multi-frame visual inputs, enabling comprehensive scene perception and attention reasoning [14][15].

Group 4: Multi-modal Learning
- The proposed HFS combines 3D instance representations with multi-view visual inputs and semantic attention derived from textual cues, improving the model's grasp of complex driving scenarios [16][19].
- Integrating visual and textual modalities improves contextual awareness and decision-making in dynamic environments [19][20].

Group 5: Challenges and Solutions
- Integrating vision-language models (VLMs) into autonomous driving raises challenges such as the need for domain-specific knowledge and real-time safety requirements [20][21].
- Proposed solutions include designing driving-attention prompts and developing new end-to-end visual-language reasoning methods for safety-critical driving scenarios [22].
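A common way to realize the teacher-student distillation described in Group 1 is a cosine-alignment loss between projected 3D instance features (student) and frozen text embeddings from the VLM (teacher). The module below is a hedged sketch under that assumption; the projection dimensions and names are illustrative, not OmniScene's.

```python
import torch
import torch.nn.functional as F

class TextTo3DDistill(torch.nn.Module):
    """Align 3D instance features with VLM text embeddings (assumed design)."""

    def __init__(self, inst_dim=256, text_dim=512, embed_dim=256):
        super().__init__()
        self.proj_inst = torch.nn.Linear(inst_dim, embed_dim)   # student head
        self.proj_text = torch.nn.Linear(text_dim, embed_dim)   # teacher head

    def forward(self, inst_feats, text_feats):
        """inst_feats: (N, inst_dim) 3D instance features (student branch).
        text_feats: (N, text_dim) matched text embeddings (teacher branch)."""
        s = F.normalize(self.proj_inst(inst_feats), dim=-1)
        t = F.normalize(self.proj_text(text_feats.detach()), dim=-1)  # teacher frozen
        return (1 - (s * t).sum(-1)).mean()  # cosine alignment loss
```

Detaching the teacher side keeps gradients flowing only into the 3D branch, which is what lets textual semantics supervise geometry without degrading the language model.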
FlowDrive: an interpretable end-to-end framework with soft and hard constraints (SJTU & Bosch)
自动驾驶之心· 2025-09-22 23:34
Core Insights
- The article introduces FlowDrive, a novel end-to-end driving framework that combines energy-based flow-field representation, adaptive anchor-trajectory optimization, and motion-decoupled trajectory generation to improve safety and interpretability in autonomous driving [4][45].

Group 1: Introduction and Background
- End-to-end autonomous driving has drawn attention for its potential to simplify traditional modular pipelines and to leverage large-scale data for joint learning of perception, prediction, and planning [4].
- A mainstream research direction generates Bird's Eye View (BEV) representations from multi-view camera inputs, which provide structured spatial views beneficial for downstream planning [4][6].

Group 2: FlowDrive Framework
- FlowDrive introduces energy-based flow fields in BEV space to explicitly model geometric constraints and rule-based semantics, strengthening the BEV representation (a small sketch follows this summary) [7][15].
- A flow-aware anchor-trajectory optimization module aligns initial trajectories with safe, goal-oriented regions, improving spatial validity and intention consistency [15][22].
- A task-decoupled diffusion planner separates high-level intention prediction from low-level trajectory denoising, enabling targeted supervision and flow-field-conditioned decoding [9][27].

Group 3: Experimental Results
- On the NAVSIM v2 benchmark, FlowDrive achieves state-of-the-art performance with an Extended Predictive Driver Model Score (EPDMS) of 86.3, surpassing previous methods [3][40].
- FlowDrive shows clear advantages on safety-related metrics such as Drivable Area Compliance (DAC) and Time to Collision (TTC), indicating superior adherence to driving constraints and better hazard avoidance [40][41].
- Ablation studies validate the design: removing any core component causes a significant drop in overall performance [43][47].

Group 4: Technical Details
- The flow-field learning module encodes dense, physically interpretable spatial gradients that provide fine-grained guidance for trajectory planning [20][21].
- The perception module uses a Transformer-based architecture to fuse multi-modal sensor inputs into a compact, semantically rich BEV representation [18][37].
- Training uses a composite loss that supervises trajectory planning, anchor-trajectory optimization, flow-field modeling, and auxiliary perception tasks [30][31][32][34].
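One way to picture flow-aware anchor refinement as described above: treat the BEV energy field (high near obstacles and rule violations, low in safe, goal-directed regions) as a potential and nudge anchor waypoints down its gradient. The sketch below assumes anchors are already expressed in the grid frame; the resolution, step size, and iteration count are illustrative, not FlowDrive's settings.

```python
import numpy as np

def refine_anchors(anchors, energy, res=0.5, step=0.2, iters=20):
    """anchors: (T, 2) xy waypoints in metres (grid frame, origin at cell 0,0);
    energy: (H, W) BEV energy field. Returns gradient-refined waypoints."""
    gy, gx = np.gradient(energy)   # d(energy)/d(row), d(energy)/d(col)
    pts = anchors.copy()
    for _ in range(iters):
        cols = np.clip((pts[:, 0] / res).astype(int), 0, energy.shape[1] - 1)
        rows = np.clip((pts[:, 1] / res).astype(int), 0, energy.shape[0] - 1)
        # Move each waypoint against the local energy gradient (descent),
        # pulling it toward safe, goal-consistent regions.
        pts[:, 0] -= step * gx[rows, cols]
        pts[:, 1] -= step * gy[rows, cols]
    return pts
```

Because the field itself is learned and dense, the same gradients that refine anchors also serve as the "physically interpretable" guidance signal mentioned in Group 4.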
Seven hard years, three generations! The evolution of BEV: the latest survey from HIT & Tsinghua
自动驾驶之心· 2025-09-17 23:33
Core Viewpoint
- The article reviews the evolution of Bird's Eye View (BEV) perception as a foundational technology for autonomous driving, highlighting its role in ensuring safety and reliability in complex driving environments [2][4].

Group 1: Essence of BEV Perception
- BEV perception is an efficient spatial-representation paradigm that projects heterogeneous data from multiple sensors (cameras, LiDAR, and radar) into a unified BEV coordinate system, yielding a consistent, structured spatial-semantic map (see the projection sketch after this summary) [6][12].
- This top-down view greatly reduces the complexity of multi-view and multi-modal data fusion, aiding accurate perception and understanding of the spatial relationships between objects [6][12].

Group 2: Importance of BEV Perception
- With a unified, interpretable spatial representation, BEV perception is an ideal foundation for multi-modal fusion and multi-agent collaborative perception in autonomous driving [8][12].
- Integrating heterogeneous sensor data into a common BEV plane enables seamless alignment and integration, making information sharing between vehicles and infrastructure more efficient [8][12].

Group 3: Implementation of BEV Perception
- The evolution of safety-oriented BEV perception (SafeBEV) is divided into three stages: SafeBEV 1.0 (single-modal vehicle perception), SafeBEV 2.0 (multi-modal vehicle perception), and SafeBEV 3.0 (multi-agent collaborative perception) [12][17].
- Each stage marks an advance in technology and capability, addressing increasingly complex dynamic traffic scenarios [12][17].

Group 4: SafeBEV 1.0 - Single-Modal Vehicle Perception
- This stage uses a single sensor (a camera or LiDAR) for BEV scene understanding, with methods evolving from homography transformations to data-driven BEV modeling [13][19].
- Camera-based methods are sensitive to lighting changes and occlusions, while LiDAR methods contend with point-cloud sparsity and performance degradation in adverse weather [19][41].

Group 5: SafeBEV 2.0 - Multi-Modal Vehicle Perception
- Multi-modal BEV perception integrates cameras, LiDAR, and radar to improve performance and robustness in challenging conditions [42][45].
- Fusion strategies fall into five categories (camera-radar, camera-LiDAR, radar-LiDAR, camera-LiDAR-radar, and temporal fusion), each exploiting the complementary characteristics of the sensors [42][45].

Group 6: SafeBEV 3.0 - Multi-Agent Collaborative Perception
- Vehicle-to-Everything (V2X) technology lets autonomous vehicles exchange information and reason jointly, overcoming the limits of single-agent perception [15][16].
- Collaborative perception aggregates multi-source sensor data in a unified BEV space, enabling global environmental modeling and safer navigation in dynamic traffic [15][16].

Group 7: Challenges and Future Directions
- Key open-world challenges include open-set recognition, large-scale unlabeled data, sensor performance degradation, and communication delays among agents [17].
- Future research directions include integrating BEV perception with end-to-end autonomous driving systems, embodied intelligence, and large language models [17].
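The "unified BEV coordinate system" in Group 1 is easiest to see for LiDAR, where projection is a direct scatter of ego-frame points into a ground-plane grid; camera and radar features would be projected into the same plane. A minimal sketch, with assumed ranges and resolution:

```python
import numpy as np

def lidar_to_bev(points, x_range=(-50, 50), y_range=(-50, 50), res=0.5):
    """points: (N, 3) xyz in metres, ego frame -> (H, W) BEV occupancy grid."""
    h = int((y_range[1] - y_range[0]) / res)
    w = int((x_range[1] - x_range[0]) / res)
    bev = np.zeros((h, w), dtype=np.float32)
    cols = ((points[:, 0] - x_range[0]) / res).astype(int)  # x -> column
    rows = ((points[:, 1] - y_range[0]) / res).astype(int)  # y -> row
    keep = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    bev[rows[keep], cols[keep]] = 1.0  # mark occupied cells
    return bev
```

Once every modality lands in this shared (row, col) frame, fusion reduces to channel-wise operations on aligned grids, which is exactly the simplification the survey credits BEV with.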
Paper walkthrough: HKUST's PLUTO, the first planner to surpass rule-based baselines!
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article discusses the development and features of the PLUTO model within the end-to-end autonomous driving domain, emphasizing its two-stage architecture and its direct encoding of structured perception outputs for downstream control tasks [1][2].

Summary by Sections

Overview of PLUTO
- PLUTO is trained with three main losses: a regression loss, a classification loss, and an imitation learning loss, which together drive the model's performance (a loss sketch follows this summary) [7].
- Additional auxiliary losses are incorporated to aid convergence [9].

Course Introduction
- The article introduces a new course, "End-to-End and VLA Autonomous Driving," developed with top algorithm experts from leading domestic manufacturers and aimed at the challenges learners face in this fast-moving field [12][15].

Learning Challenges
- The course addresses the difficulty of keeping pace with rapid technical development and the fragmentation of knowledge across domains, which makes it hard for beginners to grasp the necessary concepts [13].

Course Features
- The course is designed for quick entry into the field, building a framework for research capability, and combining theory with practical application [15][16][17].

Course Outline
- The chapters cover the history and evolution of end-to-end algorithms, background knowledge on the relevant technologies, and detailed treatments of both one-stage and two-stage end-to-end methods [20][21][22][29].

Practical Application
- Practical assignments, such as RLHF fine-tuning, let students apply their theoretical knowledge in real-world scenarios [31].

Instructor Background
- The instructor, Jason, has a strong academic and industrial background in cutting-edge end-to-end and large-model algorithms, lending the course credibility [32].

Target Audience and Expected Outcomes
- The course targets individuals with a foundational understanding of autonomous driving and related technologies, aiming to raise their skills to the level of an end-to-end autonomous driving algorithm engineer with one year of experience [36].
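To ground the three losses named in the overview, here is a hedged sketch of a PLUTO-style composite objective for a multi-mode planner: winner-takes-all regression on the best mode, cross-entropy classification of that mode, and an imitation term against the human trajectory. Weights, shapes, and the exact loss forms are assumptions for illustration, not PLUTO's definitions.

```python
import torch
import torch.nn.functional as F

def pluto_style_loss(pred_trajs, mode_logits, gt_traj, aux_losses=(), w=(1.0, 0.5, 1.0)):
    """pred_trajs: (B, M, T, 2) multi-mode plans; mode_logits: (B, M);
    gt_traj: (B, T, 2) human trajectory; aux_losses: extra scalar terms."""
    # Distance of each mode to ground truth; pick the closest (winner-takes-all).
    dists = torch.linalg.norm(pred_trajs - gt_traj[:, None], dim=-1).mean(-1)  # (B, M)
    best = dists.argmin(dim=1)
    reg = dists.gather(1, best[:, None]).mean()        # regression on best mode
    cls = F.cross_entropy(mode_logits, best)           # classify the best mode
    imi = F.smooth_l1_loss(                            # imitation of human driving
        pred_trajs[torch.arange(len(best)), best], gt_traj)
    # Auxiliary losses (e.g. perception heads) are summed in, as noted above.
    return w[0] * reg + w[1] * cls + w[2] * imi + sum(aux_losses)
```

Supervising only the best mode keeps the other modes free to cover alternative behaviors, which is the usual reason multi-mode planners avoid averaging losses over all modes.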
As research, VLA at least offers a possible escape from endless corner cases!
自动驾驶之心· 2025-09-15 03:56
Core Viewpoint
- VLA (Vision-Language-Action) is emerging as a mainstream keyword in autonomous driving: new players are entering the field rapidly, industrial deployment is accelerating, and academia continues to innovate and compete [1][2].

Summary by Sections

1. VLA Research and Development
- The VLA model marks a shift from traditional modular architectures to a unified end-to-end model that directly maps raw sensor inputs to driving control commands, addressing earlier bottlenecks in autonomous driving technology [3][4].
- Traditional modular pipelines for L2-L4 driving offer clear logic and independent debugging, but suffer from cumulative errors and information loss, making them less effective in complex traffic scenarios [4][5].

2. VLA Model Advantages
- VLA models leverage the strengths of large language models (LLMs) to improve interpretability and reliability and to generalize to unseen scenarios, overcoming limitations of earlier models [5][6].
- VLA models can explain their decision-making in natural language, improving transparency and trust in autonomous systems (a minimal structural sketch follows this summary) [5][6].

3. Course Objectives and Structure
- The course aims to build a systematic understanding of VLA, developing practical skills in model design and research-paper writing while addressing common challenges faced by newcomers to the field [6][7].
- The curriculum comprises 12 weeks of online group research, followed by 2 weeks of paper guidance and 10 weeks of paper maintenance, covering both theoretical knowledge and practical coding skills [7][8].

4. Enrollment and Requirements
- The program is limited to a small group of 6 to 8 participants and targets individuals with a foundation in deep learning and basic programming skills [11][16].
- Participants are expected to engage actively in discussions, complete assignments on time, and maintain academic integrity throughout the course [20][29].

5. Course Highlights
- The course offers a comprehensive learning experience with a multi-faceted teaching approach, including guidance from experienced mentors and a structured evaluation system to track progress [23][24].
- Participants gain access to essential resources, including datasets and baseline code, to support their research and experimentation [24][25].
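As a structural illustration of the VLA pattern described in sections 1 and 2 (raw sensor input plus a language instruction mapped to control commands), here is a deliberately minimal sketch. Every module choice and dimension is an assumption; real VLA systems use a pretrained image backbone and an LLM where the toy encoder and Transformer stand in below.

```python
import torch

class TinyVLA(torch.nn.Module):
    """Toy vision + language -> action skeleton (illustrative only)."""

    def __init__(self, dim=256, n_actions=2):
        super().__init__()
        self.vision = torch.nn.Sequential(        # stand-in for an image backbone
            torch.nn.Conv2d(3, 32, 8, stride=4), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(32, dim))
        self.fuse = torch.nn.TransformerEncoder(   # stand-in for an LLM backbone
            torch.nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 2)
        self.action_head = torch.nn.Linear(dim, n_actions)  # e.g. steer, accel

    def forward(self, images, text_emb):
        """images: (B, 3, H, W); text_emb: (B, L, dim) instruction tokens."""
        v = self.vision(images)[:, None, :]          # (B, 1, dim) vision token
        h = self.fuse(torch.cat([v, text_emb], 1))   # joint vision-language tokens
        return self.action_head(h[:, 0])             # control from the fused token
```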
End-to-end evolves again! Building a thinking autonomous driving policy with diffusion models and MoE (Tongji University)
自动驾驶之心· 2025-09-14 23:33
Core Viewpoint
- The article presents Knowledge-Driven Diffusion Policy (KDP), a novel end-to-end autonomous driving strategy that integrates diffusion models and Mixture of Experts (MoE) to strengthen decision-making in complex driving scenarios [4][72].

Group 1: Challenges in Current Autonomous Driving Approaches
- Existing end-to-end methods handle multimodal action distributions poorly, leading to unsafe or hesitant driving behavior [2][8].
- Reinforcement learning methods require extensive data and train unstably, making them hard to scale in high-safety real-world settings [2][8].
- Recent large models, including vision-language models, show promise in scene understanding but struggle with inference speed and safety in continuous-control settings [3][10].

Group 2: Diffusion Models and Their Application
- Diffusion models are transforming generative modeling across fields, offering a robust way to express diverse driving choices while maintaining temporal consistency and training stability [3][12].
- The diffusion policy (DP) treats action generation as a denoising process, addressing both the diversity and the long-horizon stability of driving decisions [3][12].

Group 3: Mixture of Experts (MoE) Framework
- MoE activates only a small number of experts per input, improving computational efficiency and modularity in large models [3][15].
- In autonomous driving, MoE has been applied to multi-task strategies, but existing designs often limit expert reusability and flexibility [3][15].

Group 4: Knowledge-Driven Diffusion Policy (KDP)
- KDP combines the strengths of diffusion models and MoE: trajectory generation stays diverse and stable, while experts are organized into structured "knowledge units" that can be flexibly combined across driving scenarios (a minimal sketch follows this summary) [4][6].
- Experiments demonstrate KDP's advantages in diversity, stability, and generalization over traditional methods [4][6].

Group 5: Experimental Validation
- The method was evaluated in a simulation environment with diverse driving scenarios, outperforming existing baselines in safety, generalization, and efficiency [39][49].
- KDP achieved a 100% success rate in simpler scenarios and maintained high performance in more complex environments, indicating its robustness [57][72].
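A hedged sketch of the core combination described above: inside each diffusion denoising step, a sparse MoE routes the noisy action through a few "knowledge unit" experts selected from the scene context. Top-2 routing, the expert shape, and all names are illustrative assumptions, not KDP's architecture.

```python
import torch
import torch.nn.functional as F

class MoEDenoiser(torch.nn.Module):
    """Sparse-MoE denoising block for a diffusion policy (assumed design)."""

    def __init__(self, dim=128, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = torch.nn.Linear(dim, n_experts)   # scene-conditioned router
        self.experts = torch.nn.ModuleList(
            [torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.GELU(),
                                 torch.nn.Linear(dim, dim))
             for _ in range(n_experts)])              # one "knowledge unit" each

    def forward(self, noisy_action, scene_ctx):
        """noisy_action, scene_ctx: (B, dim). Returns the denoising residual."""
        w = F.softmax(self.gate(scene_ctx), dim=-1)   # (B, n_experts)
        topw, topi = w.topk(self.k, dim=-1)           # activate only k experts
        topw = topw / topw.sum(-1, keepdim=True)
        out = torch.zeros_like(noisy_action)
        for j in range(self.k):
            for b in range(noisy_action.size(0)):     # per-sample routing (clarity
                e = self.experts[topi[b, j]]          # over speed in this sketch)
                out[b] += topw[b, j] * e(noisy_action[b:b + 1]).squeeze(0)
        return out  # consumed by the diffusion sampler at each denoising step
```

Keeping the experts as self-contained feed-forward units is what makes them reusable across scenarios: the router composes them per scene instead of binding each expert to one task.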