自动驾驶之心
Some Thoughts on Applying Reinforcement Learning in Autonomous Driving
自动驾驶之心· 2025-12-23 00:53
Core Viewpoint
- The article discusses reinforcement learning (RL) fine-tuning for trajectory planning in autonomous driving, emphasizing the transition from open-loop to closed-loop training to make trained models more effective [3][4].

Group 1: Training Methodology
- Mainstream learning-based planning modules typically rely on imitation learning, which can struggle with out-of-distribution scenarios during real-world testing [3].
- A closed-loop training approach is proposed that simulates the real-vehicle testing environment, making it more effective than open-loop training [4].
- The article introduces a network structure based on Waymo's earlier work, MotionLM, which outputs trajectories autoregressively so that causal relationships are maintained [4][6].

Group 2: Input and Output Structure
- The network's input is scene-centered, summarizing static information over a specified time window rather than relying on the current frame alone, which helps prevent the vehicle from driving outside the perceived road [6].
- Many imitation learning methods combine single-frame perception with several seconds of ground truth (GT), which can introduce causal inconsistencies when the perception range is limited [7].

Group 3: Reward Function and Training Phases
- Training consists of two phases, pretraining and reinforcement learning, with a simple reward function that balances efficiency and safety by considering both GT fitting and collision avoidance [11].
- Rewards are normalized across all samples and time steps, which allows the critic network to be omitted, similar to the GRPO method [13].

Group 4: Challenges and Future Directions
- The article notes that many imitation learning methods introduce auxiliary losses that can push the model toward undesirable outputs, highlighting the limitations of open-loop training [14].
- The core value of reinforcement learning lies in closed-loop learning, which can significantly enhance model capabilities even with smaller datasets [14].
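The GRPO-style trick mentioned in the summary above, normalizing rewards over the whole group of sampled trajectories and time steps so that no learned critic is needed, can be sketched in a few lines. This is an illustrative reconstruction, not the article's actual code; the array shape, seed, and function name are assumptions.

```python
import numpy as np

def group_normalized_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Turn raw rewards into advantages by normalizing over the whole group
    of sampled trajectories and time steps; the group mean acts as the
    baseline, so no separate critic (value network) is required."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Toy example: 4 sampled trajectories, 5 time steps each.
rng = np.random.default_rng(0)
rewards = rng.normal(size=(4, 5))
adv = group_normalized_advantages(rewards)
print(adv.mean(), adv.std())  # ~0 and ~1 after normalization
```

Because the baseline is just the group statistics, the advantage of each (trajectory, step) pair says how much better it did than the batch as a whole, which is exactly what lets the critic be dropped.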
Led by an industry veteran: build an autonomous-driving world model in three months...
自动驾驶之心· 2025-12-22 09:20
World models are a technology that companies across the industry are racing to build. The technical trend is now clear: world models can be applied to data generation, closed-loop simulation, and more. Abroad, Tesla builds its world simulator on feed-forward GS; at home, Xiaomi and Li Auto use world models for long-tail data generation and end-to-end closed-loop simulation. Demand for these roles will only grow next year. 自动驾驶之心 has therefore teamed up with an industry expert to launch a new small-group course, "World Models and Autonomous Driving", which focuses on general world models, video generation, OCC generation, and related world-model algorithms, covering Tesla's world model, the Marble work from Fei-Fei Li's team, and more. Early-bird pricing runs until the course starts. Instructor: Jason, a C9 undergraduate with a PhD from a QS top-50 university, author of two CCF-A papers and several CCF-B papers; currently an algorithm expert at a top domestic OEM working on pre-research and mass production of end-to-end, large-model, and world-model algorithms, with multiple production deliveries of autonomous driving perception and end-to-end algorithms behind him. Course outline: Chapter 1 introduces world models; Chapter 3 turns to general world models and recent hot topics in autonomous driving, including Marble from Fei-Fei Li's team, Deep ...
HKU Leads DrivePI: A Spatially Intelligent 4D MLLM Unifying Autonomous Driving Understanding, Perception, Prediction, and Planning
自动驾驶之心· 2025-12-22 09:20
Core Viewpoint
- DrivePI is introduced as a novel unified spatial-aware 4D multimodal large language model (MLLM) framework that integrates coarse-grained language understanding with fine-grained 3D perception, bridging the gap between vision-based and VLA paradigms in autonomous driving [2][38].

Group 1: Project Overview
- DrivePI is led by the University of Hong Kong, with contributions from companies such as Huawei and universities including Tianjin University and Huazhong University of Science and Technology [2].
- The model performs spatial understanding, 3D perception, prediction, and planning through end-to-end optimization, handling complex autonomous driving scenarios [4][6].

Group 2: Technical Innovations
- DrivePI adopts a multimodal perception approach, using LiDAR alongside camera images to strengthen spatial understanding and provide accurate 3D geometric information [11].
- The model generates intermediate fine-grained 3D perception and prediction representations, ensuring reliable spatial awareness and improving the interpretability and safety of autonomous driving systems [11].
- A rich data engine seamlessly integrates 3D occupancy and flow representations into natural-language scene descriptions, allowing the model to understand complex spatiotemporal dynamics [11].

Group 3: Performance Metrics
- DrivePI outperforms existing VLA models, achieving 2.5% higher average accuracy on nuScenes-QA than OpenDriveVLA-7B and cutting the collision rate by roughly 70%, from 0.37% to 0.11% [5][16].
- In 3D occupancy and flow prediction, DrivePI achieved 49.3% OccScore and 49.3% RayIoU, surpassing the FB-OCC method by 10.3 percentage points [15][21].
- The model demonstrated a 32% reduction in L2 error for trajectory planning compared to VAD [16].
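As a quick sanity check on the numbers reported above, the roughly 70% collision-rate reduction follows directly from the two rates cited:

```python
# Relative collision-rate reduction implied by the reported figures.
baseline, drivepi = 0.37, 0.11  # collision rates, in percent
reduction = (baseline - drivepi) / baseline
print(f"{reduction:.1%}")  # about 70%, consistent with the cited reduction
```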
Group 4: Data Engine and Annotation
- The data engine for DrivePI operates in three main stages, focusing on generating diverse question-answer pairs for 4D spatial understanding and planning reasoning [12][18].
- Scene-understanding annotations are generated to avoid confusion when distinguishing different views, improving the model's ability to interpret multiple perspectives [18].

Group 5: Ablation Studies and Insights
- Ablation studies indicate that combining text and visual heads improves performance across most tasks, demonstrating the effectiveness of unifying text understanding with 3D perception, prediction, and planning [23].
- Varying the scale of text data revealed significant improvements in occupancy-state prediction accuracy as training data size grows [26].

Group 6: Future Prospects
- DrivePI is expected to inspire future research directions in autonomous driving by improving interpretability and decision-making through language reasoning and detailed 3D outputs [38].
How Far Do a Graduate Student's Experiments Need to Go Before Writing a Short Paper?
自动驾驶之心· 2025-12-22 03:23
Core Viewpoint
- The article emphasizes the importance of timely submission of academic papers, particularly for graduate students, arguing that a complete research story is more valuable than novelty [1].

Group 1: Academic Guidance Services
- The company offers a paper-guidance service aimed at efficiently producing research results within a limited timeframe, helping students avoid the common pitfalls of writing alone [2].
- The guidance covers advanced topics such as reinforcement learning, 3D object detection, and multi-sensor fusion, with tailored advice based on individual research directions [3].
- The service targets students facing unclear direction, difficulty reproducing code, or a lack of systematic research training [5].

Group 2: Instructor Qualifications
- All instructors come from universities ranked in the QS global top 100, with multiple publications at A-level conferences and extensive project experience [6].

Group 3: Comprehensive Academic Support
- Support spans journal papers, conference papers, and thesis projects, providing a comprehensive approach to academic success [8].
- The service is results-oriented, offering continuous support until the paper is submitted, with a focus on building coding skills alongside research guidance [8].

Group 4: FAQs and Additional Information
- Even students with no prior experience can publish by following structured courses, with the potential to produce a short paper within six months [11].
- Outstanding students may receive recommendation letters from prestigious institutions and internship opportunities at leading companies, framing paper publication as just the start of their academic journey [11].
- Pricing for the services varies based on the publication target, with detailed consultations provided to tailor support to individual needs [11].
A Walkthrough of the DiffusionDriveV2 Core Code
自动驾驶之心· 2025-12-22 03:23
Core Viewpoint
- The article analyzes the DiffusionDrive model, which uses a truncated diffusion approach for end-to-end autonomous driving, covering its architecture and the integration of reinforcement learning to improve trajectory planning and safety [1].

Group 1: Model Architecture
- DiffusionDriveV2 incorporates reinforcement learning constraints within a truncated diffusion modeling framework for autonomous driving [3].
- Environment encoding combines bird's-eye view (BEV) features with vehicle status for effective data processing [5].
- The trajectory planning module employs multi-scale BEV features to improve the accuracy of trajectory prediction [8].

Group 2: Trajectory Generation
- Trajectories are generated by first clustering ground-truth future trajectories with K-Means to create anchors, which are then perturbed with Gaussian noise to simulate variation [12].
- Trajectory prediction uses cross-attention to fuse trajectory features with BEV features, strengthening the model's predictive capability [15][17].
- The final trajectory combines the predicted offsets with the original trajectory, ensuring continuity and coherence [22].

Group 3: Reinforcement Learning and Safety
- The Intra-Anchor GRPO method optimizes policies within specific behavioral intentions, improving safety and goal-oriented trajectory generation [27].
- A comprehensive scoring system evaluates generated trajectories on safety, comfort, rule compliance, progress, and feasibility, ensuring robust performance across driving scenarios [28].
- A modified advantage-estimation approach provides clear learning signals, penalizing trajectories that result in collisions [30].
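The anchor-generation step described above, K-Means over ground-truth futures followed by Gaussian perturbation, can be sketched as follows. This is a minimal illustration with a hand-rolled K-Means; the shapes, parameter names, and defaults are assumptions, not the repository's actual code.

```python
import numpy as np

def kmeans_anchors(gt_trajs, k=6, iters=20, noise_std=0.3, seed=0):
    """Cluster flattened GT future trajectories into k anchor trajectories
    with plain K-Means, then perturb the anchors with Gaussian noise."""
    rng = np.random.default_rng(seed)
    n, t, d = gt_trajs.shape                  # (num_trajs, timesteps, xy)
    flat = gt_trajs.reshape(n, t * d)
    centers = flat[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):
        # Assign each trajectory to its nearest center, then recompute means.
        dists = np.linalg.norm(flat[:, None, :] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = flat[labels == j].mean(axis=0)
    anchors = centers.reshape(k, t, d)
    noisy = anchors + rng.normal(0.0, noise_std, anchors.shape)
    return anchors, noisy

gt = np.random.default_rng(1).normal(size=(100, 8, 2))  # toy trajectories
anchors, noisy = kmeans_anchors(gt)
print(anchors.shape, noisy.shape)  # (6, 8, 2) (6, 8, 2)
```

The anchors give the diffusion process plausible starting modes, and the noise supplies the perturbed variants the summary describes.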
Group 4: Noise and Exploration
- Multiplicative noise maintains trajectory smoothness, addressing the inherent scale differences between proximal and distal trajectory segments [33].
- This contrasts with additive noise, which can disrupt trajectory integrity, thereby improving the quality of exploration during training [35].

Group 5: Loss Function and Training
- The total loss combines reinforcement learning loss with imitation learning loss to prevent overfitting and preserve general driving capability [39].
- Trajectory recovery and classification confidence both contribute to the overall loss, guiding the model toward accurate trajectory predictions [42].
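The multiplicative-versus-additive contrast above can be illustrated on a toy trajectory: multiplying by (1 + ε) scales the perturbation with waypoint magnitude, so nearby waypoints are barely disturbed while distant ones explore more. The trajectory and noise scales below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy ego trajectory: cumulative forward motion, so far waypoints are larger.
traj = np.cumsum(np.ones((10, 2)) * 0.8, axis=0)

additive = traj + rng.normal(0.0, 0.5, traj.shape)             # same noise everywhere
multiplicative = traj * (1 + rng.normal(0.0, 0.05, traj.shape))  # noise scales with magnitude

# Near waypoints are barely moved by multiplicative noise while far ones
# explore more, so the perturbed path stays smooth near the ego vehicle.
print(np.abs(multiplicative - traj)[0], np.abs(multiplicative - traj)[-1])
```

With additive noise the first waypoint can jump as far as the last one, which is exactly the trajectory-integrity problem the summary attributes to it.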
Seven Xiaomi Papers Accepted at Top Conference AAAI, Covering Frontier Fields!
自动驾驶之心· 2025-12-22 03:23
Core Viewpoint
- Xiaomi has made significant strides in AI research, with seven papers accepted at AAAI 2026, showcasing comprehensive capabilities across sound editing, speech Q&A, embodied intelligence, and autonomous driving [5][6][41].

Group 1: Research Achievements
- The seven accepted papers span a wide range of AI research areas, reflecting Xiaomi's commitment to foundational technology and long-term investment in AI [6][41].
- Topics include sound-effect editing, speech question answering, 3D embodied agents, vision-language navigation, retrieval models, inference decoding strategies, and autonomous driving [6][41].

Group 2: AutoLink Framework
- AutoLink addresses large-scale text-to-SQL by letting the model explore database schemas iteratively rather than loading everything at once, achieving strict recall of 97.4% on Bird-Dev and 91.2% on Spider-2.0-Lite [9][10].
- The framework lets LLMs act as agents that dynamically identify the schema parts relevant to SQL generation, improving efficiency and scalability [10].

Group 3: SpecFormer Model
- SpecFormer redefines the role of draft models in speculative decoding by combining unidirectional and bidirectional attention, enabling faster decoding without complex draft trees [12][13][15].
- The model understands context while generating predictions in parallel, lowering training costs and improving hardware compatibility for large-scale deployment [15].

Group 4: CLSR for Long-form Speech
- CLSR (Contrastive Language-Speech Retriever) improves long-form speech question answering by extracting relevant segments from lengthy audio recordings, improving accuracy and efficiency [17][20].
- By filtering out irrelevant information, it lets large models focus on key content, significantly improving performance on speech Q&A tasks [20].
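The iterative schema exploration attributed to AutoLink above can be sketched conceptually: start from tables matched to the question and expand along foreign keys, so only a relevant schema slice ever reaches the LLM. Everything below (the schema, function names, and hop limit) is illustrative, not AutoLink's actual interface.

```python
# Conceptual sketch of iterative schema linking for large-scale text-to-SQL:
# instead of feeding the whole schema to the model, start from seed tables
# that match the question and expand along foreign keys on demand.
SCHEMA = {
    "orders":    {"columns": ["id", "user_id", "total"], "fk": {"user_id": "users"}},
    "users":     {"columns": ["id", "name", "city_id"],  "fk": {"city_id": "cities"}},
    "cities":    {"columns": ["id", "name"],             "fk": {}},
    "inventory": {"columns": ["sku", "qty"],             "fk": {}},
}

def explore_schema(question: str, seed_tables, max_hops: int = 2):
    """Iteratively collect only the schema slice needed for SQL generation."""
    linked, frontier = set(), list(seed_tables)
    for _ in range(max_hops + 1):
        next_frontier = []
        for table in frontier:
            if table in linked or table not in SCHEMA:
                continue
            linked.add(table)
            next_frontier.extend(SCHEMA[table]["fk"].values())  # follow FKs
        frontier = next_frontier
    return sorted(linked)

# "Total spend per city" only needs orders -> users -> cities, not inventory.
print(explore_schema("total spend per city", ["orders"]))
```

In a real system the seed tables would come from retrieval over the question, and the expansion would be steered by the model rather than exhaustive.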
Group 5: AV-Edit for Sound Editing
- AV-Edit integrates visual, audio, and textual semantics for precise, contextually relevant sound modifications [21][24].
- Its tri-modal generative framework achieves high-quality sound editing aligned with video content, outperforming traditional methods [24].

Group 6: ORS3D for Task Scheduling
- ORS3D introduces a new task definition for embodied agents, focusing on parallel task execution and efficient scheduling in 3D environments [26][29].
- The GRANT model incorporates scheduling tokens to optimize task execution, showing competitive performance in language understanding and spatial reasoning [28][29].

Group 7: SpNav for Spatial Navigation
- SpNav combines high-level human instructions with spatial understanding, enabling robots to navigate complex environments effectively [33][35].
- The framework trains agents on a dataset of 10,000 trajectories to understand spatial descriptions and execute precise navigation plans [35].

Group 8: VILTA for Autonomous Driving
- VILTA (VLA-in-the-Loop Trajectory Adversary) strengthens autonomous driving policies by generating adversarial trajectories for rare, complex scenarios, improving system robustness [37][40].
- The method integrates vision-language models to refine trajectory generation, ensuring paths that are both diverse and physically feasible [40].
An Industry Team Lead's Analysis of Waymo's Foundation Model
自动驾驶之心· 2025-12-22 00:42
Core Insights
- Waymo's latest blog discusses safety validation and explainability under a new end-to-end paradigm, the operation of its large-scale driving model, and the data flywheel concept [2][4][8].

Group 1: Safety Validation and Explainability
- Safety validation and explainability are closely tied to Waymo's foundation model, which operates as a dual system: a fast system focused on perception and a slow system based on a Vision-Language Model (VLM) [2][4].
- The VLM handles complex semantic reasoning, using rich camera data and fine-tuning on Waymo's driving data to address rare, complex scenarios such as navigating around a vehicle on fire [4][5][7].

Group 2: Data Flywheel Concept
- Waymo's data flywheel pairs an inner loop of reinforcement-learning-based simulation-validation-vehicle integration with an outer loop based on real-vehicle testing [8][11].
- Its key lessons are the importance of mining vehicle data and the reliance on world-model-based generative simulation [12].

Group 3: Foundation Model Applications
- The foundation model serves three main purposes: vehicle data extraction, cloud simulation, and safety and explainability evaluation under the new paradigm [6][11].
- Its architecture recasts vehicle trajectory prediction as a next-token prediction task, leveraging large language models for improved performance [5][11].
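Recasting trajectory prediction as next-token prediction, as described above, hinges on quantizing continuous waypoint motion into a discrete vocabulary that a language-model-style decoder can emit autoregressively. A minimal sketch follows; the bin count, delta range, and function names are invented for illustration and do not reflect Waymo's actual tokenization.

```python
import numpy as np

NUM_BINS = 13               # bins per axis (illustrative)
DELTA_RANGE = (-3.0, 3.0)   # meters of motion per step (illustrative)

def deltas_to_tokens(deltas: np.ndarray) -> np.ndarray:
    """Quantize (T, 2) waypoint deltas into one discrete token id per step."""
    edges = np.linspace(*DELTA_RANGE, NUM_BINS + 1)
    bins = np.clip(np.digitize(deltas, edges) - 1, 0, NUM_BINS - 1)
    return bins[:, 0] * NUM_BINS + bins[:, 1]   # joint (x, y) token

def tokens_to_deltas(tokens: np.ndarray) -> np.ndarray:
    """Invert tokens back to bin centers (lossy, like any quantization)."""
    step = (DELTA_RANGE[1] - DELTA_RANGE[0]) / NUM_BINS
    bx, by = tokens // NUM_BINS, tokens % NUM_BINS
    return DELTA_RANGE[0] + step * (np.stack([bx, by], axis=1) + 0.5)

deltas = np.array([[0.5, 0.0], [1.0, 0.1], [1.2, -0.2]])
tokens = deltas_to_tokens(deltas)
recon = tokens_to_deltas(tokens)
print(tokens, np.abs(recon - deltas).max())  # quantization error < half a bin width
```

Once motion lives in a token vocabulary like this, standard LLM machinery (autoregressive decoding, next-token loss) applies directly to driving trajectories.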
Feed-forward GS Work Has Been Exploding Lately
自动驾驶之心· 2025-12-22 00:42
Core Viewpoint
- The article surveys advances in 3D Gaussian Splatting (3DGS) for autonomous driving, highlighting the rise of feed-forward GS algorithms and the need for effective learning pathways for newcomers to the field [2][4].

Group 1: Course Overview
- A new course, "3DGS Theory and Algorithm Practical Tutorial", provides a comprehensive learning roadmap for 3DGS, covering both theory and practice [4].
- The course helps participants understand point-cloud processing, deep learning theory, real-time rendering, and coding practice [4].

Group 2: Course Structure
- Six chapters progress from foundational computer graphics through 3DGS principles and algorithms, including dynamic and surface reconstruction [8][9].
- The third chapter focuses on 3DGS in autonomous driving simulation, covering key works and industry tools [10].
- Later chapters explore major 3DGS research directions, including COLMAP extensions and depth estimation, as well as emerging feed-forward 3DGS techniques [11][12].

Group 3: Target Audience and Requirements
- The course targets individuals with backgrounds in computer graphics, visual reconstruction, and programming, particularly those familiar with Python and PyTorch [17].
- Participants should have access to a GPU; an RTX 4090 or better is recommended to engage fully with the course content [17].
A Vision-only Approach from SYSU & HKUST: High-precision Trajectory Video Generation with 3DGS
自动驾驶之心· 2025-12-22 00:42
Source: 深蓝AI, a learning platform focused on artificial intelligence, robotics, and autonomous driving. Original article: "Vision-only! New work from Sun Yat-sen University & HKUST: high-precision trajectory video generation based on 3DGS" (original title: 纯视觉方案!中山大学&港科大新作:基于3DGS实现高精度轨迹视频生成).

"No inpainting, no LiDAR dependence." In autonomous driving, multi-trajectory, multi-view video data is close to a hard requirement: it determines the completeness of 3D reconstruction and directly affects how well world models and planning systems generalize. Reality, however, is harsh. Capturing multiple strictly synchronized driving videos of the same road at different lateral positions is extremely costly: it takes either multi-vehicle coordination or repeated runs over the same segment, and it introduces inconsistencies in timing and dynamic objects. Researchers therefore began to ask: can a single real driving video be used to automatically "generate" a video along an adjacent trajectory? It sounds simple, but in practice it runs into two major pitfalls. Sun Yat-sen University and HKUST propose ReCamDriving, a fully vision-based trajectory-video generation method with precise camera-trajectory control: no inpainting, no LiDAR, just a different approach to controlling the camera. Title: ...
The Technology-obsessed "Huangpu Military Academy" of Autonomous Driving Is About to Reach 4,500 Members
自动驾驶之心· 2025-12-21 11:54
Core Insights
- The article describes a comprehensive autonomous driving community that offers a platform for knowledge sharing, technical discussion, and career opportunities in the field [21][25].

Group 1: Community and Learning Resources
- The "Autonomous Driving Heart Knowledge Planet" hosts discussions of academic and engineering problems in autonomous driving, gathering members from renowned universities and leading companies in the industry [21][22].
- The community has compiled over 40 technical routes and resources, including open-source projects, datasets, and learning paths across the field [22][40].
- Members can access exclusive learning videos and join discussions with industry experts, deepening their understanding of the latest trends and technologies [25][90].

Group 2: Technical Insights and Developments
- Recent updates include insights from industry leaders on end-to-end autonomous driving, multimodal large models, and the integration of various sensor technologies [6][10].
- The community shares advances in VLA (vision-language-action) models, BEV (bird's-eye view) perception, and 3D object detection, all crucial to the development of autonomous systems [48][56].
- Ongoing discussions cover practical challenges such as data processing, simulation frameworks, and real-world deployment strategies [9][42].

Group 3: Career Development and Networking
- The community offers job referrals and career advice, connecting members with potential employers in the autonomous driving sector [15][25].
- Regular interactions with industry veterans give members insight into job opportunities, skill requirements, and emerging trends in the autonomous driving landscape [10][95].
- The platform aims to grow its membership to nearly 10,000 within two years, fostering a vibrant network for both beginners and experienced professionals in the field [7][21].