Latest from Tongji University! GEMINUS: End-to-End MoE Sets a New Closed-Loop SOTA, with Performance Up Nearly 8%
自动驾驶之心· 2025-07-22 12:46
Core Viewpoint - The article presents GEMINUS, a novel end-to-end autonomous driving framework built on a dual-aware mixture-of-experts (MoE) architecture, which achieves state-of-the-art driving score and success rate using only monocular vision input [1][2][49].

Summary by Sections

Introduction - GEMINUS addresses the limitations of traditional single-mode planning in autonomous driving by combining a global expert with a group of scene-adaptive experts, coordinated by a dual-aware router, to improve adaptability and robustness across diverse driving scenarios [1][6].

Background - The article traces the evolution of end-to-end autonomous driving from modular pipelines to unified models that map sensor inputs directly to control signals, reducing engineering workload and exploiting rich sensor information [4][8].

MoE Architecture - MoE architectures handle complex data distributions well, offering fine-grained scene adaptability and specialized behavior generation, which mitigates the mode-averaging problem prevalent in existing models [5][11].

GEMINUS Framework - GEMINUS pairs a global expert trained on the full dataset for robust performance with scene-adaptive experts trained on scene-specific subsets for adaptability. The dual-aware router dynamically activates the appropriate expert based on scene features and routing uncertainty (a toy sketch of this routing rule follows below) [6][18].

Experimental Results - On the Bench2Drive closed-loop benchmark, GEMINUS outperformed existing methods, improving driving score by 7.67% and success rate by 22.06% over the original single-expert baseline [2][36][49].

Ablation Studies - Ablations show that the scene-aware routing mechanism drives most of the performance gain, while adding uncertainty-aware routing and the global expert further improves robustness and stability in ambiguous scenarios [40][41].

Conclusion - GEMINUS marks a significant advance in end-to-end autonomous driving, achieving state-of-the-art performance from monocular vision alone and underscoring the value of tailored MoE frameworks for the complexity of real-world driving [49][50].
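The dual-aware routing idea can be made concrete with a toy sketch: a scene-aware gate scores the scene-adaptive experts, and when the routing distribution is too uncertain (high entropy), control falls back to the global expert. This is a minimal, hypothetical reading in Python; the feature dimension, expert count, and entropy threshold are illustrative assumptions, not the paper's implementation, which may fuse expert outputs differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAwareRouter(nn.Module):
    """Toy dual-aware router: scene-aware gating plus an
    uncertainty-based fallback to a global expert."""
    def __init__(self, feat_dim=256, n_scene_experts=4, tau=1.2):
        super().__init__()
        self.gate = nn.Linear(feat_dim, n_scene_experts)
        self.tau = tau  # entropy threshold marking the routing as "uncertain"

    def forward(self, scene_feat):
        logits = self.gate(scene_feat)               # (B, E) scene-aware scores
        probs = F.softmax(logits, dim=-1)
        # High entropy means no scene expert clearly fits the scene.
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # (B,)
        expert_idx = probs.argmax(-1)                # chosen scene expert
        use_global = entropy > self.tau              # ambiguous: use global expert
        return expert_idx, use_global

router = DualAwareRouter()
idx, fallback = router(torch.randn(3, 256))
print(idx.tolist(), fallback.tolist())
```

The sketch mirrors the summary's split of responsibilities: scene features drive specialization, while routing uncertainty guards robustness in ambiguous scenes.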
Xiaomi Proposes DriveMRP: Synthetic Hard-Case Data + Visual Prompts Drive Accident Recognition Up to 88%!
自动驾驶之心· 2025-07-22 12:46
Core Viewpoint - The article discusses advances in autonomous driving technology, focusing on the DriveMRP framework, which synthesizes high-risk motion data to strengthen the motion risk prediction capabilities of vision-language models (VLMs) [1][4].

Background and Core Objectives - Autonomous driving technology has developed rapidly, but accurately predicting the safety of ego-vehicle motion in rare high-risk scenarios remains a significant challenge. Existing trajectory evaluation methods often output only a single reward score, offering no explanation of the risk type and little decision-making support [1].

Limitations of Existing Methods - Rule-based methods depend heavily on external world models and are sensitive to perception errors, making them hard to generalize to complex real-world scenarios such as extreme weather conditions [2].

Core Innovative Solutions
- **DriveMRP-10K**: A synthetic high-risk motion dataset of 10,000 scenarios, generated through a "human-in-the-loop" mechanism, that enhances VLM motion risk prediction [4].
- **DriveMRP-Agent**: A VLM framework that improves risk reasoning using inputs such as the BEV layout and scene images [5].
- **DriveMRP-Metric**: Evaluation metrics that assess model performance via high-risk trajectory synthesis and automatic labeling of motion attributes [5].

Performance Improvement - On DriveMRP-10K, DriveMRP-Agent reached a scene understanding score (ROUGE-1-F1) of 69.08 and a motion risk prediction accuracy of 88.03%, significantly surpassing other VLMs; accident identification accuracy rose from 27.13% to 88.03% (a reference sketch of the ROUGE-1-F1 computation follows below) [7][8].

Dataset Effectiveness - DriveMRP-10K substantially improves the performance of various general-purpose VLMs, demonstrating its "plug-and-play" enhancement capability [10].

Key Component Ablation Experiments - Including global context produced large gains in both scene understanding and risk prediction metrics, highlighting the importance of global information for reasoning [12].
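Since scene understanding is reported as ROUGE-1-F1, a minimal sketch of that metric may help: it is the F1 score over unigram overlap between a generated scene description and a reference annotation. The whitespace tokenization and lowercasing here are assumptions, not DriveMRP's exact evaluation code.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap between a generated scene
    description and the reference annotation."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("ego vehicle merges left near a truck",
                "the ego vehicle merges left beside a truck"))  # ~0.80
```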
80,000 Clips! Tsinghua Open-Sources a VLA Dataset for Extreme Autonomous Driving Scenarios, with a 35% Safety Improvement
自动驾驶之心· 2025-07-22 12:46
Core Viewpoint - The article discusses the development of the Impromptu VLA dataset, built to address data scarcity in unstructured driving environments for autonomous driving systems and to enhance the performance of vision-language-action models in complex scenarios [4][29].

Dataset Overview
- The Impromptu VLA dataset comprises over 80,000 meticulously constructed video clips, distilled from more than 2 million source clips across eight diverse open-source datasets [5][29].
- The dataset targets four key unstructured challenges: roads with ambiguous boundaries, temporary traffic rule changes, unconventional dynamic obstacles, and complex road conditions [12][13].

Methodology
- Construction followed a multi-step pipeline of data collection, scene classification, and multi-task annotation generation, using advanced vision-language models (VLMs) for scene understanding [10][17].
- A rigorous manual verification process ensured high-quality annotations, with high F1 scores across categories confirming the reliability of the VLM-based annotation process [18].

Experimental Validation
- Comprehensive experiments validated the dataset's effectiveness, showing significant gains on mainstream autonomous driving benchmarks: the average score on the closed-loop NeuroNCAP test rose from 1.77 to 2.15, and the collision rate fell from 72.5% to 65.5% [6][21].
- In open-loop trajectory prediction, models trained with Impromptu VLA reached L2 errors as low as 0.30 meters, competitive with leading methods that rely on larger proprietary datasets (see the metric sketch after this summary) [24].

Conclusion - Impromptu VLA serves as a critical resource for developing more robust and adaptive autonomous driving systems capable of handling complex real-world scenarios, with demonstrated value for perception, prediction, and planning in unstructured environments [29].
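For context on the 0.30-meter figure: open-loop benchmarks typically report L2 error as the mean Euclidean distance between predicted and ground-truth waypoints in bird's-eye-view coordinates. A minimal sketch, assuming a (T, 2) waypoint layout and simple averaging; the benchmark's exact horizon and averaging protocol may differ.

```python
import numpy as np

def average_l2_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean L2 distance (meters) between predicted and ground-truth
    trajectory waypoints, each of shape (T, 2) in BEV coordinates."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

pred = np.array([[1.0, 0.1], [2.1, 0.3], [3.0, 0.6]])
gt   = np.array([[1.0, 0.0], [2.0, 0.2], [3.1, 0.5]])
print(f"avg L2: {average_l2_error(pred, gt):.2f} m")
```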
Let's Talk Closed-Loop Simulation for Autonomous Driving and 3DGS!
自动驾驶之心· 2025-07-22 12:46
Core Viewpoint - The article discusses the design and implementation of the Street Gaussians algorithm, which models dynamic street scenes efficiently for autonomous driving simulation, overcoming earlier limits on training and rendering speed [2][3].

Group 1: Background and Challenges
- Previous methods suffered from slow training and rendering as well as inaccurate vehicle pose tracking [3].
- Street Gaussians represents a dynamic urban street scene as a combination of a point-based background and foreground objects, using optimized vehicle tracking poses [3][4].

Group 2: Technical Implementation
- The background model is a set of points in world coordinates, each carrying a 3D Gaussian that describes geometry and color, with parameters including a covariance matrix and a position vector (see the sketch after this summary) [8].
- The object model for moving vehicles consists of optimizable tracking poses plus point clouds whose Gaussian attributes mirror the background model's but are defined in local object coordinates [11].

Group 3: Innovations in Appearance Modeling
- A 4D spherical harmonics model encodes temporal variation into the appearance of moving vehicles, reducing storage costs compared with traditional methods [12].
- Experiments demonstrate the 4D spherical harmonics model's effectiveness, showing significant improvements in rendering quality and fewer artifacts [16].

Group 4: Initialization Techniques
- Street Gaussians initializes from aggregated LiDAR point clouds, sidestepping the limitations of traditional SfM point clouds in urban environments [17].

Group 5: Course and Learning Opportunities
- The article promotes a specialized course on 3D Gaussian Splatting (3DGS), covering its subfields and practical applications in autonomous driving, aimed at building understanding and implementation skills [26][30].
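To make the background model concrete, here is a minimal sketch of the per-point attributes described above, together with the standard 3DGS covariance factorization Sigma = R S S^T R^T. Field names and shapes are illustrative; the actual implementation stores these as packed, optimizable tensors.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StreetGaussian:
    """One point of the background model: a 3D Gaussian in world
    coordinates with geometry and view-dependent color."""
    position: np.ndarray   # (3,) mean of the Gaussian
    rotation: np.ndarray   # (4,) unit quaternion (w, x, y, z)
    scale: np.ndarray      # (3,) per-axis extent
    opacity: float         # alpha used during splatting
    sh_coeffs: np.ndarray  # (K, 3) spherical-harmonic color coefficients

    def covariance(self) -> np.ndarray:
        """Sigma = R S S^T R^T, the standard 3DGS factorization."""
        w, x, y, z = self.rotation  # assumed normalized
        R = np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T

g = StreetGaussian(np.zeros(3), np.array([1.0, 0, 0, 0]),
                   np.array([0.5, 0.2, 0.2]), 0.9, np.zeros((16, 3)))
print(g.covariance().round(3))
```

The object model for vehicles would carry the same attributes in local coordinates, plus a per-frame tracked pose; the 4D spherical harmonics extension adds a time dimension to the color coefficients.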
A Missed Detection on the Road, and Auto-Labeling Takes the Blame...
自动驾驶之心· 2025-07-22 07:28
Core Viewpoint - The article discusses the challenges and methodologies of automatically labeling training data for occupancy networks (OCC) in autonomous driving, emphasizing that high-quality data is essential for model generalization and safety [2][10].

Group 1: OCC and Its Importance
- An occupancy network partitions space into small grids and predicts per-cell occupancy, handling irregular obstacles such as fallen trees and other background elements (a voxelization sketch follows after this summary) [3][4].
- Since Tesla announced OCC in 2022, it has become standard in vision-only autonomous driving solutions, creating heavy demand for labeled training data [2][4].

Group 2: Challenges in Automated Labeling - The main challenges of 4D automated labeling are:
1. High temporal and spatial consistency requirements when tracking dynamic objects across frames [9].
2. Complexity of fusing multi-modal data from multiple sensors [9].
3. Difficulty generalizing to dynamic scenes, since traffic participants behave unpredictably [9].
4. The tension between labeling efficiency and cost, as high precision still requires manual verification [9].
5. Demanding generalization requirements in production scenarios, which call for data drawn from diverse environments [9].

Group 3: Training Data Generation Process - A common pipeline for generating OCC training ground truth involves:
1. Enforcing consistency between 2D and 3D object detection [8].
2. Cross-checking against edge models [8].
3. Manual labeling for quality control [8].

Group 4: Course Offerings
- The article promotes a course on 4D automated labeling covering the full pipeline and core algorithms, aimed at learners interested in the autonomous driving data loop [10][26].
- The course includes practical exercises, addresses real-world challenges in the field, and builds algorithmic capability [10][26].

Group 5: Course Structure - The course is organized into chapters:
1. Fundamentals of 4D automated labeling [11].
2. Dynamic obstacle labeling [13].
3. LiDAR and visual SLAM reconstruction [14].
4. Static element labeling based on reconstruction [16].
5. General obstacle OCC labeling [18].
6. End-to-end ground truth labeling [19].
7. Data loop topics, covering industry pain points and interview preparation [21].
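As a concrete picture of what an OCC ground-truth label is, here is a minimal sketch that voxelizes a LiDAR point cloud into a binary occupancy grid. The grid range and voxel size are illustrative assumptions; production pipelines additionally fuse multiple frames, separate dynamic objects, and assign semantic labels per voxel.

```python
import numpy as np

def voxelize_occupancy(points: np.ndarray,
                       grid_min=(-40.0, -40.0, -1.0),
                       grid_max=(40.0, 40.0, 5.0),
                       voxel_size=0.4) -> np.ndarray:
    """Mark each voxel containing at least one LiDAR point as occupied.
    points: (N, 3) array of x, y, z in the ego frame."""
    gmin = np.asarray(grid_min)
    gmax = np.asarray(grid_max)
    dims = np.ceil((gmax - gmin) / voxel_size).astype(int)
    grid = np.zeros(dims, dtype=bool)
    idx = np.floor((points - gmin) / voxel_size).astype(int)
    valid = np.all((idx >= 0) & (idx < dims), axis=1)  # drop out-of-range points
    grid[tuple(idx[valid].T)] = True
    return grid

cloud = np.random.uniform(-45, 45, size=(10_000, 3))
occ = voxelize_occupancy(cloud)
print(occ.shape, occ.sum(), "occupied voxels")
```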
The Autonomous Driving Heart Third-Anniversary Sale Is About to End, Last Day...
自动驾驶之心· 2025-07-22 07:28
Core Viewpoint - The article emphasizes long-term value creation over short-term economic returns in entrepreneurship and innovation across the AI education and autonomous driving sectors [2][4][8].

Group 1: Company Progress and Developments
- In its third year, the company established four key intellectual properties (IPs): Autonomous Driving Heart, Embodied Intelligence Heart, 3D Vision Heart, and Large Model Heart, serving both industry and academia [1].
- The business model has expanded from pure online education to a comprehensive service platform spanning hardware, offline training, and job placement services [1].
- A new offline office opened in Hangzhou, and several talented people joined the team to support these initiatives [1].

Group 2: Industry Insights and Challenges
- The article discusses the pitfalls facing entrepreneurs who prioritize short-term gains, arguing for a deeper commitment to long-term value creation [2][4].
- In the competitive landscape, many AI education companies imitate rather than innovate, which undermines sustainable growth [7][8].
- Understanding market needs and addressing pain points through thorough research is emphasized as a critical success factor [4].

Group 3: Future Plans and Goals
- The company plans to transition from a purely educational entity into a technology company, with continued R&D investment across multiple fields [9].
- The goal is to make AI education accessible to every student who needs it, making AI easier to learn and apply [10].
- The company aims to finish building out its full system by the second half of 2025 and enter a phase of stable operations [9].
The Core Data Algorithm Lead at a Leading Autonomous Driving Company Has Recently Departed...
自动驾驶之心· 2025-07-22 02:18
According to reports, the intelligent driving team at a leading autonomous driving company has recently undergone a significant organizational restructuring.

The original core data algorithm lead has formally departed. This lead had driven the delivery of ten million clips of data and built the cloud-side closed-loop data pipelines behind nearly two generations of mass-production and end-to-end solutions. At the same time, the company has brought in several top industry experts to further strengthen the team's technical capabilities, signaling its determination to invest in intelligent driving "at any cost."

Data-driven leaps in experience: On the data side, the company's recently delivered 10-million-clip end-to-end driver-assistance system has become an industry benchmark. Trained on massive volumes of high-quality data, it markedly improves the real-world driving experience. The company safeguards data quality through three core technologies:

Hardware and software in tandem, extending the technical lead: On hardware, the company's latest models come standard with a top-tier perception suite, including a high-performance compute chip, LiDAR, and multiple high-definition cameras, providing ample compute headroom. On software, the 10-million-clip system is being rolled out to users via OTA and will continue to iterate with large-model capabilities.

Long-term investment anchored on the future of intelligent driving: The company has announced a very large first-phase investment in intelligent driving, with a dedicated team of over a thousand people and several hundred test vehicles. It is also partnering with top AI labs and universities on frontier research to further accelerate breakthroughs.

Longitudinal comfort ...
An Analysis of 102 VLA Models, 26 Datasets, and 12 Simulation Platforms
自动驾驶之心· 2025-07-22 02:18
Core Viewpoint - The article surveys the transformative breakthrough of vision-language-action (VLA) models in robotics, which integrate visual perception, natural language understanding, and embodied control within a unified learning framework. It reviews 102 VLA models, 26 foundational datasets, and 12 simulation platforms, identifying current challenges and future directions for enhancing robotic autonomy and adaptability [3][4][6].

Group 1: VLA Models and Framework
- VLA models represent a new frontier in robotic intelligence, enabling robots to perceive visual environments, understand natural language commands, and execute meaningful actions, bridging the semantic gap between modalities [7][9].
- A typical VLA architecture feeds visual, language, and proprioceptive encoders into a diffusion backbone network that generates control commands, enabling end-to-end processing of multimodal inputs (see the sketch after this summary) [11][12].
- Effective VLA models depend on large-scale, diverse multimodal datasets and realistic simulation platforms, which are crucial for training models to robustly follow language instructions and perceive visual environments [5][30].

Group 2: Datasets and Evaluation
- Early VLA datasets focused on discrete decision-making in constrained environments; recent datasets add richer sensory streams and longer task horizons, meeting the demands of complex multimodal control [21][22][29].
- The survey proposes a comprehensive benchmarking strategy that rates datasets by task complexity and modality richness, and calls for new datasets that combine high task difficulty with extensive multimodal inputs [24][28].
- The analysis exposes a gap in current VLA benchmarks: few combine long-horizon, multi-skill control with diverse multimodal integration, pointing to a promising direction for future dataset development [29][43].

Group 3: Simulation Tools
- Simulation environments are critical to VLA research, producing large-scale, repeatable, richly annotated data beyond what the physical world permits [30][31].
- Advanced platforms such as AI2-THOR and NVIDIA Isaac Sim provide high-fidelity physics and customizable multimodal sensors, essential for developing robust VLA models [32][33].
- Coupling simulation tools with VLA datasets accelerates the joint development of control algorithms and benchmark datasets, ensuring advances in multimodal perception are properly evaluated before deployment on real robotic platforms [30][33].

Group 4: Applications and Challenges
- The survey groups VLA models into six broad application areas, including manipulation and task generalization, autonomous mobility, human assistance, and interaction, showcasing their versatility across robotic tasks [34][35].
- Key architectural challenges include tokenization and vocabulary alignment, modality fusion, and cross-embodiment generalization, all of which must be addressed to improve performance and adaptability [39][40][41].
- Data challenges include task diversity, modality imbalance, annotation quality, and the realism-versus-scale trade-off, which hinder the development of robust general-purpose VLA models [42][43].
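The encoder-fusion-plus-diffusion-backbone pattern described in Group 1 can be sketched in a few lines. This is a toy sketch with stand-in encoders and dimensions, assuming precomputed image and text features; real VLA systems use pretrained ViT/LLM encoders, a proper noise schedule, and iterative denoising at inference, none of which are shown here.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Fuses image, language, and proprioception tokens, then predicts the
    noise on an action sequence: the encoder-to-diffusion-backbone pattern."""
    def __init__(self, d=256, act_dim=7, horizon=8):
        super().__init__()
        self.img_enc = nn.Linear(512, d)      # stand-in for a ViT feature
        self.txt_enc = nn.Linear(384, d)      # stand-in for a text embedding
        self.prop_enc = nn.Linear(14, d)      # joint positions/velocities
        self.t_embed = nn.Embedding(1000, d)  # diffusion timestep embedding
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True),
            num_layers=2)
        self.act_in = nn.Linear(act_dim, d)
        self.head = nn.Linear(d, act_dim)
        self.horizon = horizon

    def forward(self, img_feat, txt_feat, prop, noisy_actions, t):
        # Context tokens from the three modality encoders.
        ctx = torch.stack([self.img_enc(img_feat),
                           self.txt_enc(txt_feat),
                           self.prop_enc(prop)], dim=1)       # (B, 3, d)
        # Noisy action tokens conditioned on the diffusion timestep.
        act_tok = self.act_in(noisy_actions) + self.t_embed(t)[:, None, :]
        tokens = torch.cat([ctx, act_tok], dim=1)             # (B, 3+T, d)
        out = self.backbone(tokens)[:, -self.horizon:, :]
        return self.head(out)  # predicted noise over the action sequence

policy = ToyVLAPolicy()
eps = policy(torch.randn(2, 512), torch.randn(2, 384), torch.randn(2, 14),
             torch.randn(2, 8, 7), torch.randint(0, 1000, (2,)))
print(eps.shape)  # torch.Size([2, 8, 7])
```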
ByteDance's 2026 Campus Recruitment Is Here! Many Openings in Large-Model Algorithms, Multimodal, and CV Roles
自动驾驶之心· 2025-07-22 01:47
Core Viewpoint - ByteDance has opened its campus recruitment programs, including the Jindouyun Talent Program and the Top Seed Program, which target different groups of doctoral students with different focuses and levels of application difficulty [1].

Group 1: Jindouyun Talent Program
- The program targets doctoral students graduating between September 2022 and August 2026 for full-time positions, and those graduating in September 2025 or later for internships [2].
- Recruitment restrictions on past graduates have been relaxed, allowing those who graduated as early as 2022 to apply [2].
- It covers eight major fields, including large-model applications, search/recommendation/advertising, computer architecture, AI safety, hardware, AI coding, video architecture, and AIGC, balancing academic research with industrial application and supporting paper publication [2].

Group 2: Top Seed Program
- The program primarily targets doctoral students graduating in 2026 and is also open to research interns [3].
- It focuses on core large-model technologies such as large language models (LLMs), multimodal generation and understanding, machine learning algorithms, and speech [3].
- Its goal is to cultivate more top-tier talent, offering high compensation and compute support [3].

Group 3: Community and Resources
- The AutoRobo Knowledge Community serves job seekers in autonomous driving, embodied intelligence, and large models, currently with nearly 1,000 members from various companies [6][8].
- The community provides interview questions, industry reports, salary negotiation tips, and internal referrals [8][9].
- It also compiles a hundred interview questions on autonomous driving and embodied intelligence, covering a range of technical topics [12][13][17].

Group 4: Industry Reports and Insights
- The community offers in-depth industry reports on the current state, development trends, and market opportunities in fields including robotics and embodied intelligence [18].
- Reports include the world robotics report, investment reports in embodied intelligence, and the development of humanoid robots [18].

Group 5: Interview Experiences and Tips
- Members share successful and unsuccessful interview experiences across companies and positions, providing insight into the interview process [20].
- Common interview questions and the skills required for algorithm roles in the autonomous driving sector are also compiled [25].
Why I Don't Recommend Graduate Students Pursue Reinforcement Learning Research
自动驾驶之心· 2025-07-21 11:18
Original link: https://www.zhihu.com/question/1900927726795334198

Preface

I haven't answered academic questions in a long while, but half of the grant applications I have reviewed recently are related to reinforcement learning, so Zhihu keeps pushing all sorts of RL content at me... so let me talk briefly about reinforcement learning.

If you only study as far as a master's degree, even at Tsinghua or Peking University, the most important fundamental skill in RL is calling libraries: figure out which package to call and when, and that is enough. Beyond that, it is about how to mix and match components and how to shrink the solution space; for some algorithms, a basic procedural understanding is all you need.

If you are doing a PhD, I suggest switching directions. Carving fine detail into today's reinforcement learning is, in my view, a waste of time and of your life. Of course, if your goal is to publish a pile of papers and land a faculty position, that works, but you may go a very long time without producing genuinely good work, and if you are just making a living, that is not the point anyway.

My feeling about reinforcement learning is that it is ancient and primitive; it feels as if I were still holding a ...