自动驾驶之心
Our lab PI wants to build an autonomous driving cart, and we have no idea where to start...
自动驾驶之心· 2025-06-22 14:09
A lightweight, teaching-and-research solution from the 自动驾驶之心 team, supporting perception, localization, fusion, navigation, planning, and other functional platforms on an Ackermann chassis. Big news: the pre-sale is here. The Black Warrior (黑武士) series 001, a full-stack autonomous driving cart for research and teaching, is now officially on sale. The world is too dull; come build something interesting with us. Original price: 36,999 RMB. Order now to receive 3 free courses (model deployment + point-cloud 3D detection + multi-sensor fusion), with assembly and shipping prioritized for early orders. Orders are fully booked for the next two months and units are being assembled and tuned continuously; orders of 5 units or more qualify for a discount, and bulk purchases by universities and research institutes are welcome. Interested readers should order early.

1) Black Warrior 001
The Black Warrior supports secondary development and modification, with many reserved mounting positions and interfaces for adding cameras, millimeter-wave radar, and other sensors.

2) Demonstrations
We tested perception, localization, fusion, and navigation/planning in indoor, outdoor, and underground-garage scenarios.

Overall feature overview:
- Undergraduate study, skill-building, and competitions √
- Graduate research and paper publication √
- Graduate job hunting and projects √
- University laboratory teaching equipment √
- Training-company / vocational-college teaching equipment √

Demonstrated scenarios: outdoor park driving; point-cloud 3D object detection; indoor-garage 2D LiDAR mapping; indoor-garage 3D LiDAR mapping; uphill/downhill tests; large-scale outdoor 3D mapping; outdoor night driving.

3) Hardware
| Main sens ...

6) Software
Software and language frameworks: ROS, C++, Python. One-click startup is supported, and a development environment is provided.
Bringing end-to-end VLA to production in autonomous driving: how should the algorithm be designed?
自动驾驶之心· 2025-06-22 14:09
Core Insights
- The article discusses the rapid advancements in end-to-end autonomous driving, particularly focusing on Vision-Language-Action (VLA) models and their applications in the industry [2][3].

Group 1: VLA Model Developments
- The introduction of AutoVLA, a new VLA model that integrates reasoning and action generation for end-to-end autonomous driving, shows promising results in semantic reasoning and trajectory planning [3][4].
- ReCogDrive, another VLA model, addresses performance issues in rare and long-tail scenarios by utilizing a three-stage training framework that combines visual language models with diffusion planners [7][9].
- Impromptu VLA introduces a dataset aimed at improving VLA models' performance in unstructured extreme conditions, demonstrating significant performance improvements on established benchmarks [14][24].

Group 2: Experimental Results
- AutoVLA achieved competitive performance metrics in various scenarios, with the best-of-N method reaching a PDMS score of 92.12, indicating its effectiveness in planning and execution (best-of-N selection is sketched after this summary) [5].
- ReCogDrive set a new state-of-the-art PDMS score of 89.6 on the NAVSIM benchmark, showcasing its robustness and safety in driving trajectories [9][10].
- The OpenDriveVLA model demonstrated superior results in open-loop trajectory planning and driving-related question-answering tasks, outperforming previous methods on the nuScenes dataset [28][32].

Group 3: Industry Trends
- The article highlights a trend among major automotive manufacturers, such as Li Auto, Xiaomi, and XPeng, to invest heavily in VLA model research and development, indicating a competitive landscape in autonomous driving technology [2][3].
- The integration of large language models (LLMs) with VLA frameworks is becoming a focal point for enhancing decision-making capabilities in autonomous vehicles, as seen in models like ORION and VLM-RL [33][39].
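Best-of-N here means sampling N candidate trajectories and keeping the one a scorer rates highest. A minimal sketch under that reading, where `sample_trajectory` and `score_pdms` are hypothetical stand-ins for the policy sampler and a PDMS-style driving-quality scorer (this is not AutoVLA's actual code):

```python
import numpy as np

def best_of_n(sample_trajectory, score_pdms, n=16):
    """Sample n candidate trajectories and return the highest-scoring one.

    sample_trajectory: callable returning one candidate trajectory
        (e.g., an array of future waypoints) from the policy.
    score_pdms: callable mapping a trajectory to a scalar quality score,
        standing in for a PDMS-style driving-quality metric.
    """
    candidates = [sample_trajectory() for _ in range(n)]
    scores = np.array([score_pdms(traj) for traj in candidates])
    return candidates[int(np.argmax(scores))]

# Toy usage with stand-ins for the policy and the scorer:
rng = np.random.default_rng(0)
sample = lambda: rng.normal(size=(8, 2))          # 8 future (x, y) waypoints
score = lambda traj: -np.abs(traj[:, 1]).sum()    # toy: prefer staying near lane center
best = best_of_n(sample, score, n=16)
```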
100+ autonomous driving datasets: these 5 you really should know
自动驾驶之心· 2025-06-22 01:35
Core Viewpoint
- The article emphasizes the growing importance of autonomous driving technology and highlights the availability of over 100 high-quality datasets for developers and researchers in the field. It introduces five key datasets that cover tasks ranging from perception to visual odometry, providing valuable resources for both beginners and experienced engineers [2].

Dataset Summaries
1. KITTI Dataset
- The KITTI dataset is one of the most classic and widely used benchmark datasets in the autonomous driving field. It was collected in Karlsruhe, Germany, using high-precision sensors such as stereo color/grayscale cameras, Velodyne 3D LiDAR, and GPS/IMU. The dataset includes annotations for various perception tasks, including stereo vision, optical flow, visual odometry, and 3D object detection and tracking, making it a standard for evaluating vehicle vision algorithms [3].
2. nuScenes Dataset
- nuScenes is a large-scale multi-sensor dataset released by Motional, covering 1,000 continuous driving scenes in Boston and Singapore, totaling approximately 15 hours of data. It includes a full sensor suite: six cameras, five millimeter-wave radars, one top-mounted LiDAR, and IMU/GPS. The dataset provides around 1.4 million high-resolution camera images and 390,000 LiDAR scans, annotated with 3D bounding boxes for 23 object categories, making it well suited to research on complex urban road scenarios (a devkit loading sketch follows this list) [5][7].
3. Waymo Open Dataset
- The Waymo Open Dataset, released by Waymo, is one of the largest open data resources for autonomous driving. It consists of two main parts: a perception dataset with 2,030 scenes of high-resolution camera and LiDAR data, and a motion dataset with 103,354 vehicle trajectories and corresponding 3D map information. This extensive multi-sensor dataset covers various times of day, weather conditions, and urban environments, serving as a benchmark for object detection, tracking, and trajectory prediction research [10][12].
4. PathTrack Dataset
- PathTrack is a dataset focused on person tracking, containing over 15,000 trajectories across 720 sequences. It uses a re-trained person-matching network that significantly reduces the classification error rate. The dataset is suitable for 2D/3D object detection, tracking, and trajectory prediction tasks [13][14][15].
5. ApolloScape Dataset
- ApolloScape, released by Baidu Apollo, is a massive autonomous driving dataset characterized by its large volume and high annotation accuracy. It reportedly exceeds similar datasets in size by more than ten times, containing hundreds of thousands of high-resolution images with pixel-level semantic segmentation annotations. ApolloScape defines 26 semantic categories and includes complex road scenarios, making it applicable to perception, map construction, and simulation training [17][19].
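As a concrete starting point, nuScenes ships an official devkit (`pip install nuscenes-devkit`) for browsing scenes and annotations. A minimal sketch, assuming the v1.0-mini split has been downloaded to `/data/nuscenes` (the path is an assumption):

```python
from nuscenes.nuscenes import NuScenes

# Load the mini split; dataroot is an assumed local path.
nusc = NuScenes(version='v1.0-mini', dataroot='/data/nuscenes', verbose=True)

# Walk the first scene: each scene is a ~20 s sequence of keyframe samples.
scene = nusc.scene[0]
sample_token = scene['first_sample_token']
while sample_token:
    sample = nusc.get('sample', sample_token)
    # sample['data'] maps sensor channels (e.g. 'CAM_FRONT', 'LIDAR_TOP')
    # to sensor-data tokens; sample['anns'] lists 3D box annotation tokens.
    print(len(sample['anns']), 'annotated boxes at t =', sample['timestamp'])
    sample_token = sample['next']  # empty string at the end of the scene
```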
π0/π0.5/A0, the talk of the tech community, finally explained! A full breakdown of functions, scenarios, and methodology
自动驾驶之心· 2025-06-22 01:35
Core Insights
- The article discusses the π0, π0.5, and A0 models, focusing on their architectures, advantages, and functionalities in robotic control and task execution [3][12][21].

π0 Model Structure
- The π0 model is based on a pre-trained Vision-Language Model (VLM) and Flow Matching, trained on data spanning seven types of robots and more than 68 tasks, with over 10,000 hours of data (a flow-matching sampling sketch follows this summary) [3].
- It combines a VLM backbone, an Action Expert, and Cross-Embodiment Training to handle different robot action spaces [3].

π0 Advantages and Functions
- The model can execute tasks directly from language prompts without additional fine-tuning, achieving 20%-30% higher task-execution accuracy than baseline models [4][6].
- It supports complex task decomposition and high-frequency precise operations, generating continuous actions at control frequencies of up to 50 Hz [4][6].

π0.5 Model Structure
- The π0.5 model employs a two-stage training framework and a hierarchical architecture to learn from diverse data sources and generalize to new environments [7][9].
- It integrates a Vision-Language-Action (VLA) model that encodes multi-modal inputs into a unified sequence for decision-making [9].

π0.5 Advantages and Functions
- The π0.5 model shows a 25%-40% higher task success rate than π0, with a threefold training-speed improvement from mixed discrete-continuous action training [12][13].
- It handles long-duration tasks effectively and demonstrates zero-shot semantic understanding, allowing it to recognize and act on previously unseen objects [13][16].

A0 Model Structure
- The A0 model features a layered architecture centered on affordance understanding and action execution, using a diffusion model to predict contact points and trajectories [21][25].
- It integrates multi-source data into a unified affordance representation, enhancing its ability to perform complex tasks [26].

A0 Advantages and Functions
- The A0 model exhibits cross-platform generalization, allowing deployment across various robotic platforms with high spatial-reasoning efficiency [26][27].
- It achieves an average task success rate of 62.5%, with specific tasks like drawer opening reaching 75% [27].
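π0's action generation is built on flow matching: a learned velocity field transports a noise sample into an action chunk over a short integration. A minimal sketch of the sampling side under that assumption; `velocity_net` and its conditioning are hypothetical stand-ins, not Physical Intelligence's actual code:

```python
import numpy as np

def sample_actions(velocity_net, obs_embedding, horizon=50, action_dim=7, steps=10):
    """Flow-matching sampling: integrate a learned velocity field from
    Gaussian noise to an action chunk (Euler steps over tau in [0, 1]).

    velocity_net(a, tau, obs) -> velocity with the same shape as a;
    a hypothetical stand-in for the action expert's denoising network.
    """
    a = np.random.randn(horizon, action_dim)  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        tau = i * dt
        a = a + dt * velocity_net(a, tau, obs_embedding)
    return a  # e.g. 50 action steps, enough for 1 s at 50 Hz control

# Toy velocity field that just pulls toward zero, to show the call shape:
toy_net = lambda a, tau, obs: -a
actions = sample_actions(toy_net, obs_embedding=None)
```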
Li Auto's latest DriveAction: a benchmark exploring human-like driving decisions in VLA models
自动驾驶之心· 2025-06-21 13:15
Core Insights
- The article introduces the DriveAction benchmark, designed specifically for Vision-Language-Action (VLA) models in autonomous driving, addressing limitations in current datasets and evaluation frameworks [2][3][20].

Group 1: Research Background and Issues
- The development of VLA models presents new opportunities for autonomous driving systems, but current benchmark datasets lack scenario diversity, reliable action-level annotations, and evaluation protocols aligned with human preferences [2].
- Existing benchmarks rely primarily on open-source data, which limits their coverage of complex real-world driving scenarios and creates a disconnect between evaluation results and actual deployment risks [3].

Group 2: DriveAction Benchmark Innovations
- DriveAction is the first action-driven benchmark designed specifically for VLA models, featuring three core innovations:
1. Comprehensive coverage of diverse driving scenarios, sourced from real-world data collected by production autonomous vehicles across 148 cities in China [5].
2. Realistic action annotations derived from users' real-time driving operations, ensuring accurate capture of driver intentions [6].
3. A tree-structured, action-driven evaluation framework that integrates visual and language tasks to assess model decision-making in realistic contexts [7].

Group 3: Evaluation Results
- Experimental results indicate that models perform best in the full-pipeline mode (V-L-A) and worst in the no-information mode (A), with average accuracy dropping by 3.3% without visual input and 4.1% without language input (a small ablation harness is sketched after this summary) [14].
- Task-level evaluations reveal that models excel at dynamic- and static-obstacle tasks but struggle with navigation and traffic-light tasks, highlighting areas for improvement [16][17].

Group 4: Significance and Value of DriveAction
- The DriveAction benchmark marks a significant advance in the evaluation of autonomous driving systems, providing a more comprehensive and realistic assessment tool that can help identify model bottlenecks and guide system optimization [20].
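The mode comparison amounts to running the same benchmark with inputs ablated. A minimal harness sketch under that reading; the model interface, mode names, and data format are all hypothetical stand-ins for DriveAction's actual tree-structured protocol:

```python
def evaluate_modes(model, benchmark):
    """Compare full V-L-A inference against ablated input modes.

    model.predict(sample, use_vision, use_language) -> predicted action;
    benchmark: list of samples, each carrying a ground-truth 'action'.
    """
    modes = {
        'V-L-A': dict(use_vision=True, use_language=True),
        'L-A (no vision)': dict(use_vision=False, use_language=True),
        'V-A (no language)': dict(use_vision=True, use_language=False),
        'A (no information)': dict(use_vision=False, use_language=False),
    }
    results = {}
    for name, flags in modes.items():
        correct = sum(model.predict(s, **flags) == s['action'] for s in benchmark)
        results[name] = correct / len(benchmark)
    return results

# Toy stand-ins so the harness runs end to end:
class ToyModel:
    def predict(self, sample, use_vision, use_language):
        # Pretend accuracy degrades as inputs are removed.
        return sample['action'] if (use_vision and use_language) else 'straight'

benchmark = [{'action': a} for a in ['left', 'straight', 'right', 'straight']]
print(evaluate_modes(ToyModel(), benchmark))
```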
MiniMax-M1: surpassing DeepSeek, with support for million-token contexts
自动驾驶之心· 2025-06-21 13:15
The following article is from the AIGC面面观 account, by 欠阿贝尔两块钱. (AIGC面面观 compiles introductory notes on LLMs and AIGC | paper-reading notes | interview experience from top companies | explorations of AIGC deployment.)

Author: 欠阿贝尔两块钱 | Source: AIGC面面观

Main contributions
1. Efficient hybrid architecture design: MiniMax-M1 combines an MoE architecture with Lightning Attention, supports a million-token context window (1M tokens), and at a generation length of 80K tokens uses only 25% of the FLOPs of a traditional-attention model.
2. CISPO, an algorithm surpassing DAPO: it improves RL efficiency by clipping importance-sampling weights, achieves a 2x speedup over DAPO, and avoids the way traditional methods (e.g., PPO/GRPO) clip away updates for low-probability tokens (see the sketch below).
3. Scalable context: supports extending the generation length from 40K to 80K tokens.

1. Hybrid attention architecture
Lightning Attention: adopts I/O-aware linear attention computation which, through blockwise computation and memory optimization, ...
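A minimal sketch of the clipping idea in a simplified policy-gradient setting. This illustrates clipped importance-sampling weights only, not MiniMax's actual implementation; the clip bound `eps_high` and tensor shapes are assumptions:

```python
import torch

def cispo_style_loss(logp_new, logp_old, advantages, eps_high=2.0):
    """Policy-gradient loss with clipped importance-sampling weights.

    Unlike PPO-style token clipping, the IS weight itself is clipped and
    treated as a constant (detached), so every token, including
    low-probability ones, keeps a gradient through logp_new.

    logp_new, logp_old: per-token log-probs under current / behavior policy.
    advantages: per-token advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)
    # Clip the IS weight from above and stop its gradient.
    weight = torch.clamp(ratio, max=eps_high).detach()
    # REINFORCE-style surrogate: gradient flows via logp_new only.
    return -(weight * advantages * logp_new).mean()

# Toy usage with random tensors:
logp_new = torch.randn(32, requires_grad=True)
logp_old = torch.randn(32)
adv = torch.randn(32)
loss = cispo_style_loss(logp_new, logp_old, adv)
loss.backward()
```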
Mass-production projects stuck on scenario generalization, urgently in need of ten-million-scale auto-labeling?
自动驾驶之心· 2025-06-21 13:15
The development of intelligent-driving algorithms has entered deep water, and every player is investing heavily in mass-production deployment. One of the most critical pieces is how to complete 4D data annotation efficiently, whether for 3D dynamic objects, occupancy (OCC), or static elements.

Since end-to-end methods and large language models (LLMs) burst onto the scene, large-scale unsupervised pre-training plus fine-tuning on high-quality task-specific datasets may well become the next direction for mass-production perception algorithms. Joint annotation of data has likewise become a practical necessity for every team training models; the old paradigm of annotating each task separately no longer fits the needs of intelligent-driving algorithm development. Today, 自动驾驶之心 walks through the 4D data annotation pipeline.

Compared with on-vehicle perception algorithms, an auto-labeling system is more like a system composed of different modules; only by fully exploiting offline compute and temporal information can better perception results be obtained ...

The most complex part is the auto-labeling of dynamic obstacles, which involves four major modules:
- offline 3D object detection;
- offline tracking;
- post-processing optimization;
- sensor-occlusion optimization.

To push 3D detection performance as far as possible, the industry still relies mostly on point-cloud 3D object detection or LiDAR-vision (LV) fusion methods. Once offline single-frame 3D detection results are available, tracking is needed to link the results across frames, but tracking currently faces many practical problems of its own (a minimal linking sketch follows below).

How should a ten-million-scale 4D annotation solution be built?
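To make the "link multi-frame detections" step concrete, here is a minimal sketch of greedy offline association by center distance. Real pipelines add motion models, re-identification, and bidirectional offline smoothing; the threshold and box format here are assumptions:

```python
import numpy as np

def link_detections(frames, max_dist=2.0):
    """Greedy frame-to-frame association of 3D detections into tracks.

    frames: list over time; each item is an (N_t, 3) array of box centers.
    Returns a list of tracks, each a list of (frame_idx, center) pairs.
    Tracks with no match in a frame are terminated (no re-activation here).
    """
    tracks, active = [], []  # active: list of (track_id, last_center)
    for t, centers in enumerate(frames):
        unmatched = list(range(len(centers)))
        new_active = []
        for track_id, last in active:
            if not unmatched:
                break
            dists = [np.linalg.norm(centers[j] - last) for j in unmatched]
            j_best = int(np.argmin(dists))
            if dists[j_best] < max_dist:  # match: extend the track
                j = unmatched.pop(j_best)
                tracks[track_id].append((t, centers[j]))
                new_active.append((track_id, centers[j]))
        for j in unmatched:  # leftover detections start new tracks
            tracks.append([(t, centers[j])])
            new_active.append((len(tracks) - 1, centers[j]))
        active = new_active
    return tracks

# Toy usage: two objects moving right over three frames.
frames = [np.array([[0.0, 0, 0], [10.0, 0, 0]]),
          np.array([[0.5, 0, 0], [10.5, 0, 0]]),
          np.array([[1.0, 0, 0], [11.0, 0, 0]])]
print([len(t) for t in link_detections(frames)])  # -> [3, 3]
```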
The head of world models at SenseTime Jueying (绝影) has resigned...
自动驾驶之心· 2025-06-21 13:15
Core Viewpoint
- The article discusses the challenges and opportunities facing SenseTime's autonomous driving division, focusing on the competitive landscape and the importance of technological advancement in the industry.

Group 1: Company Developments
- The head of world-model development at SenseTime's autonomous driving division has left the company, raising concerns about the future of its cloud technology system and the R-UniAD generative driving solution [2][3].
- SenseTime's autonomous driving division has successfully delivered a mid-tier solution based on the J6M chip to GAC Trumpchi, but the mid-tier market is expected to undergo significant upgrades this year [4].

Group 2: Market Dynamics
- The mid-tier market will shift from highway NOA (Navigation on Autopilot) to full urban NOA, a major change in the competitive landscape [4].
- Leading companies are introducing lightweight urban NOA solutions derived from high-tier algorithms, targeting chips with around 100 TOPS of compute, and are already demonstrating them to OEM clients [4].

Group 3: High-Tier Strategy
- SenseTime's key focus this year is its one-stage end-to-end solution, which has shown impressive performance and is a requirement in OEMs' high-tier project tenders [5].
- A collaboration with Dongfeng Motor targets mass production and delivery of the UniAD one-stage end-to-end solution by Q4 2025, a critical opportunity for SenseTime to establish a foothold in the high-tier market [5][6].

Group 4: Competitive Landscape
- SenseTime's ability to deliver a benchmark project in the high-tier segment is crucial for gaining credibility with OEMs and securing additional projects [6][7].
- The current window of opportunity in the high-tier market is limited, as many vehicle models capable of supporting high-tier software and hardware costs are being released this year [6][8].
A comprehensive review of foundation models for autonomous driving (LLMs/VLMs/MLLMs/diffusion models/world models)
自动驾驶之心· 2025-06-21 11:18
Core Insights
- The article discusses the critical role of foundation models in generating and analyzing complex driving scenarios for autonomous vehicles, emphasizing their ability to synthesize diverse and realistic high-risk safety scenarios [2][4].

Group 1: Foundation Models in Autonomous Driving
- Foundation models enable the processing of heterogeneous inputs such as natural language, sensor data, and high-definition maps, facilitating the generation and analysis of complex driving scenarios [2].
- A unified classification system is proposed, covering various model types including Large Language Models (LLMs), Vision-Language Models (VLMs), Multimodal Large Language Models (MLLMs), Diffusion Models (DMs), and World Models (WMs) [2][4].

Group 2: Methodologies and Tools
- The article reviews methodologies, open-source datasets, simulation platforms, and benchmark testing challenges relevant to scenario generation and analysis [2].
- Specific evaluation metrics for assessing scenario generation and analysis are discussed, highlighting the need for dedicated assessment standards in this field [2].

Group 3: Current Challenges and Future Directions
- The article identifies open challenges and research questions in the field of scenario generation and analysis, suggesting areas for future research and development [2].
A diverse, large-scale dataset! SceneSplat++: the first comprehensive 3DGS-based benchmark
自动驾驶之心· 2025-06-20 14:06
Core Insights
- The article introduces SceneSplat-Bench, a comprehensive benchmark for evaluating visual-language scene understanding methods based on 3D Gaussian Splatting (3DGS) [11][30].
- It presents SceneSplat-49K, a large-scale dataset containing approximately 49,000 raw scenes and 46,000 filtered 3DGS scenes, the most extensive open-source dataset for complex and high-quality scene-level 3DGS reconstruction [9][30].
- The evaluation indicates that generalizable methods consistently outperform per-scene optimization methods, establishing a new paradigm for scalable scene understanding through pre-trained models [30].

Evaluation Protocols
- The benchmark evaluates methods based on two key metrics in 3D space: foreground mean Intersection over Union (f-mIoU) and foreground mean accuracy (f-mAcc), addressing object-size imbalance and reducing viewpoint dependency compared to 2D evaluations (a small f-mIoU sketch follows this summary) [22][30].
- The evaluation dataset includes ScanNet, ScanNet++, and Matterport3D for indoor scenes, and HoliCity for outdoor scenes, emphasizing the methods' capabilities across various object scales and complex environments [22][30].

Dataset Contributions
- SceneSplat-49K is compiled from multiple sources, including SceneSplat-7K, DL3DV-10K, HoliCity, and Aria Synthetic Environments, ensuring a diverse range of indoor and outdoor environments [9][10].
- Dataset preparation involved approximately 891 GPU-days and extensive human effort, highlighting the significant resources invested in creating a high-quality dataset [7][9].

Methodological Insights
- The article categorizes methods into three types: per-scene optimization methods, per-scene optimization-free methods, and generalizable methods, with SceneSplat representing the last [23][30].
- Generalizable methods eliminate the need for extensive single-scene computation during inference, allowing a 3D scene to be processed efficiently in a single forward pass [24][30].

Performance Results
- Results on SceneSplat-Bench demonstrate that SceneSplat excels in both performance and efficiency, often surpassing the pseudo-label methods used for its pre-training [24][30].
- The performance of the various methods varies significantly with dataset complexity, indicating the importance of challenging benchmarks in revealing the limitations of competing methods [28][30].
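For intuition on the 3D metric, foreground mean IoU averages per-class IoU over foreground classes only, computed on 3D points rather than 2D pixels. A minimal sketch under the usual definition; the label conventions below are assumptions, not the benchmark's actual code:

```python
import numpy as np

def foreground_miou(pred, gt, foreground_classes):
    """Foreground mean IoU over 3D points.

    pred, gt: (N,) integer class labels per Gaussian/point.
    foreground_classes: iterable of class ids to score (background excluded),
    which mitigates the object-size imbalance the benchmark targets.
    """
    ious = []
    for c in foreground_classes:
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:  # skip classes absent from both pred and gt
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy usage: 3 foreground classes over 1,000 points, class 0 = background.
rng = np.random.default_rng(0)
gt = rng.integers(0, 4, size=1000)
pred = np.where(rng.random(1000) < 0.8, gt, rng.integers(0, 4, size=1000))
print(foreground_miou(pred, gt, foreground_classes=[1, 2, 3]))
```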