Autonomous Driving VLA: OpenDriveVLA and AutoVLA
自动驾驶之心· 2025-08-18 01:32
Core Insights
- The article discusses two significant papers, OpenDriveVLA and AutoVLA, which apply large vision-language models (VLMs) to end-to-end autonomous driving, highlighting their distinct technical paths and philosophies [22].

Group 1: OpenDriveVLA
- OpenDriveVLA aims to close the "modal gap" that traditional VLMs face in dynamic 3D driving environments, emphasizing the need for a structured understanding of the 3D world [23].
- The methodology comprises several key steps: 3D visual environment perception, visual-language hierarchical alignment, and a multi-stage training paradigm [24][25].
- The model uses structured, layered tokens (Agent, Map, Scene) to strengthen the VLM's understanding of the environment, which helps mitigate spatial hallucination risks [6][9].
- OpenDriveVLA achieved state-of-the-art performance on the nuScenes open-loop planning benchmark, demonstrating the effectiveness of its perception-based anchoring strategy [10][20].

Group 2: AutoVLA
- AutoVLA focuses on integrating driving tasks into the native operation of VLMs, transforming them from scene narrators into genuine decision-makers [26].
- The methodology features layered visual token extraction, in which the model emits discrete action codes instead of continuous coordinates, converting trajectory planning into a next-token prediction task [14][29].
- The model employs a dual-mode thinking approach, adapting its reasoning depth to scene complexity and balancing efficiency against effectiveness [28].
- AutoVLA's reinforcement fine-tuning (RFT) refines its driving strategy, enabling the model to optimize its behavior actively rather than merely imitating human driving [30][35].

Group 3: Comparative Analysis
- OpenDriveVLA emphasizes perception-language alignment to improve the VLM's understanding of the 3D world, while AutoVLA focuses on language-decision integration to enhance the VLM's decision-making capabilities [32].
- The two models represent complementary approaches: OpenDriveVLA provides a robust perception foundation, while AutoVLA optimizes decision-making strategies through reinforcement learning [34].
- Future models may combine the strengths of both, pairing OpenDriveVLA's structured perception with AutoVLA's action tokenization and reinforcement learning to create a powerful autonomous driving system [36].
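AutoVLA's reformulation of trajectory planning as next-token prediction over discrete action codes can be illustrated with a toy tokenizer. This is only a sketch of the general idea: AutoVLA's real action codebook is learned, and the uniform displacement grid, bin counts, and helper names below are assumptions made for illustration.

```python
import numpy as np

# Hypothetical discretization: a uniform 32x32 grid over per-step (dx, dy)
# displacements in metres stands in for a learned action codebook.
BINS = np.linspace(-5.0, 5.0, 33)  # 33 edges -> 32 intervals per axis

def trajectory_to_tokens(traj):
    """Map a sequence of (x, y) waypoints to discrete action token ids."""
    deltas = np.diff(np.asarray(traj, dtype=float), axis=0)  # per-step motion
    ix = np.clip(np.digitize(deltas[:, 0], BINS) - 1, 0, 31)
    iy = np.clip(np.digitize(deltas[:, 1], BINS) - 1, 0, 31)
    return (ix * 32 + iy).tolist()  # one token id (0..1023) per step

def tokens_to_trajectory(tokens, start=(0.0, 0.0)):
    """Invert the mapping using bin centres (lossy, like any quantization)."""
    centres = (BINS[:-1] + BINS[1:]) / 2
    pts = [np.asarray(start, dtype=float)]
    for t in tokens:
        step = np.array([centres[t // 32], centres[t % 32]])
        pts.append(pts[-1] + step)
    return np.stack(pts)

tokens = trajectory_to_tokens([(0, 0), (1.0, 0.2), (2.1, 0.5)])
recon = tokens_to_trajectory(tokens)
```

Once trajectories are token sequences like this, planning becomes ordinary language-model decoding; quantization is lossy, so a learned codebook trades reconstruction error against vocabulary size.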
A 14x Cost Reduction! DiffCP: A New Diffusion-Model-Based Paradigm for Collaborative Perception Compression
自动驾驶之心· 2025-08-18 01:32
Core Viewpoint
- The article introduces DiffCP, a novel collaborative perception framework that uses conditional diffusion models to significantly reduce communication costs while maintaining high performance in collaborative sensing tasks [3][4][20].

Group 1: Introduction to Collaborative Perception
- Collaborative perception (CP) is emerging as a promising solution to the inherent limitations of independent intelligent systems, particularly in challenging wireless communication environments [3].
- Current C-V2X systems face significant bandwidth limitations, making it difficult to support feature-level and raw-data-level collaborative algorithms [3].

Group 2: DiffCP Framework
- DiffCP is the first collaborative perception architecture to employ conditional diffusion models to capture geometric correlations and semantic differences for efficient data transmission [4].
- The framework integrates prior knowledge, geometric relationships, and received semantic features to reconstruct collaborative perception information, introducing a new generative-model-based paradigm [4][5].

Group 3: Performance and Efficiency
- Experimental results indicate that DiffCP achieves robust perception performance in ultra-low-bandwidth scenarios, reducing communication costs by 14.5 times while maintaining state-of-the-art algorithm performance [4][20].
- DiffCP can be integrated into existing BEV-based collaborative algorithms for various downstream tasks, significantly lowering bandwidth requirements [4].

Group 4: Technical Implementation
- The framework uses a pre-trained BEV-based perception algorithm to extract BEV features, embedding diffusion time steps, relative spatial positions, and semantic vectors as conditions [5].
- An iterative denoising process integrates observations from the host vehicle with collaborative features to progressively recover the original collaborative perception features [8].

Group 5: Application in 3D Object Detection
- DiffCP was evaluated in a 3D object detection case study, matching the accuracy of state-of-the-art algorithms while reducing data rates by 14.5 times [20].
- The framework supports adaptive data rates through variable semantic vector lengths, improving performance in challenging scenarios [20].

Group 6: Conclusion
- DiffCP represents a significant advance in collaborative perception, enabling efficient information compression and reconstruction for collaborative sensing tasks and thus facilitating the deployment of connected intelligent transportation systems within existing wireless communication frameworks [22].
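DiffCP's iterative denoising, which fuses host-vehicle observations with a received semantic vector to recover the collaborator's BEV feature, can be sketched as a toy reverse loop. This is a schematic under strong assumptions: the real denoiser is a trained conditional network with a proper diffusion noise schedule, whereas the stand-in below simply contracts toward a conditioning mean to show the control flow.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x_t, t, ego_feat, semantic_vec):
    """Stand-in denoiser. The real DiffCP model is a trained network
    conditioned on the diffusion step t, relative pose, and the received
    semantic vector; here we just pull x_t toward a conditioning mean."""
    cond_mean = 0.5 * ego_feat + 0.5 * semantic_vec
    return x_t + 0.1 * (cond_mean - x_t)

def reconstruct(ego_feat, semantic_vec, steps=50):
    """Iterative reverse process: start from noise and progressively
    recover the collaborator's feature from host observations + semantics."""
    x = rng.standard_normal(ego_feat.shape)
    for t in reversed(range(steps)):
        x = denoise_step(x, t, ego_feat, semantic_vec)
    return x

ego = np.ones(8)                # host vehicle's own BEV feature (toy)
sem = np.full(8, 3.0)           # compressed semantic vector from collaborator
feat = reconstruct(ego, sem)    # converges toward the conditioned mean
```

The bandwidth saving comes from transmitting only the short semantic vector: the heavy lifting of reconstruction happens on the receiver side via this kind of conditioned generative loop.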
How Is Your Autumn Recruitment for the Class of 2026 Going?
自动驾驶之心· 2025-08-16 16:04
Core Viewpoint
- The article emphasizes the convergence of autonomous driving technology: a shift from numerous diverse approaches to unified models, which raises the technical barriers in the industry [1]

Group 1
- The industry is witnessing a trend in which the previously numerous directions requiring algorithm engineers are consolidating into unified models such as one model, VLM, and VLA [1]
- The article encourages building a large community to support individuals in the industry, highlighting the limits of individual effort [1]
- A new job- and industry-focused community is being launched to facilitate discussion of industry trends, company developments, product research, and job opportunities [1]
Autonomous Driving Paper Roundup | Visual Reconstruction, Radar-Vision Fusion, Reasoning, VLMs, and More
自动驾驶之心· 2025-08-16 09:43
Core Insights
- The article discusses two innovative approaches in autonomous driving technology: Dream-to-Recon for monocular 3D scene reconstruction and SpaRC-AD for radar-camera fusion in end-to-end autonomous driving [2][13].

Group 1: Dream-to-Recon
- Dream-to-Recon, developed at the Technical University of Munich, enables monocular 3D scene reconstruction using only a single image for training [2][6].
- The method integrates a pre-trained diffusion model with a depth network through a three-stage framework:
1. A View Completion Model (VCM) handles occlusion filling and image distortion correction, achieving a PSNR of 23.9 [2][6].
2. A Synthetic Occupancy Field (SOF) constructs dense 3D scene geometry from multiple synthetic views, with occlusion reconstruction accuracy (IE_acc) reaching 72%-73%, surpassing multi-view supervised methods by 2%-10% [2][6].
3. A lightweight distilled model converts the generated geometry into a real-time inference network, achieving overall accuracy (O_acc) of 90%-97% on KITTI-360/Waymo with a 70x speedup (75 ms/frame) [2][6].
- The method offers a new paradigm for efficient 3D perception in autonomous driving and robotics without complex sensor calibration [2][6].

Group 2: SpaRC-AD
- SpaRC-AD is the first radar-camera fusion baseline framework for end-to-end autonomous driving, also developed at the Technical University of Munich [13][16].
- The framework uses sparse 3D feature alignment and Doppler velocity measurements, achieving a 4.8% improvement in 3D detection mAP, an 8.3% increase in tracking AMOTA, a 4.0% reduction in motion prediction mADE, and a 0.11 m decrease in trajectory planning L2 error [13][16].
- The radar-based fusion strategy significantly improves performance across multiple tasks, including 3D detection, multi-object tracking, online mapping, and motion prediction [13][16].
- Comprehensive evaluations on the open-loop nuScenes and closed-loop Bench2Drive benchmarks demonstrate its advantages in perception range, motion modeling accuracy, and robustness under adverse conditions [13][16].
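The 23.9 PSNR figure quoted above for the View Completion Model is peak signal-to-noise ratio, a standard image fidelity metric. For reference, PSNR on a 0-255 scale can be computed as below; this is the generic textbook definition, not code from either paper.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB for images on a [0, max_val] scale."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((4, 4))
noisy = ref + 4.0                  # uniform error of 4 grey levels
print(round(psnr(ref, noisy), 1))  # prints 36.1
```

For intuition: a PSNR around 24 dB, as reported for the VCM, corresponds to a root-mean-square error of roughly 16 grey levels on an 8-bit scale, i.e. plausible but visibly imperfect completions.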
Many More Autonomous Driving Papers Accepted to ICCV 2025, and We've Noticed Some New Trends...
自动驾驶之心· 2025-08-16 00:03
Core Insights
- The article discusses the latest trends and research directions in autonomous driving, highlighting the integration of multimodal large models and vision-language action generation as key focus areas for both academia and industry [2][5].

Group 1: Research Directions
- The research community is concentrating on several key areas, including combining MoE (Mixture of Experts) with autonomous driving, benchmark development for autonomous driving, and trajectory generation with diffusion models [2].
- Closed-loop simulation and world models are emerging as critical needs in autonomous driving, driven by the limitations of real-world open-loop testing; this approach aims to reduce costs and improve model iteration efficiency [5].
- There is notable emphasis on performance improvement in object detection and OCC (occupancy prediction), with many ongoing projects exploring specific pain points and challenges in these areas [5].

Group 2: Notable Projects and Publications
- "ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation," from Huazhong University of Science and Technology and Xiaomi, focuses on integrating vision and language for action generation in autonomous driving [5].
- "All-in-One Large Multimodal Model for Autonomous Driving," from Sun Yat-sen University and Meituan, contributes to the development of comprehensive models for autonomous driving [6].
- "MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding," from Chongqing University, aims to improve understanding of driving scenarios through multimodal causal analysis [8].

Group 3: Simulation and Reconstruction
- "Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images," from TUM, focuses on advanced reconstruction techniques for autonomous driving [14].
- "CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving," from Fraunhofer IVI and TU Munich, addresses dynamic scene reconstruction [16].

Group 4: Trajectory Prediction and World Models
- "Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics," from the Hong Kong University of Science and Technology and Didi, emphasizes the importance of trajectory prediction in autonomous driving [29].
- "World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model," from the Chinese Academy of Sciences, focuses on developing a comprehensive world model for autonomous driving [32].
The Tech-Obsessed "Whampoa Academy" of Autonomous Driving Has Reached 4,000 Members!
自动驾驶之心· 2025-08-15 14:23
Core Viewpoint
- The article emphasizes the establishment of a comprehensive community focused on autonomous driving, aiming to bridge the gap between academia and industry while providing valuable resources for learning and career opportunities in the field [2][16].

Group 1: Community and Resources
- The community has created a closed-loop system covering various fields such as industry, academia, job seeking, and Q&A exchanges, enhancing the learning experience for participants [2][3].
- The platform offers cutting-edge academic content, industry roundtables, open-source code solutions, and timely job information, significantly reducing the time needed for research [3][16].
- Members can access nearly 40 technical routes, including industry applications, VLA benchmarks, and entry-level learning paths, catering to both beginners and advanced researchers [3][16].

Group 2: Learning and Development
- The community provides a well-structured learning path for beginners, including foundational knowledge in mathematics, computer vision, deep learning, and programming [10][12].
- For those already engaged in research, valuable industry frameworks and project proposals are available to further their understanding and application of autonomous driving technologies [12][14].
- Continuous job sharing and career opportunities are promoted within the community, fostering a complete ecosystem for autonomous driving [14][16].

Group 3: Technical Focus Areas
- The community has compiled extensive resources on various technical aspects of autonomous driving, including perception, simulation, planning, and control [16][17].
- Specific learning routes are available for topics such as end-to-end learning, 3DGS principles, and multimodal large models, ensuring comprehensive coverage of the field [16][17].
- The platform also features a collection of open-source projects and datasets relevant to autonomous driving, facilitating hands-on experience and practical application [32][34].
WeRide Secures Strategic Equity Investment from Grab, Partners to Deploy Robotaxis and Autonomous Shuttles in Southeast Asia
Globenewswire· 2025-08-15 09:18
Core Insights
- WeRide has announced a strategic equity investment from Grab to accelerate the deployment of Level 4 Robotaxis and shuttles in Southeast Asia, aiming to integrate WeRide's autonomous vehicles into Grab's network for improved service and safety [2][3][5]

Investment and Partnership
- Grab's investment is expected to be finalized by the first half of 2026, contingent on customary closing conditions and WeRide's preferred timing, supporting WeRide's growth strategy in Southeast Asia [3]
- The partnership builds on a prior Memorandum of Understanding signed in March 2025, focusing on the technical feasibility, commercial viability, and job creation potential of autonomous vehicles in the region [8]

Operational Integration
- The collaboration will establish a framework for deploying autonomous solutions across Grab's network, enhancing operational efficiency and scalability [4]
- WeRide will integrate its autonomous driving technology into Grab's fleet management, vehicle matching, and routing ecosystem [4][12]

Vision and Goals
- WeRide aims to deploy thousands of Robotaxis in Southeast Asia, aligned with local regulations and societal readiness, leveraging Grab's regional expertise in ride-hailing and digital services [5]
- Grab emphasizes the need for reliable transportation in Southeast Asia, particularly in areas with driver shortages, and plans to test WeRide's vehicles in diverse environments to adapt the technology to regional needs [6]

Technical Collaboration
- The partnership will focus on optimizing dispatch and routing, maximizing vehicle uptime, measuring safety performance, remote monitoring, customer support, and training for driver-partners and local communities [12]
The Long-Short Battle over Robotaxi: Cathie Wood Builds a Position While Institutions Diverge
Di Yi Cai Jing· 2025-08-15 03:45
Bullish and bearish voices intertwine, together pushing autonomous driving technology toward maturity.

Since the start of this year, Robotaxi (autonomous taxi) services have drawn broad attention from global capital markets, but skepticism has arrived right on cue.

Recently, Cathie Wood's ARK funds spent roughly $12.9 million buying shares of Pony.ai (NASDAQ: PONY), the first time her flagship fund has held a Chinese autonomous driving stock. Wall Street regards Wood as a "female Buffett," with an investment style that favors high growth, high risk, and long-term holding.

Another leading Chinese Robotaxi company, WeRide (NASDAQ: WRD), saw its Robotaxi business grow 836.7% year over year in the second quarter; as early as May this year, the company disclosed that Uber had committed an additional $100 million investment.

When this reporter recently tried Baidu's Luobo Kuaipao (Apollo Go) Robotaxi in Guangzhou, waits at peak hours ran as long as an hour with no car accepting the order. Asked how many vehicles were operating near the pickup point, Luobo Kuaipao customer service replied: "The number of serviceable vehicles in a city is not fixed; it is adjusted dynamically in response to many factors." According to nearby residents and merchants, Luobo Kuaipao wait times exceed 40 minutes during the evening rush.

Undeniably, at this stage both dispatch times and wait times for Robotaxis are longer than for human-driven ride-hailing, a problem the industry still needs to solve. Han Xu said that when an autonomous driving company opens up a new city, the autonomous...
Horizon Robotics & Tsinghua's Epona: An Autoregressive End-to-End World Model
自动驾驶之心· 2025-08-12 23:33
Core Viewpoint
- The article discusses a unified framework for autonomous driving world models that generates long-horizon, high-resolution video while providing real-time trajectory planning, addressing limitations of existing methods [5][12].

Group 1: Existing Methods and Limitations
- Current diffusion models, such as Vista, can only generate fixed-length videos (≤15 seconds) and struggle with flexible long-term prediction (>2 minutes) and multimodal trajectory control [7].
- GPT-style autoregressive models, such as GAIA-1, can extend indefinitely but must discretize images into tokens, which degrades visual quality and lacks continuous action trajectory generation [7][13].

Group 2: Proposed Methodology
- The proposed world model uses a sequence of forward-camera observations and corresponding driving trajectories to predict future driving dynamics [10].
- The framework decouples spatiotemporal modeling, using causal attention in a GPT-style transformer for temporal structure and dual diffusion transformers for spatial rendering and trajectory generation [12].
- An asynchronous multimodal generation mechanism produces 3-second trajectories and the next frame in parallel, achieving 20 Hz real-time planning with a 90% reduction in inference compute [12].

Group 3: Model Structure and Training
- The Multimodal Spatiotemporal Transformer (MST) encodes past driving scenes and action sequences, with enhanced temporal position encoding for implicit representation [16].
- The Trajectory Planning Diffusion Transformer (TrajDiT) and Next-frame Prediction Diffusion Transformer (VisDiT) handle trajectory and image prediction, respectively, with a focus on action control [21].
- A chain-of-forward training strategy mitigates the "drift problem" in autoregressive inference by simulating prediction noise during training [24].

Group 4: Performance Evaluation
- The model demonstrates superior video generation metrics, achieving an FID of 7.5 and an FVD of 82.8, outperforming several existing models [28].
- On trajectory control metrics, the proposed method reaches an accuracy of 97.9%, higher than competing methods [34].

Group 5: Conclusion and Future Directions
- The framework integrates high-quality image generation with vehicle trajectory prediction, showing strong potential for closed-loop simulation and reinforcement learning [36].
- However, the current model is limited to single-camera input, leaving multi-camera consistency and point cloud generation as open challenges for the autonomous driving field [36].
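The chain-of-forward strategy described above resembles scheduled-sampling-style training: the model consumes its own perturbed predictions rather than ground truth during rollout, so it learns to recover from the errors it will actually see at inference time. Below is a minimal sketch of that idea only; the one-step predictor is a toy stand-in (Epona's real predictor is a diffusion transformer conditioned on past frames and actions), and the decay dynamics and noise scale are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def model_step(state):
    """Stand-in one-step predictor; toy dynamics that decay toward zero."""
    return 0.9 * state

def chain_of_forward_rollout(x0, horizon, noise_scale=0.05):
    """Training-time rollout: feed the model its own predictions, perturbed
    to mimic inference-time error, instead of ground-truth states. This
    exposes the network to the state distribution it will encounter when
    rolling out autoregressively, mitigating the drift problem."""
    states = [x0]
    x = x0
    for _ in range(horizon):
        x = model_step(x) + noise_scale * rng.standard_normal(x.shape)
        states.append(x)
    return states

# A 10-step training rollout starting from a unit state vector.
traj = chain_of_forward_rollout(np.ones(4), horizon=10)
```

In actual training, a loss would compare each rolled-out state against the ground-truth future before backpropagating through the chain; the key design choice is injecting noise at every step rather than teacher-forcing clean states.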
Pony AI (PONY) - 2025 Q2 - Earnings Call Transcript
2025-08-12 13:02
Financial Data and Key Metrics Changes
- Total revenues for Q2 reached $21.5 million, a 76% increase year over year, driven by strong growth in robotaxi services and licensing applications [39][41]
- Robotaxi service revenues grew to $1.5 million, a 158% year-over-year increase, with fare-charging revenues expanding by over 300% [39][40]
- Gross margin improved to 16.1%, with gross profit of $3.5 million in Q2 [42]
- Net loss for Q2 was $53.3 million, up from $30.9 million in the same period last year [44]

Business Line Data and Key Metrics Changes
- Robotaxi service revenues surged by 150% year over year, with fare-charging revenues growing more than 300% [15][39]
- Licensing and application revenues reached $10.4 million, a 902% increase year over year [41]
- Global truck services revenue decreased by 10% year over year [41]

Market Data and Key Metrics Changes
- Registered users surged by 136% year over year in Q2, with a user satisfaction rate above 4.8 out of 5 [8][17]
- The company operates across 2,000 square kilometers in Tier 1 cities in China, significantly expanding its market reach [56]

Company Strategy and Development Direction
- The company aims for mass production of Gen 7 robotaxis, targeting over 1,000 vehicles by year-end 2025 [7][23]
- A strategic partnership with Hehu Group aims to deploy over 1,000 robotaxis in Shenzhen [16]
- The focus is on scaling up operations and enhancing user experience to drive higher demand [23][56]

Management's Comments on Operating Environment and Future Outlook
- Management expressed confidence in achieving positive unit economics for Gen 7 vehicles, citing significant cost reductions and operational efficiencies [51]
- The company is well positioned for large-scale commercialization, with a solid plan and execution strategy in place [45][36]

Other Important Information
- The company has secured Shanghai's first fully driverless commercial license, enabling operations in all four Tier 1 cities [18][32]
- The bill-of-materials cost of Gen 7 robotaxis has been reduced by 70% compared with previous generations [51]

Q&A Session Summary
Question: Production plan throughout 2025
- Management confirmed they are on track to exceed 1,000 robotaxi vehicles by year-end, with over 200 already produced [47][49]
Question: Key drivers behind robotaxi revenue growth
- Management highlighted user adoption, demand in Tier 1 cities, and an increased fleet as key drivers of revenue growth [55][56]
Question: Impact of government comments on the L4 robotaxi industry
- Management noted that recent comments clarify the distinction between L2 and L4 systems, which benefits public understanding and safety standards [60][62]
Question: Key technical requirements for new market expansion
- Management emphasized the ability to handle corner cases and the robustness of their software system as critical for entering new geographies [66][68]
Question: Timetable for a potential Hong Kong IPO
- Management declined to comment on market speculation but said they are monitoring market conditions closely [73][74]
Question: Future plans for overseas market expansion
- Management outlined a focus on markets with strong mobility demand and supportive regulatory environments, with ongoing operations in Dubai, South Korea, and Luxembourg [78][80]