自动驾驶之心
Specializing in Long-Tail Scenarios! Tongji's CoReVLA: A New Two-Stage End-to-End Framework
自动驾驶之心· 2025-09-23 23:32
Autonomous driving still shows significant weaknesses in long-tail scenarios (low-frequency, high-risk safety-critical situations). Though rare, these scenarios account for a large share of autonomous driving accidents and cause driver takeover rates to spike. Traditional modular systems (staged perception-prediction-planning) suffer from error accumulation: small errors in each stage are progressively amplified, capping overall performance. End-to-end methods, which map sensor inputs directly to control actions or ego trajectories, offer stronger adaptability and unified optimization, and are seen as a promising route to handling long-tail scenarios. Current end-to-end methods fall into two broad categories, but neither copes well with long-tail scenarios.

CoReVLA core design: the "Collect-and-Refine" two-stage framework
To address these problems, CoReVLA proposes a continual-learning two-stage framework that improves long-tail decision-making through a loop of data collection (Collect) and behavior refinement (Refine). As shown in Figure 1, the pipeline comprises a pre-stage (SFT), Stage 1 (takeover data collection), and Stage 2 (DPO optimization).

Pre-stage: supervised fine-tuning (SFT) on QA data
The goal of this stage is to give the VLA model foundational autonomous driving knowledge, laying the groundwork for subsequent long-tail scenario learning. $${\mathcal{L}}_{SFT}=-\sum_{i=1}^{N}\sum ...$$
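The truncated loss is consistent with the standard token-level cross-entropy used for supervised fine-tuning; a hedged reconstruction (the symbols $T_i$, $x^{(i)}$, and $y_t^{(i)}$ for answer length, QA prompt, and answer tokens are our assumptions, not taken from the excerpt), alongside the standard DPO objective that Stage 2 refers to by name:

$$\mathcal{L}_{SFT}=-\sum_{i=1}^{N}\sum_{t=1}^{T_i}\log p_{\theta}\left(y_t^{(i)}\mid x^{(i)},\,y_{<t}^{(i)}\right)$$

$$\mathcal{L}_{DPO}=-\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_{\theta}(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

In the DPO form, $y_w$ and $y_l$ would plausibly be the preferred (expert takeover) and rejected (model) behaviors; whether CoReVLA uses exactly this pairing is an inference from the takeover-data setup, not stated in the excerpt.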
"World models can fundamentally solve VLA systems' dependence on data" is a false proposition...
自动驾驶之心· 2025-09-23 11:37
Core Viewpoint - The article discusses the ongoing debate between two approaches in the autonomous driving sector: VLA (Vision-Language Action) and WA (World Model), highlighting that both are fundamentally reliant on data but differ in their methodologies and implications for the future of autonomous driving [1][2].

Summary by Sections

VLA vs. WA
- The autonomous driving landscape is splitting into two camps by 2025: companies like Xiaopeng, Li Auto, and Yuanrong Qixing are betting on the VLA approach, while Huawei and NIO are advocating for the WA model [1].
- WA is claimed to be the ultimate solution for achieving true autonomous driving, but the article argues that it is merely a rebranding of data dependency [1].

Data Dependency
- Both VLA and WA rest on the premise that "data determines the upper limit" of capabilities [2].
- VLA relies on real-world multimodal data to train reasoning abilities, while WA requires a combination of real and simulated data to enhance its capabilities [2].
- The industry conflates "data form" with "data essence," leading to misconceptions about the reliance on data [2].

Industry Misconceptions
- The article emphasizes that the discussion should focus not on whether data is needed, but on how to utilize data efficiently [2].
- VLA and WA represent different methods of data collection and usage, with data remaining the core competitive advantage in autonomous driving until true artificial intelligence is realized [2].

Community and Resources
- The "Autonomous Driving Knowledge Planet" community has over 4,000 members and aims to grow to nearly 10,000 within two years, providing a platform for technical exchange and knowledge sharing in the autonomous driving field [4][10].
- The community offers learning routes, technical discussions, and access to industry experts, facilitating knowledge sharing among newcomers and advanced practitioners [4][11].
FAW Officially Acquires DJI's Zhuoyue! Automakers That Fell Behind on Intelligent Driving Are Racing to Catch Up...
自动驾驶之心· 2025-09-23 03:44
Core Viewpoint - The acquisition of DJI's automotive spin-off, Zhuoyue Technology, by China First Automobile Works (FAW) marks a significant development in the autonomous driving sector, enhancing FAW's competitive edge in smart driving technology [1][5].

Group 1: Acquisition Details
- On September 22, the State Administration for Market Regulation announced FAW's acquisition of Zhuoyue Technology [1].
- Zhuoyue, previously part of DJI's automotive division, has evolved from low-compute, high cost-performance solutions to mid-to-high-end computing platforms, introducing lidar solutions and integrated cockpit technologies [3].

Group 2: Financial and Market Impact
- Zhuoyue has raised over 2.5 billion yuan in funding from automotive companies and institutions, including BYD and SAIC [3].
- It is projected that by 2025 around 2 million vehicles will be equipped with DJI's automotive intelligent driving systems, with partnerships expected to expand to 5 million vehicles within 3-5 years [5].

Group 3: Strategic Implications
- The acquisition benefits both parties: FAW gains Zhuoyue's technological advantages in intelligent driving, enhancing the competitiveness of its vehicle lineup and accelerating its transition to smart driving [5][6].
- Zhuoyue's growth reflects a decade of advancements in autonomous driving technology, positioning it as a significant player outside of Huawei's influence in the sector [7][8].
A Survey of 3D Reconstruction: The Evolution from Multi-View Geometry to NeRF and 3DGS
自动驾驶之心· 2025-09-22 23:34
Core Viewpoint - 3D reconstruction is a critical intersection of computer vision and graphics, serving as the digital foundation for cutting-edge applications such as virtual reality, augmented reality, autonomous driving, and digital twins. Recent advancements in novel view synthesis, represented by Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have significantly improved reconstruction quality, speed, and dynamic adaptability [5][6].

Group 1: Introduction and Demand
- The resurgence of interest in 3D reconstruction is driven by new application demands: city-scale digital twins requiring kilometer-level coverage with centimeter-level accuracy, autonomous driving simulations needing dynamic traffic flow and real-time semantics, and AR/VR social applications demanding over 90 FPS at photo-realistic quality [6].
- Traditional reconstruction pipelines are inadequate for these requirements, prompting the integration of geometry, texture, and lighting through differentiable rendering techniques [6].

Group 2: Traditional Multi-View Geometry Reconstruction
- The traditional multi-view geometry pipeline (SfM to MVS) has inherent limitations in quality, efficiency, and adaptability to dynamic scenes, which have been addressed through iterative advancements in NeRF and 3DGS [7].
- A comprehensive comparison of methods highlights the evolution and remaining challenges in the field of 3D reconstruction [7].

Group 3: NeRF and Its Innovations
- NeRF models scenes as continuous 5D functions, enabling rendering techniques that evolved significantly from 2020 to 2024, addressing data requirements, texture limitations, lighting sensitivity, and dynamic scene handling [13][15].
- Methods such as Mip-NeRF, NeRF-W, and InstantNGP have enhanced quality and efficiency, each contributing faster rendering and reduced memory usage [17][18].

Group 4: 3DGS and Its Advancements
- 3DGS represents scenes as collections of 3D Gaussians, allowing efficient rendering and high-quality output. Recent methods have optimized rendering quality and efficiency, achieving significant improvements in memory usage and frame rates [22][26].
- Comparisons with other methods show 3DGS's superiority in rendering speed and dynamic scene reconstruction [31].

Group 5: Future Trends and Conclusion
- The next five years are expected to bring hybrid representations, real-time processing on mobile devices, generative reconstruction techniques, and multi-modal fusion for robust reconstruction [33].
- The ultimate goal is real-time 3D reconstruction accessible to everyone, marking a shift toward ubiquitous computing [34].
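The "continuous 5D function" view of NeRF becomes concrete in its volume-rendering step: a pixel's color is a transmittance-weighted sum of sampled densities and colors along the camera ray. A minimal NumPy sketch of this standard quadrature (illustrative only; the function and variable names are ours, not taken from any NeRF codebase):

```python
import numpy as np

def composite_ray(sigmas, rgbs, deltas):
    """Volume-rendering quadrature used by NeRF-style methods.

    sigmas: (S,) volume densities sampled along one ray
    rgbs:   (S, 3) colors at the same sample points
    deltas: (S,) distances between adjacent samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)        # opacity of each segment
    trans = np.cumprod(1.0 - alphas + 1e-10)       # survival probability after each segment
    trans = np.concatenate([[1.0], trans[:-1]])    # T_i = prod_{j<i} (1 - alpha_j)
    weights = trans * alphas                       # contribution of each sample
    return (weights[:, None] * rgbs).sum(axis=0)   # expected color C(r)
```

An opaque sample (very large density) returns its own color; empty space (zero density) contributes nothing, which is what makes the sum a differentiable proxy for ray-surface intersection.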
Urgently Needed: A Cost-Effective 3D Laser Scanner!
自动驾驶之心· 2025-09-22 23:34
Core Viewpoint - The article introduces the GeoScan S1, a highly cost-effective 3D laser scanner designed for industrial and educational applications, featuring a lightweight design, easy operation, and high-precision 3D scene reconstruction [1][9].

Product Features
- The GeoScan S1 offers a point cloud generation rate of 200,000 points per second, a measurement range of up to 70 meters, and supports scanning areas exceeding 200,000 square meters [1][29].
- It integrates multiple sensors and supports cross-platform integration, providing flexibility for various research and development applications [1][44].
- The device runs a handheld Ubuntu system with various sensor devices and easy power-supply management [3].

User Experience
- The scanner has a low entry barrier, with simple one-button operation for scanning tasks and immediately usable exported results [5].
- It features real-time modeling and high-precision mapping, producing color-rich point cloud data [27][34].

Technical Specifications
- The GeoScan S1 supports real-time point cloud mapping with relative accuracy better than 3 cm and absolute accuracy better than 5 cm [22].
- It measures 14.2 cm x 9.5 cm x 45 cm, weighs 1.3 kg without the battery, and has a battery life of approximately 3 to 4 hours [22][26].

Market Position
- The product is positioned as the most cost-effective option on the market, with a starting price of 19,800 yuan for the basic version [9][57].
- Versions include a depth camera version and online/offline 3DGS versions, catering to different user needs [57].

Application Scenarios
- The GeoScan S1 suits a wide range of environments, including office buildings, parking lots, industrial parks, tunnels, forests, and mining sites, effectively completing 3D scene mapping [38][46].
Results Are Out! NeurIPS 2025 Paper Roundup (Autonomous Driving / Large Models / Embodied AI / RL, and More)
自动驾驶之心· 2025-09-22 23:34
Core Insights - The article discusses the recent announcements from NeurIPS 2025, focusing on advancements in autonomous driving, visual perception reasoning, large model training, embodied intelligence, reinforcement learning, video understanding, and code generation [1].

Autonomous Driving
- The article highlights various research papers related to autonomous driving, including "FutureSightDrive" and "AutoVLA," which explore visual reasoning and end-to-end driving models [2][4].
- A collection of papers and code from institutions such as Alibaba, UCLA, and Tsinghua University showcases the latest developments in the field [6][7][13].

Visual Perception Reasoning
- "SURDS" benchmarks spatial understanding and reasoning in driving scenarios using vision-language models [11].
- "OmniSegmentor" offers a flexible multi-modal learning framework for semantic segmentation [16].

Large Model Training
- Papers cover scaling offline reinforcement learning and fine-tuning techniques [40][42].
- The article emphasizes the importance of adaptive methods for improving model performance in various applications [44].

Embodied Intelligence
- Highlighted research includes "Self-Improving Embodied Foundation Models" and "ForceVLA," which enhance models for contact-rich manipulation [46][48].

Video Understanding
- The "PixFoundation 2.0" project investigates the use of motion in visual grounding [28][29].

Code Generation
- Developments include "Fast and Fluent Diffusion Language Models" and "Step-By-Step Coding for Improving Mathematical Olympiad Performance" [60].
FlowDrive: An Interpretable End-to-End Framework with Soft and Hard Constraints (SJTU & Bosch)
自动驾驶之心· 2025-09-22 23:34
Core Insights - The article introduces FlowDrive, a novel end-to-end driving framework that integrates energy-based flow field representation, adaptive anchor trajectory optimization, and motion-decoupled trajectory generation to enhance safety and interpretability in autonomous driving [4][45].

Group 1: Introduction and Background
- End-to-end autonomous driving has gained attention for its potential to simplify traditional modular pipelines and leverage large-scale data for joint learning of perception, prediction, and planning tasks [4].
- A mainstream research direction generates Bird's Eye View (BEV) representations from multi-view camera inputs, providing structured spatial views beneficial for downstream planning tasks [4][6].

Group 2: FlowDrive Framework
- FlowDrive introduces energy-based flow fields in the BEV space to explicitly model geometric constraints and rule-based semantics, enhancing the effectiveness of BEV representations [7][15].
- A flow-aware anchor trajectory optimization module aligns initial trajectories with safe, goal-oriented areas, improving spatial effectiveness and intention consistency [15][22].
- A task-decoupled diffusion planner separates high-level intention prediction from low-level trajectory denoising, allowing targeted supervision and flow-field-conditioned decoding [9][27].

Group 3: Experimental Results
- Experiments on the NAVSIM v2 benchmark demonstrate that FlowDrive achieves state-of-the-art performance, with an Extended Predictive Driver Model Score (EPDMS) of 86.3, surpassing previous benchmark methods [3][40].
- FlowDrive shows significant advantages in safety-related metrics such as Drivable Area Compliance (DAC) and Time to Collision (TTC), indicating superior adherence to driving constraints and hazard avoidance [40][41].
- Ablation studies validate the framework: removing any core component leads to significant declines in overall performance [43][47].

Group 4: Technical Details
- The flow field learning module encodes dense, physically interpretable spatial gradients to provide fine-grained guidance for trajectory planning [20][21].
- The perception module uses a Transformer-based architecture to fuse multi-modal sensor inputs into a compact, semantically rich BEV representation [18][37].
- Training uses a composite loss function that supervises trajectory planning, anchor trajectory optimization, flow field modeling, and auxiliary perception tasks [30][31][32][34].
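To make the flow-field idea tangible: a BEV energy map whose negative gradient points toward safe, goal-consistent regions can be used to nudge anchor waypoints downhill. The sketch below is a hand-crafted toy under that assumption, not FlowDrive's learned module; all names (`refine_anchors`, `energy`) are hypothetical:

```python
import numpy as np

def refine_anchors(anchors, energy, lr=0.5, steps=10):
    """Gradient-descent refinement of anchor waypoints on a BEV energy grid.

    anchors: (K, 2) (row, col) waypoints on the grid
    energy:  (H, W) scalar field (high = undesirable, e.g. off the drivable area)
    """
    gy, gx = np.gradient(energy)                   # spatial gradients per axis
    pts = anchors.astype(float)
    for _ in range(steps):
        # sample the gradient at each waypoint's nearest cell
        r = np.clip(pts[:, 0].round().astype(int), 0, energy.shape[0] - 1)
        c = np.clip(pts[:, 1].round().astype(int), 0, energy.shape[1] - 1)
        pts -= lr * np.stack([gy[r, c], gx[r, c]], axis=1)  # descend the field
    return pts
```

In FlowDrive the field is predicted by a network and the refinement is trained end-to-end; the toy only illustrates why a dense gradient field gives finer-grained guidance than sparse waypoint supervision.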
In Autonomous Driving, Should You Take a Job, Pursue a PhD, or Switch Fields?
自动驾驶之心· 2025-09-22 10:30
Core Viewpoint - The article discusses the decision-making process for individuals in the autonomous driving field weighing whether to pursue a PhD, continue working, or switch careers, emphasizing the importance of foundational knowledge and practical experience in the industry [2][3].

Group 1: Career Decisions
- Two critical questions face individuals considering a career in autonomous driving: whether their current environment provides foundational knowledge and practical experience, and whether they are ready to take on pioneering research roles if pursuing a PhD [2][3].
- Many academic mentors lack deep expertise in autonomous driving, which can hinder students' development if they do not have a solid foundation [2].
- Students should assess their preparedness to independently explore and solve problems, especially in cutting-edge research areas where few references exist [2][3].

Group 2: Community and Resources
- The "Autonomous Driving Heart Knowledge Planet" community is introduced as a resource for beginners, offering a comprehensive platform for learning, knowledge sharing, and networking within the autonomous driving field [3][5].
- The community has over 4,000 members and aims to grow to nearly 10,000 in the next two years, providing a space for technical sharing and job-seeking interactions [3][5].
- Practical questions addressed within the community include entry points for end-to-end systems, multi-modal models, and the latest industry trends [5][16].

Group 3: Learning and Development
- The community offers a structured learning system with over 40 technical routes covering perception, simulation, planning and control, and other aspects of autonomous driving [7][14].
- It provides access to numerous resources, including video tutorials, technical discussions, and job opportunities, aimed at both beginners and those looking to advance their skills [8][18].
- The community also facilitates connections with industry leaders and experts, enhancing members' understanding of the latest developments and job market trends in autonomous driving [12][92].
What Stage Has Autonomous Driving VLA Reached? Is It Still a Good Time to Do Research?
自动驾驶之心· 2025-09-22 08:04
Core Insights
- The article discusses the transition in intelligent driving technology from rule-driven to data-driven approaches, highlighting the emergence of VLA (Vision-Language Action) as a more straightforward and effective method than traditional end-to-end systems [1][2].
- Challenges in the current VLA technology stack include the complexity and fragmentation of knowledge, which make it difficult for newcomers to enter the field [2][3].
- A new practical course on VLA has been developed to address these challenges, providing a structured learning path for students interested in advanced autonomous driving knowledge [3][4][5].

Summary by Sections

Introduction to VLA
- VLA is introduced as a significant advancement in autonomous driving, offering a cleaner approach than traditional end-to-end systems while also addressing corner cases more effectively [1].

Challenges in Learning VLA
- Learners face difficulties navigating the complex, fragmented knowledge landscape of VLA, which includes a plethora of algorithms and a lack of high-quality documentation [2].

Course Development
- A new course titled "Autonomous Driving VLA Practical Course" provides a comprehensive overview of the VLA technology stack, aiming to ease entry into the field [3][4].

Course Features
- The course addresses key pain points, offering quick entry into the subject through accessible language and examples [3].
- It builds a framework for understanding VLA research and enhances research capabilities by teaching students how to categorize papers and extract innovative points [4].
- Practical components ensure that theoretical knowledge is applied effectively in real-world scenarios [5].

Course Outline
- The course covers the origins of VLA, foundational algorithms, and the differences between modular and integrated VLA systems [6][15][19][20].
- It includes practical coding exercises and projects to reinforce learning and application of concepts [22][24][26].

Instructor Background
- The course is led by experienced instructors with strong backgrounds in multi-modal perception, autonomous driving, and large model frameworks [27].

Learning Outcomes
- Upon completion, students are expected to understand current advancements in VLA and its core algorithms, and to be able to apply their knowledge in practical settings [28][29].
NeurIPS'25 Spotlight! FSDrive, a New Autonomous Driving Paradigm: VLA + World Model in Tandem (Alibaba & XJTU)
自动驾驶之心· 2025-09-21 23:32
Core Insights
- The article discusses the development of a spatio-temporal Chain-of-Thought (CoT) reasoning method for Vision-Language Models (VLMs) in autonomous driving, emphasizing the need for visual reasoning over symbolic logic [1][4][24].
- It introduces a unified pre-training paradigm that enhances the visual generation capabilities of VLMs while maintaining their semantic understanding [6][24].

Summary by Sections

Introduction
- Multi-modal large language models (MLLMs) have shown exceptional performance in knowledge and reasoning, leading to their application in autonomous driving [4].
- The end-to-end Vision-Language-Action (VLA) model simplifies system architecture and minimizes information loss by directly generating vehicle control commands from visual observations and language instructions [4].

Methodology
- The spatio-temporal CoT method allows VLMs to visualize and plan trajectories by generating unified image frames that predict future states, incorporating spatial and temporal relationships [5][11].
- The method integrates visual cues and physical constraints to guide the model's attention toward drivable areas and key objects, enhancing trajectory planning [5][16].

Pre-training Paradigm
- A new pre-training approach combines visual understanding and generation, allowing VLMs to predict future frames while adhering to physical laws [6][12].
- The gradual image generation method has the model first predict coarse-grained visual cues before generating detailed future frames, maintaining physical realism [15][24].

Experimental Results
- Extensive experiments validate the effectiveness of the FSDrive framework in trajectory planning, future frame generation, and scene understanding, demonstrating its advancement toward visual reasoning in autonomous driving [11][24].

Conclusion
- FSDrive establishes an end-to-end visual reasoning pipeline that unifies future scene generation and perception results, effectively bridging the semantic gap caused by cross-modal conversions [24].