自动驾驶之心
A new end-to-end paradigm! Fudan's VeteranAD: "perception-in-plan" sets new open- and closed-loop SOTA, surpassing DiffusionDrive
自动驾驶之心· 2025-08-21 23:34
Core Insights
- The article introduces a novel "perception-in-plan" paradigm for end-to-end autonomous driving, implemented in the VeteranAD framework, which integrates perception directly into the planning process and makes planning optimization more effective [5][39].
- VeteranAD demonstrates superior performance on the challenging NAVSIM and Bench2Drive benchmarks, showing that tightly coupling perception and planning improves both accuracy and safety [12][39].

Summary by Sections

Introduction
- The article surveys recent advances in end-to-end autonomous driving and argues for unifying multiple tasks in a single framework to prevent information loss across stages [2][3].

Proposed Framework
- The VeteranAD framework embeds perception into planning, allowing the perception module to operate in closer alignment with planning needs [5][6].
- It consists of two core modules, Planning-Aware Holistic Perception and Localized Autoregressive Trajectory Planning, which together enhance end-to-end planning performance [12][39].

Core Modules
- **Planning-Aware Holistic Perception**: interacts across three dimensions (image features, BEV features, and surrounding traffic features) to build a comprehensive understanding of traffic elements [6].
- **Localized Autoregressive Trajectory Planning**: generates future trajectories autoregressively, progressively refining the planned trajectory based on perception results [6][16].

Experimental Results
- VeteranAD achieved a PDM Score of 90.2 on the NAVSIM navtest split, outperforming previous learning-based methods [21].
- In open-loop evaluation it recorded an average L2 error of 0.60, surpassing all baselines, while remaining competitive in closed-loop evaluation [25][33].

Ablation Studies
- Guiding points from anchored trajectories are crucial for accurate planning; removing them significantly degrades performance [26].
- Combining the two core modules yields further gains, highlighting their complementary nature [26].

Conclusion
- The "perception-in-plan" design significantly improves end-to-end planning accuracy and safety, paving the way for more efficient and reliable autonomous driving systems [39].
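The localized autoregressive planning idea can be sketched in a few lines: generate the trajectory waypoint by waypoint, each step conditioned on what came before. The snippet below is only a toy illustration under assumed dynamics (a fixed goal and a hand-picked step fraction); VeteranAD's actual module would instead query perception features around each guiding point before extending the trajectory.

```python
import numpy as np

def plan_autoregressive(start, goal, n_steps=6, step_frac=0.5):
    """Toy sketch of localized autoregressive planning: each step extends
    the trajectory from its current endpoint, moving a fraction of the
    remaining distance toward the goal, so the path is refined waypoint
    by waypoint rather than predicted in one shot."""
    goal = np.asarray(goal, dtype=float)
    waypoints = [np.asarray(start, dtype=float)]
    for _ in range(n_steps):
        current = waypoints[-1]
        # A real planner would condition this update on perception
        # features gathered near the current guiding point.
        waypoints.append(current + step_frac * (goal - current))
    return np.stack(waypoints)

traj = plan_autoregressive([0.0, 0.0], [10.0, 4.0])
print(traj.shape)  # (7, 2): the start point plus six refined waypoints
```

Each iteration only plans locally (the next waypoint), which is the structural point of the autoregressive design: later waypoints can react to what the earlier ones revealed.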
Hands-on with DeepSeek V3.1: more than just a longer context window
自动驾驶之心· 2025-08-21 23:34
Core Viewpoint
- The article compares DeepSeek V3.1 with its predecessor V3, highlighting improvements in programming capability, creative writing, translation quality, and response tone.

Group 1: Model Comparison
- DeepSeek V3.1 extends its context length to 128K tokens, up from V3's 65K, allowing for more comprehensive responses [10].
- The new version shows clear gains across programming, creative writing, translation, and knowledge-application tasks [3][4].

Group 2: Programming Capability
- In a programming test, V3.1 produced a more comprehensive solution for compressing GIF files, considering more factors and providing detailed usage instructions [12][13][14].
- V3.1 also executed the task notably faster than V3 [18].

Group 3: Creative Writing
- On a creative-writing task based on a high-school essay prompt, V3.1 produced a more poetic and emotional response, in contrast to V3's plainer style [22].

Group 4: Translation Quality
- When translating a scientific abstract, V3.1 handled complex sentences better, though it missed a simple word, indicating room for improvement [30].

Group 5: Knowledge Application
- Both versions answered a niche question about a specific fruit variety, with V3.1 showing some inconsistencies in terminology and relevance [31][37].

Group 6: Performance Metrics
- V3.1 scored 71.6% on the Aider benchmark, outperforming Claude Opus 4 while being significantly cheaper [43].
- On SVGBench, V3.1 was the best DeepSeek variant, though it still did not surpass the best open models [44].

Group 7: User Feedback
- Users have reported various observations about V3.1, including improvements in physical understanding and the introduction of new tokens [45][47].
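The GIF-compression prompt makes a concrete benchmark. The article does not reproduce either model's answer, so the snippet below is only a plausible baseline of the kind of solution being compared: downscale every frame with Pillow and re-save with palette optimization. The `scale` parameter and the Pillow-based approach are assumptions, not either model's output.

```python
import io
from PIL import Image, ImageSequence

def compress_gif(src_bytes: bytes, scale: float = 0.5) -> bytes:
    """Shrink a GIF by downscaling every frame and re-saving with
    Pillow's palette optimization. A minimal hand-written baseline,
    not the article's model-generated solution."""
    im = Image.open(io.BytesIO(src_bytes))
    frames = []
    for frame in ImageSequence.Iterator(im):
        w, h = frame.size
        new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
        # P-mode frames are resized with nearest-neighbour sampling.
        frames.append(frame.convert("P").resize(new_size))
    out = io.BytesIO()
    frames[0].save(out, format="GIF", save_all=True,
                   append_images=frames[1:], optimize=True, loop=0)
    return out.getvalue()
```

A fuller answer of the kind V3.1 reportedly gave would also weigh frame dropping, palette size reduction, and dithering trade-offs; this sketch covers only the resize path.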
Is Li Auto's VLA really a VLA?
自动驾驶之心· 2025-08-21 23:34
Core Viewpoint
- The article examines the capabilities of the MindVLA model in autonomous driving, emphasizing its more advanced scene understanding and decision-making compared with traditional E2E models.

Group 1: VLA Capabilities
- The VLA model drives defensively in scenarios with obstructed views, smoothly adjusting speed based on the remaining distance [4][5].
- In congested traffic, VLA makes better decisions, choosing to change lanes rather than following the typical detour logic of E2E models [7].
- VLA centers better in non-standard lane widths, markedly reducing erratic driving patterns [9][10].

Group 2: Scene Understanding
- VLA's decision-making reflects a deeper understanding of traffic scenarios, enabling more efficient lane changes and route selection [11].
- Its stable trajectory generation is attributed to the use of diffusion models, which improves performance across driving conditions [10].

Group 3: Comparison with E2E Models
- E2E models struggle with nuanced driving behaviors and often produce abrupt maneuvers, while VLA responds more smoothly and with greater context awareness [3][4].
- VLA's architecture allows parallel optimization across different scenarios, enabling faster iteration than E2E models [12].

Group 4: Limitations and Future Considerations
- Despite its advances, VLA remains an assistive driving technology rather than full autonomy and still requires human intervention in certain situations [12].
- The article raises open questions about performance in specific scenarios, pointing to areas for further refinement [12].
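The claim that diffusion models stabilize trajectory generation can be illustrated with a toy denoising loop: start from noise and iteratively pull the sample toward a feasible path while shrinking the injected noise. Everything below (the anchor trajectory, the hand-written drift, the noise schedule) is an illustrative stand-in for the learned, scene-conditioned denoising network a real diffusion planner would use.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_trajectory(anchor, n_steps=10):
    """Toy diffusion-style sampler: begin with a noisy trajectory and
    repeatedly drift it toward the anchor while shrinking the injected
    noise, so successive samples become smooth and stable."""
    traj = anchor + rng.normal(scale=1.0, size=anchor.shape)
    for t in range(n_steps):
        remaining = 1.0 - t / n_steps
        traj = traj + 0.5 * (anchor - traj)                      # deterministic drift
        traj = traj + rng.normal(scale=0.1 * remaining,
                                 size=anchor.shape)              # shrinking noise
    return traj

# A straight 8-waypoint path standing in for a learned scene prior.
anchor = np.stack([np.linspace(0, 10, 8), np.zeros(8)], axis=1)
sample = denoise_trajectory(anchor)
```

The stability argument falls out of the structure: because each step only removes noise rather than regressing the whole path at once, small input perturbations produce small output changes.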
Without efficient technical and industry information channels, a lot of time gets wasted...
自动驾驶之心· 2025-08-21 23:34
Core Insights
- The article stresses the importance of efficient information-gathering channels for people transitioning into the autonomous driving industry, and introduces a comprehensive community that integrates academic content, industry discussion, open-source solutions, and job opportunities [1][3].

Group 1: Community and Resources
- The community aims to cultivate future leaders in the autonomous driving field, providing a space for academic and engineering discussion [3].
- It has grown past 4,000 members and offers a blend of video content, articles, learning paths, Q&A, and job exchange [1][3].
- Resources range from a complete entry-level technical stack and roadmap for newcomers to industry frameworks and project proposals for active researchers [9][11].

Group 2: Learning and Development
- The community has compiled over 40 open-source projects and nearly 60 autonomous-driving datasets, along with mainstream simulation platforms and technical learning paths [16].
- Dedicated learning routes cover different aspects of autonomous driving, such as perception, simulation, and planning and control, for both beginners and advanced practitioners [16][17].
- A series of video tutorials covers topics such as sensor calibration, SLAM, decision-making, and trajectory prediction [5].

Group 3: Job Opportunities and Networking
- Internal referral channels with multiple autonomous driving companies help place members in jobs [5].
- Continuous job sharing and position updates form a complete ecosystem for autonomous driving professionals [13].
- Members can freely ask about career choices and research directions and receive guidance from experienced peers [82].
A senior student published an end-to-end VLA paper on his own and got into a TOP2 PhD program...
自动驾驶之心· 2025-08-21 11:24
Core Viewpoint
- The article describes a research guidance program on Vision-Language-Action (VLA) models for autonomous driving, aimed at helping students build research skills and produce publishable papers in the field [5][36].

Group 1: Program Overview
- The program comprises 12 weeks of online group research, 2 weeks of paper guidance, and 10 weeks of paper maintenance [15][36].
- It addresses common student problems such as lack of direction, weak hands-on skills, and difficulty writing and submitting papers [38].

Group 2: Course Structure
- The 14-week course runs from introductory lessons through advanced VLA models to paper-writing methodology [10][12][37].
- Key topics include traditional end-to-end autonomous driving, modular VLA models, and reasoning-enhanced VLA models [10][12][36].

Group 3: Target Audience and Requirements
- The program targets bachelor's, master's, and doctoral students who want to strengthen their research capabilities in autonomous driving and AI [16][36].
- Basic requirements include familiarity with deep learning, Python programming, and PyTorch [22][36].

Group 4: Course Benefits
- Participants study classic and cutting-edge papers, build coding skills, and learn methodologies for writing and revising papers [21][36].
- Each student receives a research idea, strengthening their ability to conduct independent research [21][36].

Group 5: Teaching Methodology
- The program uses a "2+1" teaching model: a main instructor plus additional support staff for comprehensive coverage [24][25].
- Continuous assessment and feedback mechanisms optimize the learning experience and address individual student needs [25][36].
Ningbo Dongfang University of Technology jointly-trained direct-entry PhD admissions! Robot manipulation, embodied intelligence, robot learning, and related directions
自动驾驶之心· 2025-08-21 09:04
Core Viewpoint
- The article describes a collaboration between Ningbo Dongfang University of Technology and prestigious institutions (Shanghai Jiao Tong University and University of Science and Technology of China) to recruit doctoral students in robotics, with a dual-mentorship model and a focus on cutting-edge robotics and AI research [1][2].

Group 1: Program Structure
- Students register at either Shanghai Jiao Tong University or University of Science and Technology of China for the first year, then conduct research at Dongfang University under dual supervision [1].
- Graduates receive a doctoral degree and diploma from either Shanghai Jiao Tong University or University of Science and Technology of China [1].

Group 2: Research Focus and Support
- Research areas span robotics, control, and AI, with topics such as contact-rich manipulation, embodied intelligence, agile robot control, and robot learning [2].
- The lab provides ample research funding and administrative support, and encourages a balanced lifestyle including physical exercise [2].

Group 3: Community and Networking
- The article promotes a knowledge-sharing community for embodied intelligence that aims to grow from 2,000 to 10,000 members within two years, hosting discussions on technical and career topics [3][5].
- The community offers technical roadmaps, job opportunities, and access to industry experts, strengthening networking and collaboration among members [5][18].

Group 4: Educational Resources
- The community has compiled extensive resources, including over 30 technical roadmaps plus open-source projects and datasets relevant to embodied intelligence and robotics [17][21][31].
- Members can access a variety of learning materials, including books and research reports, to support their academic and professional development [24][27].
NIO is hiring large-model / end-to-end algorithm engineers!
自动驾驶之心· 2025-08-20 23:33
Core Viewpoint
- The article highlights job opportunities and resources in autonomous driving and embodied intelligence, presenting a community platform for job seekers in these sectors.

Group 1: Job Opportunities
- The platform posts openings in algorithm development, product management, and internships related to autonomous driving and robotics [6][24].
- Community members include professionals from leading companies in the industry, giving job seekers a ready-made network [4][5].

Group 2: Resources and Support
- The community provides a wealth of resources, including interview questions, experience sharing, industry reports, and salary negotiation tips [9][15][19].
- Dedicated sections cover technical topics such as multi-sensor fusion, trajectory prediction, and occupancy perception, all important for interview preparation [10][14].

Group 3: Community Engagement
- With nearly 1,000 members, the platform facilitates discussion among people interested in autonomous driving and embodied intelligence [4][5].
- Collaboration and experience sharing help members avoid common pitfalls in the job application process [17][18].
VisionTrap: VLM + LLM teach the model to use visual features for better trajectory prediction
自动驾驶之心· 2025-08-20 23:33
Core Insights
- The article presents a novel method for trajectory prediction in autonomous driving that integrates visual inputs from surround cameras with textual descriptions to improve prediction accuracy [3][4][5].
- The approach addresses the limitations of traditional methods that rely solely on HD maps and historical trajectories, which often cannot adapt in real time to changing environments [5][6].
- A new dataset, nuScenes-Text, enriches existing datasets with textual annotations and demonstrates the positive impact of vision-language models (VLMs) on trajectory prediction [4][6][37].

Group 1: Methodology
- The model consists of four key components: a Per-agent State Encoder, a Visual Semantic Encoder, a Text-driven Guidance Module, and a Trajectory Decoder [7][10].
- The Per-agent State Encoder captures temporal features and spatial interactions among agents, using relative displacements and attention mechanisms [10][11].
- The Visual Semantic Encoder extracts image features from the environment and fuses them with agent features to improve prediction accuracy [14][16].

Group 2: Data and Training
- The nuScenes-Text dataset was created with fine-tuned VLMs and large language models (LLMs), which generate detailed textual descriptions for each agent across scenarios [37][39].
- Training uses multi-modal contrastive learning to align visual features with textual descriptions, improving the model's ability to extract relevant information from images [19][25].
- The objective maximizes similarity between positive pairs (an agent's features and its corresponding text) while minimizing similarity between negative pairs (features from different agents) [19][20].

Group 3: Experimental Results
- Trajectory prediction accuracy improves significantly, with gains of over 20% attributed to the Visual Semantic Encoder and the Text-driven Guidance Module [46][47].
- The model's performance was validated across the entire nuScenes dataset, confirming the contribution of each component to the prediction metrics [47][48].
- Integrating visual and textual information produced better clustering of agent state embeddings, indicating an improved understanding of agent behaviors [49][50].

Group 4: Conclusion
- The key innovation is using textual descriptions to guide the model in learning visual semantic features, thereby improving trajectory prediction accuracy [53][54].
- The article underscores the value of image information for trajectory prediction and the effectiveness of combining visual and textual data [54].
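The positive/negative pairing objective described here is commonly implemented as an InfoNCE-style contrastive loss. The sketch below is a generic NumPy version of that idea, with paired agent/text features on the diagonal as positives and everything else as negatives; the temperature value and function shape are assumptions, not VisionTrap's released code.

```python
import numpy as np

def info_nce(agent_feats, text_feats, temperature=0.07):
    """Generic InfoNCE-style contrastive loss: row i of agent_feats is
    the positive pair of row i of text_feats; all other rows act as
    negatives. Lower loss means better-aligned modalities."""
    a = agent_feats / np.linalg.norm(agent_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = a @ t.T / temperature                   # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # reward the positives

# Perfectly aligned pairs give near-zero loss; mismatched pairs do not.
aligned = info_nce(np.eye(4), np.eye(4))
shuffled = info_nce(np.eye(4), np.eye(4)[::-1])
```

Minimizing this loss pulls each agent's visual embedding toward its own description and away from the other agents' descriptions, which is exactly the pairing behavior the training objective above describes.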
New NVIDIA research: are small models the future of AI agents?
自动驾驶之心· 2025-08-20 23:33
Core Viewpoint
- The article argues that small language models are the future of agentic AI: they are more efficient and cost-effective than large models, which often waste resources on simple tasks [3][4][40].

Summary by Sections

Performance Comparison
- Small models can outperform large models on specific tasks: a 6.7-billion-parameter Toolformer surpassed the 175-billion-parameter GPT-3 [6].
- A 7-billion-parameter DeepSeek-R1-Distill model has likewise outperformed Claude 3.5 and GPT-4o [7].

Resource Optimization
- Small models make better use of hardware resources and task design, executing agent tasks more efficiently [9].
- They can share GPU resources efficiently, maintain performance isolation, and reduce memory usage, improving concurrency [11][12].
- Flexible GPU allocation improves overall throughput and cost control by prioritizing low-latency requests from small models [14].

Task-Specific Deployment
- Traditional agent tasks rarely require a single large model; specialized small models can handle specific sub-tasks, cutting resource waste and inference cost [20][23].
- Running a 7-billion-parameter small model is 10-30 times cheaper than using a 70-175-billion-parameter large model [24].

Challenges and Counterarguments
- Some researchers argue that large models retain superior general understanding even on specialized tasks [26].
- NVIDIA counters that small models can reach the required reliability through straightforward fine-tuning, and that advanced systems decompose complex problems into simpler sub-tasks, reducing the value of a large model's generality [27][28].

Economic Considerations
- While small models have lower per-inference costs, large models may benefit from economies of scale in large deployments [30].
- NVIDIA acknowledges this but notes that advances in inference scheduling and modular system design are improving flexibility and reducing infrastructure costs for small models [31].

Transitioning from Large to Small Models
- NVIDIA outlines a migration path from large to small models: adapting infrastructure, raising market awareness, and establishing evaluation standards [33].
- The process involves data collection, workload clustering, model selection, fine-tuning, and a feedback loop for continuous improvement [36][39].

Community Discussion
- Community discussion weighs the practicality of small versus large models; some users find small models more cost-effective for simple tasks [41].
- Others raise concerns about small-model robustness in unpredictable scenarios, suggesting careful trade-offs between functionality and complexity [43][46].
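The cost argument can be made concrete with a tiny routing sketch: score each sub-task's complexity, send simple ones to a small model, and escalate the rest. The model names, the complexity threshold, and the per-token prices below are all illustrative assumptions, not NVIDIA's figures.

```python
def route(task_tokens, complexity,
          small_cost_per_1k=0.0002, large_cost_per_1k=0.003):
    """Hypothetical complexity-based router: simple sub-tasks go to a
    small model, complex ones escalate to a large one. Returns the
    chosen (hypothetical) model name and the estimated cost in dollars."""
    model = "small-7b" if complexity < 0.5 else "large-frontier"
    rate = small_cost_per_1k if model == "small-7b" else large_cost_per_1k
    return model, task_tokens / 1000 * rate

# With these illustrative prices the large model costs 15x more per token,
# so routing most sub-tasks to the small model dominates total spend.
model, cost = route(2000, complexity=0.2)
```

A production system would replace the scalar `complexity` with a learned classifier over the workload clusters mentioned in the migration process above.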
VLM or VLA? Trends in multimodal large models for autonomous driving, seen through existing work
自动驾驶之心· 2025-08-20 23:33
Core Insights
- The article highlights the growing importance of foundational models, namely LLMs (Large Language Models), VLMs (Vision-Language Models), and VLAs (Vision-Language-Action Models), in autonomous driving decision-making, an area drawing significant attention from both academia and industry [2].

Summary by Categories

LLM-Based Approaches
- LLM-based methods leverage the reasoning capabilities of large models to describe autonomous driving, marking the early stage of integration between autonomous driving and large models [4].
- Notable research includes:
  - "Distilling Multi-modal Large Language Models for Autonomous Driving"
  - "LearningFlow: Automated Policy Learning Workflow for Urban Driving with Large Language Models"
  - "CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting"
  - "PADriver: Towards Personalized Autonomous Driving" [4][5]

VLM-Based Approaches
- VLM and VLA algorithms are currently mainstream because autonomous driving relies heavily on visual sensors; the article surveys the latest work in this area for reference and learning [8].
- Key studies include:
  - "Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning"
  - "FutureSightDrive: Visualizing Trajectory Planning with Spatio-Temporal CoT for Autonomous Driving" [8][9]

VLA-Based Approaches
- VLA methods integrate vision, language, and action for end-to-end autonomous driving, emphasizing adaptive reasoning and reinforcement fine-tuning [17].
- Significant contributions include:
  - "AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning"
  - "DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving" [17][21]