自动驾驶之心
DriveBench: Are VLMs Really Reliable in Autonomous Driving? (ICCV'25)
自动驾驶之心· 2025-08-07 23:32
Core Insights
- The article discusses advances in Vision-Language Models (VLMs) and their potential application in autonomous driving, focusing on the reliability and interpretability of the driving decisions that VLMs generate [3][5].
Group 1: DriveBench Overview
- DriveBench is introduced as a benchmark dataset designed to evaluate the reliability of VLMs across 17 settings, comprising 19,200 frames and 20,498 question-answer pairs [3].
- The framework covers four core autonomous-driving tasks (perception, prediction, planning, and behavior) and incorporates 15 types of Out-of-Distribution (OoD) scenarios to systematically test VLMs in complex driving environments [7][9].
Group 2: Presentation Details
- The article highlights a live presentation by Shaoyuan Xie, a PhD student at the University of California, Irvine, who will discuss the empirical study of VLMs and their readiness for autonomous driving [9].
- The presentation will cover an overview of VLMs in autonomous driving, the reliability assessment performed with DriveBench, and future prospects for VLM applications in the industry [9].
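As a rough illustration of what evaluating a VLM across several input settings involves, the sketch below aggregates per-task accuracy for each setting and reports the drop relative to clean input. The record layout, the setting names (`clean`, `fog`), the exact-match scoring, and the sample answers are all hypothetical; they are not DriveBench's actual data format or metrics.

```python
from collections import defaultdict

# Hypothetical evaluation records: (setting, task, model_answer, ground_truth).
# The layout and values are illustrative, not DriveBench's real format.
RECORDS = [
    ("clean", "perception", "A", "A"),
    ("clean", "perception", "B", "B"),
    ("clean", "planning", "C", "C"),
    ("clean", "planning", "A", "B"),
    ("fog", "perception", "A", "A"),
    ("fog", "perception", "C", "B"),
    ("fog", "planning", "A", "C"),
    ("fog", "planning", "A", "B"),
]


def accuracy_by_setting(records):
    """Return {setting: {task: accuracy}} using exact-match scoring."""
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for setting, task, answer, truth in records:
        totals[setting][task] += 1
        hits[setting][task] += int(answer == truth)
    return {
        s: {t: hits[s][t] / totals[s][t] for t in totals[s]}
        for s in totals
    }


def degradation(acc, baseline="clean"):
    """Accuracy drop of every non-baseline setting relative to the baseline."""
    return {
        s: {t: acc[baseline][t] - a for t, a in tasks.items()}
        for s, tasks in acc.items()
        if s != baseline
    }


acc = accuracy_by_setting(RECORDS)
drop = degradation(acc)
```

Reporting the drop per task, rather than a single averaged score, makes it easier to see which task degrades most under a given out-of-distribution condition.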
Fast-Slow Dual-System Evaluation! Bench2ADVLM: Designed Specifically for Autonomous-Driving VLMs (Nanyang Technological University)
自动驾驶之心· 2025-08-07 23:32
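Paper authors | Tianyuan Zhang et al. Editor | 自动驾驶之心
Preface & the author's personal take
Vision-language models (VLMs) have recently emerged as a promising paradigm in autonomous driving (AD). However, current evaluation protocols for VLM-based autonomous driving systems (ADVLMs) are largely confined to open-loop settings with static inputs. They overlook closed-loop settings, which are more realistic and informative and can capture interactive behavior, feedback resilience, and real-world safety. To address this, we introduce BENCH2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of ADVLMs on both simulation and physical platforms. Inspired by dual-process theories of cognition, we first adapt diverse ADVLMs to simulation environments through a dual-system adaptation architecture. In this design, the heterogeneous high-level driving commands generated by the target ADVLMs (the fast system) are interpreted by a general-purpose VLM (the slow system) into a form suitable for execution in simulation ...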
The 自动驾驶之心 End-to-End VLA Technical Exchange Group Has Launched
自动驾驶之心· 2025-08-07 23:32
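If you are interested, add the assistant on WeChat to join the group: AIDriver005, with the note "昵称+VLA加群" (your nickname plus "VLA加群"). The 自动驾驶之心 large-model VLA technical exchange group has been set up, and everyone is welcome to join and discuss end-to-end VLA topics, including VLA dataset creation, one-stage VLA, hierarchical VLA, large-model-based end-to-end solutions, VLM+DP-based solutions, mass-production deployment, and job hunting. ...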
自动驾驶之心 Is Hiring a Content Operations Intern! One-on-One Mentoring by a Partner (One Opening Only)
自动驾驶之心· 2025-08-07 12:00
Core Viewpoint
- The article emphasizes the importance of connecting academia and industry through technology content, focusing on cutting-edge fields such as autonomous driving, embodied intelligence, and large models [3].
Group 1: Company Mission and Focus
- The company aims to bridge the gap between academia and industry, facilitating communication among enterprises, educational institutions, and AI developers [3].
- The team is dedicated to providing the latest and most authoritative technical information across various platforms, including WeChat, Zhihu, and Bilibili [3].
- The focus areas include academic paper interpretation, industry production solutions, large model evaluations, business dynamics, industry recruitment, and open-source projects [3].
Group 2: Collaboration and Community Engagement
- The company has established deep collaborations with leading companies and relevant universities in the fields of autonomous driving and embodied intelligence [3].
- It is working to build partnerships quickly in the large model sector [3].
- The company encourages community engagement and co-creation in the AI field, sharing the joy of cognitive growth [3].
Group 3: Internship Opportunities
- The company is seeking interns with a background in autonomous driving, large models, or embodied intelligence, preferably at the master's level [6].
- Responsibilities include selecting, interpreting, and summarizing academic papers, building knowledge platforms, creating original videos, and managing data reviews [6].
- The company values strong execution, communication skills, and a passion for sharing advancements in technology [6].
Group 4: Application Process
- Interested candidates are encouraged to apply via email, providing a resume and personal introduction [8].
- The internship offers a combination of salary, mentorship, industry resource recommendations, and internal job referrals [8].
自动驾驶之心 Project and Paper Mentoring Is Here
自动驾驶之心· 2025-08-07 12:00
Core Viewpoint
- The article announces the launch of project and paper guidance from 自动驾驶之心 (Heart of Autonomous Driving), aimed at students who run into obstacles in their autonomous driving research and development [1].
Group 1: Project and Guidance Overview
- The program supports students who encounter difficulties in their research, such as environment configuration issues and debugging challenges [1].
- Last year's outcomes were positive, with several students publishing papers at top conferences such as CVPR and ICRA [1].
Group 2: Guidance Directions
- **Direction 1**: Multi-modal perception and computer vision, end-to-end autonomous driving, large models, and BEV perception. The mentor has published over 30 papers at top AI conferences, with more than 6,000 citations [3].
- **Direction 2**: 3D Object Detection, Semantic Segmentation, Occupancy Prediction, and multi-task learning based on images or point clouds. The mentor is a top-tier PhD with multiple publications at ECCV and CVPR [5].
- **Direction 3**: End-to-end autonomous driving, OCC, BEV, and world models. The mentor is also a top-tier PhD who has contributed to several mainstream perception solutions [6].
- **Direction 4**: NeRF / 3D GS neural rendering and 3D reconstruction. The mentor has published four CCF-A papers, two at CVPR and two in IEEE Transactions [7].
This 2,000-Member Embodied Intelligence Community Has Real Substance...
自动驾驶之心· 2025-08-07 09:52
Core Insights
- The article emphasizes the value of a community that provides solutions to problems in the field of embodied intelligence, highlighting the establishment of a comprehensive technical exchange platform for industry and academic discussions [3][17].
Group 1: Community and Resources
- The community has created a closed loop across industry, academia, job seeking, and Q&A exchanges, offering timely solutions and job opportunities [3][18].
- A compilation of over 30 technical routes is available, significantly reducing the time spent searching for benchmarks and learning paths [5][17].
- The community invites industry experts to answer questions and share insights, enhancing the learning experience [5][18].
Group 2: Educational Support
- For beginners, the community has organized numerous technical stacks and routes to ease entry into the field [12][18].
- For those already engaged in research, valuable industry frameworks and project proposals are provided [14][18].
- The community offers exclusive learning videos and documents to create an engaging educational environment [18].
Group 3: Job Opportunities and Networking
- The community has established a job referral mechanism with multiple embodied intelligence companies, facilitating direct connections for job seekers [11][18].
- Members can freely ask questions about career choices and research directions and receive guidance from peers [79][18].
Group 4: Research and Development
- The community has compiled a list of over 40 open-source projects and nearly 60 datasets related to embodied intelligence, aiding research and development efforts [17][31][37].
- Research directions and notable laboratories in embodied intelligence are summarized for those considering further studies [21][22].
Group 5: Technical Insights
- The community provides insights into technical topics such as robot simulation, data collection platforms, and the challenges of implementing VLA models [3][5][17].
- Detailed summaries of cutting-edge research papers and industry reports are available, keeping members informed about the latest developments [24][18].
A 10,000-Character Deep Dive! RAG in Practice, Fully Explained: A Year of Exploration
自动驾驶之心· 2025-08-07 09:52
Core Viewpoint
- The article discusses Retrieval Augmented Generation (RAG), a method that combines retrieval-based models and generative models to improve the quality and relevance of generated text. It addresses large-model issues such as hallucination, knowledge timeliness, and long text processing [1].
Group 1: Background and Challenges
- RAG was proposed by Meta in 2020 to let language models access external information beyond their internal knowledge [1].
- RAG faces three main challenges: retrieval quality, the augmentation process, and generation quality [2].
Group 2: Challenges in Retrieval Quality
- Semantic ambiguity can arise from vector representations, leading to irrelevant results [5].
- User input has become more complex, shifting from keywords to natural dialogue, which complicates retrieval [5].
- Document segmentation methods affect how well document chunks match user queries [5].
- Extracting and representing multimodal content (e.g., tables, charts) poses significant challenges [5].
- Integrating context from retrieved paragraphs into the current generation task is crucial for coherence [5].
- Redundancy and repetition in retrieved content can lead to duplicated information in generated outputs [5].
- Determining the relative importance of multiple retrieved paragraphs for the generation task is challenging [5].
- Over-reliance on retrieved content can exacerbate hallucination [5].
- Generated answers may be irrelevant to the query [5].
- Generated answers may contain toxicity or bias [5].
Group 3: Overall Architecture
- The product architecture consists of four layers: a model layer, an offline understanding layer, an online Q&A layer, and a scenario layer [7].
- The RAG framework is divided into three main components: query understanding, the retrieval model, and the generation model [10].
Group 4: Query Understanding
- The query understanding module aims to improve retrieval by interpreting user queries and generating structured queries [14].
- Intent recognition helps select relevant modules based on the user query [15].
- Query rewriting uses an LLM to rephrase user queries for better retrieval [16].
- Query expansion breaks complex questions into simpler sub-questions for more effective retrieval [22].
Group 5: Retrieval Model
- The retrieval model's effectiveness depends on the accuracy of the embedding model [33].
- Document loaders load document data from various sources [38].
- Text converters prepare documents for retrieval by segmenting them into smaller, semantically meaningful chunks [39].
- Document embedding models create vector representations of text to enable semantic search [45].
- Vector databases support efficient storage and search of embedded data [47].
Group 6: Generation Model
- The generation model uses the retrieved information to produce coherent responses to user queries [60].
- Different prompt assembly strategies are employed to improve response generation [62][63].
Group 7: Attribution Generation
- Attribution in RAG is crucial for aligning generated content with reference information and ensuring accuracy [73].
- Dynamic computation methods can improve the generation process by matching generated text with reference sources [76].
Group 8: Evaluation
- The article emphasizes the importance of defining metrics and evaluation methods for assessing RAG system performance [79].
- Evaluation frameworks such as RGB and RAGAS are introduced to benchmark RAG systems [81].
Group 9: Conclusion
- The article summarizes the key modules in RAG practice and highlights the need for continued research and development to refine these technologies [82].
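The retrieval and generation stages summarized above (chunking, embedding, similarity search, prompt assembly with numbered sources) can be sketched in a few lines. This is a minimal toy under stated assumptions, not the article's implementation: the bag-of-words `embed` function stands in for a real embedding model, an in-memory list stands in for a vector database, and every function name and the sample document are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=12):
    """Group whole sentences into chunks of up to max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, passages):
    """Assemble a prompt with numbered sources so the answer can cite them."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{sources}\n"
        f"Question: {query}"
    )


DOC = (
    "RAG combines a retriever with a generator. "
    "The retriever finds passages in a vector database. "
    "The generator writes an answer from the retrieved passages. "
    "Evaluation frameworks measure the quality of the answer."
)

question = "Where does the retriever find passages?"
chunks = chunk(DOC, max_words=10)
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
```

Numbering the sources in the prompt is one simple way to support the attribution step: the generated answer can refer to `[1]`, `[2]`, and so on, and those markers can then be checked against the retrieved passages.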
What Stage Has Motion Planning for Autonomous Driving Reached?
自动驾驶之心· 2025-08-06 23:34
Core Insights
- The article discusses advances in end-to-end (end2end) autonomous driving systems, noting the prominence of Bird's-Eye-View (BEV) frameworks and the challenges that remain in planning because interaction modeling is complex [2][40].
Group 1: Interaction Modeling
- Interaction modeling is identified as a critical area in planning. It involves game theory and uncertainty modeling, which current supervised learning methods struggle to address effectively [2][5].
- The report emphasizes incorporating ego and agent trajectories into loss functions or constraints to improve planning outcomes [2][5].
Group 2: Planning Frameworks
- Several frameworks for interactive planning are discussed, including POMDP, contingency planners, and game-theoretic approaches, with a focus on how to integrate interaction into the planning pipeline [5][40].
- The article outlines a typical interactive planning process: perturb the ego trajectories, predict the movements of all agents, and use dynamic programming to derive an optimal policy [6][12].
Group 3: Loss Functions and Constraints
- The planning loss function is described in detail. It includes terms for prediction accuracy and collision penalties between ego and agent trajectories [9][16].
- The article explains how interaction is modeled within the loss function so that agent predictions do not lead to collisions with the ego vehicle [9][16].
Group 4: Real-Time Optimization
- The article discusses latency issues in planning and proposes the Alternating Direction Method of Multipliers (ADMM) for real-time performance, reaching up to 125Hz with multiple agents [19][18].
- It highlights the need for efficient optimization techniques that reduce computation time in autonomous driving systems [19][18].
Group 5: Future Considerations
- The article questions the effectiveness of prediction-oriented methods in dynamic scenarios, suggesting that they may not adequately handle counterfactual situations in which agent behavior diverges from predictions [41][42].
- It discusses the need for improved prediction models and the potential of modular frameworks to improve trajectory optimization in autonomous vehicles [45][44].
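The interactive planning loop and collision-aware loss described above can be illustrated with a small sketch. It is a simplification under stated assumptions, not the article's formulation: candidates are scored exhaustively instead of with dynamic programming, agents follow a constant-velocity prediction that does not react to the ego vehicle, and the safety radius, weights, and all function names are hypothetical.

```python
import math

SAFE_DIST = 2.0      # assumed safety radius in metres
W_COLLISION = 10.0   # assumed weight of the collision penalty
W_DEVIATION = 1.0    # assumed weight of deviation from the reference lane (y = 0)


def ego_candidates(offsets, horizon=5, speed=1.0):
    """Perturb a straight reference path with constant lateral offsets."""
    return {
        off: [(speed * t, off) for t in range(1, horizon + 1)]
        for off in offsets
    }


def predict_agent(start, velocity, horizon=5):
    """Constant-velocity prediction; a stand-in for a learned predictor."""
    (x, y), (vx, vy) = start, velocity
    return [(x + vx * t, y + vy * t) for t in range(1, horizon + 1)]


def collision_penalty(ego, agent):
    """Squared hinge penalty, active when ego and agent are within SAFE_DIST."""
    return sum(
        max(0.0, SAFE_DIST - math.dist(e, a)) ** 2
        for e, a in zip(ego, agent)
    )


def total_cost(ego, agents):
    """Weighted sum of lane deviation and collision penalties over all agents."""
    deviation = sum(y * y for _, y in ego)
    collision = sum(collision_penalty(ego, a) for a in agents)
    return W_DEVIATION * deviation + W_COLLISION * collision


def plan(offsets, agents):
    """Score every perturbed ego trajectory and return the cheapest offset."""
    candidates = ego_candidates(offsets)
    costs = {off: total_cost(traj, agents) for off, traj in candidates.items()}
    return min(costs, key=costs.get), costs


# One slow agent ahead in the ego lane, moving forward at 0.5 m/s.
agents = [predict_agent(start=(2.0, 0.0), velocity=(0.5, 0.0))]
best_offset, costs = plan(offsets=[-1.5, 0.0, 1.5], agents=agents)
```

In this toy scene the in-lane candidate accumulates a large collision penalty, so a laterally offset trajectory is selected. The two offsets are symmetric and cost the same, and `min` simply returns the first one it encounters. A fuller treatment would let the agent prediction depend on the chosen ego trajectory, which is the interaction effect the article focuses on.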
Large-Model Solutions for Autonomous Driving: A Survey of Vision-Language Model (VLM) Work, for Mass Production and Research
自动驾驶之心· 2025-08-06 23:34
Core Insights
- The article emphasizes the transformative potential of Vision-Language Models (VLMs) in enhancing the perception and cognitive capabilities of autonomous driving systems, enabling them not only to "see" but also to "understand" complex driving environments [2][3].
Group 1: VLM Applications in Autonomous Driving
- VLMs can surpass traditional visual models by using camera images or video streams to understand the semantics of traffic scenes, for example recognizing a complex situation such as "a pedestrian waving to cross the street" [6].
- VLMs can convert intricate visual scenes into clear natural language descriptions, which makes the decisions of autonomous systems more interpretable, aids debugging, and increases trust among passengers and regulators [6].
- VLMs are crucial for natural language interaction in future smart cabins, allowing passengers to communicate their intentions to the vehicle through spoken commands [6].
Group 2: Scenario Generation and Testing
- The article introduces CrashAgent, a multi-agent framework that uses multi-modal large language models to convert accident reports into structured scenarios for simulation environments, addressing the long-tail distribution issue in existing datasets [7].
- CurricuVLM is proposed as a personalized curriculum learning framework that uses VLMs to analyze agent behavior and dynamically generate tailored training scenarios, improving safety in autonomous driving [13].
- TRACE is a framework that generates key test cases from real accident reports, significantly improving the efficiency of defect detection in autonomous driving systems [17].
Group 3: Out-of-Distribution (OOD) Scenario Generation
- A framework based on large language models is proposed to generate diverse OOD driving scenarios, addressing the sparsity of such scenarios in urban driving datasets [21][22].
- The article discusses a method that automatically converts real-world driving videos into detailed simulation scenarios, improving the testing of autonomous driving systems [26].
Group 4: Enhancing Safety and Robustness
- WEDGE is introduced as a synthetic dataset created with generative vision-language models, aimed at improving the robustness of perception systems in extreme weather conditions [39][40].
- LKAlert is a predictive alert system that uses VLMs to forecast potential lane-keeping assist (LKA) risks, improving driver situational awareness and trust [54][55].
Group 5: Advancements in Decision-Making Frameworks
- The CBR-LLM framework combines semantic scene understanding with case retrieval to improve decision-making in complex driving scenarios, raising accuracy and reasoning consistency [44][45].
- ORION is presented as a holistic end-to-end autonomous driving framework based on vision-language instructed action generation, achieving superior performance in closed-loop evaluations [69][70].
After the Noise Dies Down, the Li Auto i8 Will Earn a Very Strong Reputation
自动驾驶之心· 2025-08-06 23:34
Core Viewpoint
- The i8 is expected to receive positive market feedback and will likely reach significant sales volume faster than the MEGA did, although the exact timeframe is uncertain [5].
- The i8 is expected to lift MEGA orders and to capture some orders from the L7/L8/L9, while this year's L series orders are expected to be relatively average [5].
Group 1: Fundamental Premise
- The article discusses subconscious decision-making, suggesting that people often decide subconsciously before they rationalize the decision consciously [9].
- Three reasons are identified for the public's high expectations of immediate i8 sales: the impact of the YU7 launch, the company's previous sales success, and the i8's one-year delay [10].
Group 2: Public Opinion Dynamics
- The i8's design did not cater to the expectation of immediate sales, which left potential buyers who hoped to decide quickly dissatisfied [12].
- The i8's features take hands-on experience to appreciate, in contrast to the YU7's easily communicated highlights [11].
Group 3: Market Reactions and Future Outlook
- Concerns about a truck incident are discussed, with the article suggesting that these fears may not persist in the long term because of the underlying interests of the stakeholders involved [14].
- Despite initial dissatisfaction, many consumers are still placing orders for the i8, indicating that its perceived value may outweigh the initial concerns [15].
Group 4: Competitive Positioning
- The i8 is positioned as a more cost-effective option than the L9, offering superior features at a lower price point, which strengthens its appeal to consumers [17].
- The i8's top configuration is seen as a strong value proposition, especially compared with other models on the market, reinforcing its competitive edge [16].