NVIDIA's New Research: Are Small Models the Future of Agents?
自动驾驶之心· 2025-08-20 23:33
Core Viewpoint
- The article argues that small language models are the future of Agentic AI: they are more efficient and cost-effective than large models, which often waste resources on simple tasks [3][4][40].

Summary by Sections
Performance Comparison
- Small models can outperform large models on specific tasks: a 6.7-billion-parameter Toolformer surpassed the 175-billion-parameter GPT-3 [6].
- A 7-billion-parameter DeepSeek-R1-Distill model has likewise outperformed Claude 3.5 and GPT-4o [7].
Resource Optimization
- Small models optimize hardware use and task design, allowing more efficient execution of agent tasks [9].
- They can share GPU resources efficiently, maintain performance isolation, and reduce memory usage, improving concurrency [11][12].
- Flexible GPU allocation improves overall throughput and cost control by prioritizing low-latency requests from small models [14].
Task-Specific Deployment
- Typical agent tasks do not require a single large model; specialized small models can handle specific sub-tasks, reducing resource waste and inference cost [20][23].
- Running a 7-billion-parameter model is 10-30 times cheaper than using a 70-175-billion-parameter large model [24].
Challenges and Counterarguments
- Some researchers argue that large models retain superior general understanding even on specialized tasks [26].
- NVIDIA counters that small models can reach the required reliability through straightforward fine-tuning, and that advanced systems decompose complex problems into simpler sub-tasks, diminishing the importance of large models' generality [27][28].
Economic Considerations
- While small models have lower per-inference costs, large models may benefit from economies of scale in large deployments [30].
- NVIDIA acknowledges this but notes that advances in inference scheduling and modular system design are improving flexibility and reducing infrastructure costs for small models [31].
Transitioning from Large to Small Models
- NVIDIA outlines a migration method covering infrastructure adaptation, market awareness, and evaluation standards [33].
- The process involves data collection, workload clustering, model selection, fine-tuning, and a feedback loop for continuous improvement [36][39].
Community Discussion
- Community discussion weighs the practicality of small versus large models; some users find small models more cost-effective for simple tasks [41].
- Concerns about the robustness of small models in unpredictable scenarios are also raised, suggesting the trade-off between capability and complexity needs careful consideration [43][46].
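The cost argument above can be made concrete with a toy cost-aware router. This is a minimal sketch, not NVIDIA's method: the model names, per-token prices, and routing criterion below are all illustrative assumptions.

```python
# Hypothetical cost-aware router: dispatch agent sub-tasks to a small
# model by default, reserving the large model for open-ended reasoning.
# Model names and per-1k-token costs are illustrative, not real pricing.

from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # illustrative USD figure

SMALL = ModelTier("small-7b", 0.0002)
LARGE = ModelTier("large-175b", 0.0040)  # ~20x the small model's cost

def route(task: str, needs_broad_knowledge: bool) -> ModelTier:
    """Send narrow, well-specified sub-tasks to the small model;
    escalate only when broad general knowledge is required."""
    return LARGE if needs_broad_knowledge else SMALL

def job_cost(tokens: int, tier: ModelTier) -> float:
    return tokens / 1000 * tier.cost_per_1k_tokens

# A 10k-token workload as routine routed sub-tasks vs. one monolithic call:
routed = job_cost(10_000, route("extract fields from invoice", False))
monolithic = job_cost(10_000, LARGE)
print(f"routed: ${routed:.4f}, monolithic: ${monolithic:.4f}")
```

With these stand-in prices the routed workload costs a twentieth of the monolithic call, which is the shape of the 10-30x saving the article cites.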
VLM or VLA? Development Trends of Multimodal Large Models for Autonomous Driving, Seen Through Existing Work
自动驾驶之心· 2025-08-20 23:33
In recent years, foundation models represented by LLMs, VLMs, and VLAs have played an increasingly important role in autonomous driving decision-making, drawing growing attention from academia and industry. Many readers have asked whether a systematic, categorized overview exists. This article organizes foundation models for driving decision-making by model category, with further reviews of the related algorithms to follow.

LLM-based methods
LLM-based methods mainly use the reasoning ability of large models to describe autonomous driving. They belong to the early stage of combining large models with autonomous driving, but are still worth studying. Representative works include:
- Distilling Multi-modal Large Language Models for Autonomous Driving
- LearningFlow: Automated Policy Learning Workflow for Urban Driving with Large Language Models
- CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain ...
Traditional Perception Falls Out of Favor as VLA Becomes the Rising Star...
自动驾驶之心· 2025-08-20 09:15
Core Viewpoint
- The article discusses advances in Li Auto's VLA (Vision-Language-Action) driver model, highlighting its four core capabilities: spatial understanding, reasoning, communication and memory, and behavioral capability. It emphasizes the significance of VLA for autonomous driving, marking a shift in focus from traditional perception and planning tasks to large models and VLA technologies [2][4].

Summary by Sections
VLA Model Capabilities
- The VLA model integrates dynamic targets, static elements, navigation maps, and spatial understanding, showing more human-like reasoning. This positions VLA as a leading focus in both academia and industry for autonomous driving [2].
Shift in Research Focus
- Traditional perception and planning tasks are becoming less prominent at top conferences, with academia increasingly shifting toward large models and VLA. The industry, however, continues to optimize traditional methods, so opportunities remain in both areas [4].
Educational Program
- An educational program is introduced to help students systematically master key VLA theory, strengthen practical coding skills, and develop their own research ideas. It consists of a structured 12-week online group research course, followed by 2 weeks of paper guidance and a 10-week maintenance period [5][34].
Course Structure
- The course spans 14 weeks, from introductory lessons to advanced VLA models and paper-writing methodology. Each week covers a different aspect of VLA and autonomous driving, culminating in a final project report and submission guidance [8][10][35].
Target Audience
- The program targets master's and doctoral students working on VLA and autonomous driving, applicants looking to strengthen their profiles for further study abroad, and AI and autonomous-driving professionals who want to deepen their algorithmic knowledge [14][24].
Course Requirements
- Participants should have a foundational understanding of deep learning, basic Python programming skills, and familiarity with PyTorch. Access to high-performance computing resources is recommended [20][21].
Course Highlights
- The program uses a "2+1" teaching model with experienced instructors, providing comprehensive support throughout. It emphasizes academic integrity and includes a structured evaluation system [22][23].
This Week's Selected Autonomous Driving Papers: End-to-End, VLA, Perception, Decision-Making, and More
自动驾驶之心· 2025-08-20 03:28
Core Viewpoint
- The article surveys recent advances in autonomous driving research, highlighting innovative approaches and frameworks that improve autonomous systems' capabilities in dynamic environments [2][4].

Group 1: End-to-End Autonomous Driving
- Notable end-to-end papers include GMF-Drive, ME³-BEV, SpaRC-AD, IRL-VLA, and EvaDrive, which employ techniques such as gated fusion, deep reinforcement learning, and evolutionary adversarial strategies [8][10].
Group 2: Perception and VLM
- The VISTA paper introduces a vision-language model for predicting driver attention in dynamic environments, integrating visual and language processing for better situational awareness [7].
- Safety-critical perception work is also covered, including a progressive BEV perception survey and the CBDES MoE model for functional module decoupling [10].
Group 3: Simulation Testing
- The ReconDreamer-RL framework enhances reinforcement learning through diffusion-based scene reconstruction, pointing to more sophisticated simulation-testing methodology [11].
Group 4: Datasets
- The STRIDE-QA dataset is introduced as a large-scale visual question answering resource for spatiotemporal reasoning in urban driving, reflecting the growing need for comprehensive datasets [12].
For Rising Second- and Third-Year Graduate Students, the Top Priority Is Writing Papers to Strengthen Graduation and Job Prospects...
自动驾驶之心· 2025-08-19 23:32
Core Viewpoint
- The article stresses the importance of planning and goal-setting for graduate students, focusing on the critical milestones of each academic year, such as publishing papers and meeting graduation requirements [3].

Group 1: Goals for Graduate Students
- For students moving from the first to the second year, the core challenge is publishing short papers [3].
- Third-year students should focus on completing the thesis, fulfilling graduation requirements, and job hunting [3].
- Second-year students are advised to finish a first paper draft by year's end, leaving ample time for the thesis and job search [3].
Group 2: Support Services Offered
- The company offers a comprehensive one-on-one tutoring service covering the full process from topic selection to submission and revision, partnering with over 200 top-ranked educators worldwide [4].
- Over the past three years the service has guided more than 400 students, with a 96% acceptance rate for submitted papers [4].
- It aims to address common problems such as lack of guidance and unclear research direction [4].
Group 3: Target Audience for Tutoring
- The service is designed for computer science graduate students seeking to strengthen research skills, accumulate experience, and improve their academic profiles [11].
- It also suits professionals pursuing careers in artificial intelligence or related fields, and students preparing for further study or job applications [11].
Group 4: Tutoring Process and Benefits
- The process includes personalized guidance, real-time interaction with mentors, and recorded sessions for review, ensuring flexibility and comprehensive support [9].
- Students can expect to build a solid research foundation, become familiar with research workflows, and improve practical skills [10].
- Additional benefits include potential recommendations from prestigious institutions and direct job referrals to leading companies for outstanding students [15].
The Company Announced Team Cuts; Those Who Understand End-to-End Stayed...
自动驾驶之心· 2025-08-19 23:32
Core Viewpoint
- The article discusses the rapid evolution of, and challenges in, end-to-end autonomous driving, emphasizing that success in this competitive industry requires a comprehensive grasp of the relevant algorithms and models [2][4][6].

Group 1: Industry Trends
- The shift from modular pipelines to end-to-end systems aims to eliminate cumulative errors between modules, a significant technological leap [2].
- The emergence of algorithms and models such as UniAD and BEV perception reflects a growing focus on integrating multiple tasks into a unified framework [4][9].
- Demand for knowledge of multimodal large models, reinforcement learning, and diffusion models is rising, reflecting the industry's need for versatile skill sets [5][20].
Group 2: Learning Challenges
- Newcomers struggle with fragmented knowledge and an overwhelming volume of papers, which often leads to abandoning study early [5][6].
- The lack of high-quality documentation and practical guidance further complicates the move from theory to practice in end-to-end research [5][6].
Group 3: Course Offerings
- A new course, "End-to-End and VLA Autonomous Driving," addresses these challenges, combining practical applications with theoretical foundations [6][24].
- The course provides a comprehensive view of end-to-end algorithms, including their historical development and current trends [11][12].
- Practical components, such as real-world projects and assignments, ensure participants can apply what they learn [8][21].
Group 4: Course Content Overview
- Topics include an introduction to end-to-end algorithms, background on the relevant technologies, and detailed treatment of both one-stage and two-stage methods [11][12][13].
- Dedicated chapters cover advanced topics such as world models and diffusion models, which are crucial for understanding the latest advances [15][17][20].
- The final project applies reinforcement learning from human feedback (RLHF), giving participants hands-on experience [21].
Fudan's Latest LMAD: Toward Interpretable End-to-End VLMs
自动驾驶之心· 2025-08-19 23:32
Core Viewpoint
- The article presents the LMAD framework, which significantly improves the reasoning performance of vision-language models (VLMs) in autonomous driving by addressing existing limitations in scene understanding and spatial perception [2][3].

Existing Method Limitations
- Current VLM-based driving methods face two key issues: fragmented scene understanding, which relies on intermediate results and fails to capture relationships between traffic elements, and weak spatial and motion perception, which leads to accumulated errors during inference [4].
Innovations of LMAD
- A Preliminary Interaction (PI) mechanism models initial relationships among traffic participants, reducing the learning burden on the VLM [6].
- A task-specific expert structure uses parallel LoRA (P-LoRA) modules to focus the VLM on specific tasks such as perception, prediction, and planning [6].
- End-to-end system integration injects prior knowledge from end-to-end driving systems, enriching spatial and motion information for stronger reasoning [6].
Overall Framework
- LMAD couples an end-to-end driving pipeline with a vision-language model and consists of three main components: the VLM for image and text token processing, a PI encoder for multi-view images, and P-LoRA modules for task-specific knowledge [8][10].
Key Module Design
- The PI encoder tackles redundancy in multi-view image processing with decoupled queries and an alternating attention mechanism [12][15].
- The P-LoRA design provides multiple parallel branches corresponding to different driving tasks, improving adaptability [16].
Training Strategy
- Training proceeds in two modes: single-branch fine-tuning, where only the language branch is adjusted, and joint training, which optimizes text generation and end-to-end tasks simultaneously [18].
Experimental Results
- On the DriveLM benchmark, LMAD significantly improved baseline VLMs, with accuracy gains of 3.44% for LLaMA-Adapter and 3.89% for GPT [20].
- On the nuScenes-QA test, LMAD improved overall accuracy by 2.57% over the baseline [25].
Ablation Studies
- Ablations confirmed the effectiveness of the PI encoder, P-LoRA, and end-to-end tokens; the full configuration achieved the highest final score of 57.17 [28].
- The task-oriented P-LoRA design outperformed alternative configurations across metrics [28].
Qualitative Analysis
- In perception tasks, LMAD accurately identified key targets, though it struggled with less obvious signs [34].
- In prediction tasks, LMAD usefully informed subsequent planning despite discrepancies between predicted and actual targets [34].
- In planning tasks, LMAD produced driving behaviors consistent with the current environment by leveraging historical context [34].
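The parallel-LoRA idea, one shared frozen weight plus a small low-rank branch per driving task, can be sketched at the level of a single linear layer. Everything below (dimensions, rank, task names used as routing keys) is an illustrative assumption, not the LMAD implementation:

```python
# Shape-level sketch of a parallel-LoRA ("P-LoRA") linear layer: one frozen
# base weight shared by all tasks, plus one low-rank (A, B) pair per task.
import numpy as np

class PLoRALinear:
    def __init__(self, d_in: int, d_out: int, tasks: list, rank: int = 4):
        rng = np.random.default_rng(0)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
        # One low-rank pair per driving task (e.g. perception/prediction/planning).
        self.branches = {
            t: (rng.standard_normal((rank, d_in)) * 0.02,   # A: down-projection
                np.zeros((d_out, rank)))                    # B: zero-init up-projection
            for t in tasks
        }

    def forward(self, x: np.ndarray, task: str) -> np.ndarray:
        A, B = self.branches[task]
        # Base path plus the task-specific low-rank update: (W + B A) x
        return self.W @ x + B @ (A @ x)

layer = PLoRALinear(16, 8, ["perception", "prediction", "planning"])
x = np.ones(16)
# With B zero-initialized, every branch starts as a no-op update:
print(np.allclose(layer.forward(x, "perception"), layer.W @ x))
```

During fine-tuning only the selected branch's (A, B) would receive gradients, which is what lets each task specialize without touching the shared weight.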
Latest Survey: A Comprehensive Review of Diffusion Language Models
自动驾驶之心· 2025-08-19 23:32
Core Viewpoint
- The article examines the competition between two major generative-AI paradigms, diffusion models and autoregressive (AR) models, and highlights the emergence of Diffusion Language Models (DLMs) as a potential breakthrough for large language models [2][3].

Group 1: DLM Advantages Over AR Models
- DLMs generate tokens in parallel, achieving roughly tenfold inference speedups over AR models, which are limited by token-by-token serial decoding [11][12].
- DLMs use bidirectional context, enhancing language understanding and generation control and allowing finer adjustment of output characteristics such as sentiment and structure [12][14].
- The iterative denoising mechanism lets DLMs correct errors during generation, reducing the accumulation of early mistakes that limits AR models [13].
- DLMs are naturally suited to multimodal applications, integrating text and visual data without separate modules and improving joint generation quality [14].
Group 2: Technical Landscape of DLMs
- DLMs fall into three paradigms: continuous-space DLMs, discrete-space DLMs, and hybrid AR-DLMs, each with distinct strengths and applications [15][20].
- Continuous-space DLMs reuse established diffusion techniques from image models but can lose semantics during the embedding step [20].
- Discrete-space DLMs operate directly at the token level, preserving semantic integrity and simplifying inference; they are the mainstream approach at large parameter scales [21].
- Hybrid AR-DLMs combine the strengths of both, balancing efficiency and quality for tasks requiring high coherence [22].
Group 3: Training and Inference Optimization
- DLMs use transfer learning to cut training cost, for example by initializing from AR models or image diffusion models, substantially lowering data requirements [30][31].
- Inference optimization proceeds along three main directions: parallel decoding, masking strategies, and efficiency technologies, all aimed at improving speed and quality [35][38].
- Techniques such as confidence-aware decoding and dynamic masking are key innovations for raising output quality while keeping inference fast [38][39].
Group 4: Multimodal Applications and Industry Impact
- DLMs are increasingly applied in multimodal settings, unifying text and visual processing and strengthening capabilities such as visual reasoning and joint content creation [44].
- Case studies demonstrate DLM effectiveness in high-value vertical applications such as code generation and computational biology [46].
- DLMs are positioned as a transformative technology, with applications ranging from real-time code generation to complex molecular design [46][47].
Group 5: Challenges and Future Directions
- Key challenges include the trade-off between parallelism and performance, infrastructure limitations, and scalability relative to AR models [49][53].
- Proposed research directions focus on improved training objectives, dedicated toolchains, and better long-sequence processing [54][56].
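The confidence-aware parallel decoding described above can be illustrated with a toy masked-diffusion loop. This is a sketch of the mechanism only: the confidence scores are random stand-ins for a real model's predictions, and committing two positions per round is an arbitrary choice.

```python
# Toy sketch of confidence-aware parallel decoding in a masked (discrete)
# diffusion LM: each round, every masked position gets a confidence score
# and the most confident positions are committed, so several tokens are
# produced per step rather than one per step as in autoregressive decoding.
import random

MASK = "<mask>"

def toy_scores(tokens):
    """Stand-in for model confidences: a random score per masked slot."""
    rng = random.Random(42)
    return {i: rng.random() for i, t in enumerate(tokens) if t == MASK}

def denoise(tokens, fill, commits_per_round=2):
    tokens = list(tokens)
    rounds = 0
    while MASK in tokens:
        scores = toy_scores(tokens)
        # Commit the highest-confidence masked positions this round.
        top = sorted(scores, key=scores.get, reverse=True)[:commits_per_round]
        for i in top:
            tokens[i] = fill[i]
        rounds += 1
    return tokens, rounds

target = ["the", "car", "stops", "at", "the", "light"]
out, rounds = denoise([MASK] * len(target), target)
print(out, rounds)  # fully unmasked in 3 rounds (6 masks / 2 per round)
```

A real DLM would resample low-confidence positions instead of reading the fill from a target, but the round structure, and why it beats serial decoding on latency, is the same.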
The Starting Point of End-to-End VLA: On Large Language Models and CLIP
自动驾驶之心· 2025-08-19 07:20
Core Viewpoint
- The article discusses the development and significance of end-to-end (E2E) algorithms in autonomous driving, emphasizing how large language models (LLMs), diffusion models, and reinforcement learning (RL) are being integrated to enhance autonomous systems [21][31].

Summary by Sections
Section 1: Overview of End-to-End Autonomous Driving
- The first chapter surveys the evolution of end-to-end algorithms, explaining the transition from modular approaches to end-to-end solutions and the advantages and challenges of each paradigm [40].
Section 2: Background Knowledge
- The second chapter covers the technical stack around end-to-end systems, detailing the roles of LLMs, diffusion models, and reinforcement learning, which matter for the future job market in this field [41][42].
Section 3: Two-Stage End-to-End Systems
- The third chapter examines two-stage end-to-end systems, their emergence, advantages, and drawbacks, and reviews notable works such as PLUTO and CarPlanner [42][43].
Section 4: One-Stage End-to-End and VLA
- The fourth chapter covers one-stage end-to-end systems, including perception-based methods and the latest advances in VLA (Vision-Language-Action) models, which are pivotal to the ultimate goals of autonomous driving [44][50].
Section 5: Practical Application and RLHF Fine-Tuning
- The fifth chapter centers on a major project on RLHF (Reinforcement Learning from Human Feedback) fine-tuning, with practical guidance on building pre-training and reinforcement-learning modules applicable to VLA-related algorithms [52].
Course Structure and Learning Outcomes
- The course aims to give participants a solid understanding of end-to-end autonomous driving, covering the essential frameworks and methodologies and preparing them for industry roles [56][57].
With Everyone Doing End-to-End, Is There Still a Future for Trajectory Prediction?
自动驾驶之心· 2025-08-19 03:35
Core Viewpoint
- The article emphasizes the importance of trajectory prediction for autonomous driving and the continued relevance of traditional two-stage and modular methods despite the rise of end-to-end approaches. It notes that coupling trajectory prediction models with perception models is itself a form of end-to-end training, a significant area of research and application in industry [1][2].

Group 1: Trajectory Prediction Methods
- Multi-agent trajectory prediction forecasts future movements from the historical trajectories of multiple interacting agents, which is crucial for autonomous driving, intelligent monitoring, and robot navigation [1].
- Predicting human behavior is hard because it is uncertain and multimodal; traditional methods typically model social interaction with recurrent, convolutional, or graph neural networks [1].
- Diffusion models have advanced trajectory prediction, with models such as the Leapfrog Diffusion Model (LED) and Mixed Gaussian Flow (MGF) significantly improving accuracy and efficiency across datasets [2].
Group 2: Course Objectives and Structure
- The course provides a systematic understanding of trajectory prediction and diffusion models, helping participants combine theory with practical coding and ultimately develop new models and research papers [6][8].
- It is designed for people at various academic levels interested in trajectory prediction and autonomous driving, offering insight into cutting-edge research and algorithm design [8].
- Participants get access to classic and cutting-edge papers, code implementations, and methodology for writing and submitting research papers [8][9].
Group 3: Course Highlights and Requirements
- The course uses a "2+1" teaching model with experienced instructors and dedicated support staff [16][17].
- Participants need a foundational understanding of deep learning and proficiency in Python and PyTorch [10].
- The curriculum covers datasets, baseline code, and essential papers, giving a thorough grounding in trajectory prediction techniques [20][21][23].
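The diffusion-based trajectory predictors discussed above (LED, MGF) share one core loop: start from Gaussian noise over future waypoints and denoise iteratively toward a plausible path. The sketch below illustrates only that loop, not either paper's algorithm: the "denoiser" is a stand-in that pulls samples toward a constant-velocity extrapolation of the observed history, where a trained network would go.

```python
# Minimal diffusion-style trajectory sampling: noise -> iterative denoising.
import numpy as np

def constant_velocity_guess(history: np.ndarray, horizon: int) -> np.ndarray:
    """Extrapolate the last observed velocity over the prediction horizon."""
    v = history[-1] - history[-2]
    return history[-1] + v * np.arange(1, horizon + 1)[:, None]

def sample_trajectory(history, horizon=8, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    target = constant_velocity_guess(history, horizon)  # stand-in denoiser
    traj = rng.standard_normal((horizon, 2))            # pure-noise init
    for t in range(steps):
        # Move part-way toward the denoised estimate, with shrinking noise:
        # the usual coarse-to-fine denoising schedule.
        noise_scale = 1.0 - (t + 1) / steps
        traj = (traj + 0.2 * (target - traj)
                + noise_scale * 0.05 * rng.standard_normal(traj.shape))
    return traj

history = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])  # agent moving +x
pred = sample_trajectory(history)
print(np.round(pred[:2], 1))  # early waypoints near (3, 0) and (4, 0)
```

Because sampling starts from noise, rerunning with different seeds yields different but plausible futures, which is exactly the multimodality that makes diffusion attractive for this task.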