DriveVLM
Some Recent Takeaways from Working on VLA
自动驾驶之心· 2025-12-11 00:05
Core Insights
- The article discusses the challenges and advancements in Vision-Language Models (VLM) for autonomous driving, highlighting issues such as hallucination, 3D spatial understanding, and processing speed [3].

Group 1: Challenges in VLM
- Hallucination issues manifest as generating non-existent information and failing to perceive relevant data, which can be mitigated through dynamic perception techniques [3].
- Insufficient 3D spatial understanding is attributed to pre-training tasks being predominantly 2D, suggesting the incorporation of spatial localization tasks during training [3].
- Processing speed is a concern, with potential solutions including KV Cache, visual token compression, and mixed data training to enhance model efficiency [3].

Group 2: Learning Paradigms and Model Improvements
- The learning paradigm should shift from imitation learning (SFT) to preference learning (DPO, GRPO), with simultaneous multi-task training yielding better results than sequential single-task training [3].
- To prevent catastrophic forgetting in foundation models, adding pre-training data is a simple and effective method [3].
- Enhanced supervisory signals can lead to better model representations, achieved by adding auxiliary task heads to the VLM model [3].

Group 3: Interaction and Evaluation
- Current VLMs exhibit insufficient interaction between vision and language, limiting their effectiveness as base models; improving this interaction is crucial [3].
- The output method for trajectories is flexible, with various approaches yielding satisfactory results, though diffusion heads are preferred in industry for speed [3].
- Evaluation remains challenging due to inconsistencies between training and testing conditions, necessitating better alignment of objectives and data distributions [3].
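The shift from imitation learning to preference learning mentioned above can be illustrated with the DPO objective. The sketch below is a stdlib-only toy operating on scalar sequence log-probabilities; the values and the `beta` choice are illustrative assumptions, not taken from the article:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    The policy is rewarded for widening the log-probability margin of the
    preferred trajectory over the rejected one, measured relative to a
    frozen reference model; beta controls deviation from the reference.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): shrinks as the chosen margin grows
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# No preference learned yet: loss is log(2); a clear preference lowers it.
loss_start = dpo_loss(-5.0, -5.0, -5.0, -5.0)
loss_trained = dpo_loss(-2.0, -8.0, -5.0, -5.0)
```

In a real training loop these scalars would be summed token log-probabilities from the policy and a frozen reference VLM, and the loss would be averaged over a batch of preference pairs.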
World Models and VLA Are Gradually Converging Toward Unification
自动驾驶之心· 2025-10-31 00:06
Core Viewpoint
- The integration of Vision-Language Action (VLA) and World Model (WM) technologies is becoming increasingly evident, suggesting a trend towards unification rather than opposition in the field of autonomous driving [3][5][7].

Technology Development Trends
- Recent discussions highlight that VLA and WM should not be seen as mutually exclusive but rather as complementary technologies that can enhance the development of General Artificial Intelligence (AGI) [3].
- The combination of VLA and WM is supported by various academic explorations, including models like DriveVLA-W0, which demonstrate the feasibility of their integration [3].

Industry Insights
- The ongoing debate within the industry regarding VLA and WA (World Action) is more about different promotional narratives than fundamental technological differences [7].
- Tesla's recent presentations at ICCV are expected to influence domestic perspectives on the integration of VLA and WA [7].

Community and Learning Resources
- The "Autonomous Driving Heart Knowledge Planet" community has been established to provide a comprehensive platform for learning and sharing knowledge in the autonomous driving sector, with over 4,000 members and plans to expand to nearly 10,000 [10][23].
- The community offers a variety of resources, including video content, learning routes, and Q&A sessions, aimed at both beginners and advanced practitioners in the field [10][12][28].

Technical Learning Paths
- The community has compiled over 40 technical learning routes covering various aspects of autonomous driving, including perception, simulation, planning, and control [24][44].
- Specific learning paths are available for newcomers, including full-stack courses suitable for those with no prior experience [20][17].

Networking and Career Opportunities
- The community facilitates connections between members and industry leaders, providing job referral mechanisms and insights into career opportunities within the autonomous driving sector [19][10].
- Members can engage in discussions about research directions, job choices, and industry trends, fostering a collaborative environment for knowledge exchange [97][101].
A Collection of GitHub Repositories and Papers from Li Auto's Autonomous Driving Team
理想TOP2· 2025-10-17 13:44
Core Viewpoint
- The article emphasizes the advancements in autonomous driving technology by Li Auto, focusing on innovative solutions to enhance safety, efficiency, and sustainability in transportation [1].

Group 1: Autonomous Driving Technologies
- The company is developing a large language model (LLM) to interpret complex driving scenarios, enabling smarter and quicker responses from autonomous vehicles [2].
- A world model project aims to simulate real driving environments for testing and improving autonomous driving algorithms under various conditions [3].
- The 3D Gaussian Splatting (3DGS) scene-understanding project focuses on creating detailed 3D maps of urban environments to enhance the perception systems of autonomous vehicles for better navigation and decision-making [4].
- The company is pioneering an end-to-end neural network model that simplifies the entire processing flow from perception to execution in autonomous driving systems [5].

Group 2: Research and Development Projects
- DriveVLM is a dual-system architecture combining end-to-end and vision-language models for autonomous driving [7].
- TOP3Cap is a dataset that describes autonomous driving street scenes in natural language, containing 850 outdoor scenes, over 64,300 objects, and 2.3 million textual descriptions [7].
- StreetGaussians presents an efficient method for creating realistic, dynamic urban street models for autonomous driving scenarios [8].
- DiVE is a model based on the Diffusion Transformer architecture that generates temporally and multi-view consistent videos matching given bird's-eye-view layouts [8].
- GaussianAD utilizes sparse yet comprehensive 3D Gaussians to represent and convey scene information, addressing the trade-off between information completeness and computational efficiency [8].
- 3DRealCar is a large-scale real-world 3D car dataset containing 2,500 cars scanned in 3D, with an average of 200 dense RGB-D views per car [8].
- DriveDreamer4D employs a video generation model as a data engine to create video data of vehicles executing complex maneuvers, supplementing real data [8].
- DrivingSphere combines 4D world modeling and video generation technologies to create a generative closed-loop simulation framework [8].
- StreetCrafter is a video diffusion model designed for street scene synthesis, utilizing precise lidar data for pixel-level control [8].
- GeoDrive generates highly realistic, temporally consistent driving scene videos using 3D geometric information [10].
- LightVLA is the first adaptive visual token pruning framework that enhances the success rate and operational efficiency of robot VLA models [10].
A Quick, Structured Deep Dive into Li Auto's AI / Autonomous Driving / VLA Handbook
理想TOP2· 2025-10-10 11:19
Core Insights
- The article discusses the evolution of Li Xiang's vision for Li Auto, emphasizing the transition from a traditional automotive company to an artificial intelligence (AI) company, driven by the belief in the transformative potential of AI and autonomous driving [1][2].

Motivation
- Li Xiang considers founding Autohome his biggest mistake and aims to build a venture at least ten times larger [1].
- The belief in the feasibility of autonomous driving and the industry's transformative phase motivated the establishment of Li Auto [1].

Timeline of Developments
- In September 2022, Li Auto internally defined itself as an AI company [2].
- On January 28, 2023, Li Xiang officially announced the company's identity as an AI company [2].
- By March 2023, discussions around AI began, although initial understanding of concepts like pretraining and finetuning was limited [2].
- By December 2024, Li Xiang articulated five key judgments regarding AI's role and potential, emphasizing the importance of foundational models [2][3].

Key Judgments
- Judgment 1: Li Xiang believes in OpenAI's five stages of AI, asserting that AI will democratize knowledge and capabilities [2].
- Judgment 2: The foundational model is seen as the operating system of the AI era, crucial for developing super products [2].
- Judgment 3: Current efforts are aimed at achieving Level 3 (L3) autonomous driving and securing a ticket to Level 4 (L4) [2][3].
- Judgment 4: The integration of large language models with autonomous driving will create a new entity termed VLA [3].
- Judgment 5: Li Auto aims to produce a car without a steering wheel within three years, contingent on the VLA foundational model and sufficient resources [3].

Technical Insights
- The design and training of the VLA foundational model focus on 3D spatial understanding and reasoning capabilities [5][6].
- Sparse modeling techniques are employed to enhance efficiency without significantly increasing computational load [7].
- The model incorporates future frame prediction and dense depth prediction tasks to mimic human thought processes [8].
- The use of diffusion techniques allows for real-time trajectory generation and enhances the model's ability to predict complex traffic scenarios [10].

Reinforcement Learning
- The company aims to surpass human driving capabilities through reinforcement learning, addressing previous limitations in model training and interaction environments [11].

Future Directions
- Li Auto is actively developing various models and frameworks to enhance its autonomous driving capabilities, including the introduction of new methodologies for video generation and scene reconstruction [12][13].
Latest from XJTLU & HKUST: A Survey of Foundation Models for Trajectory Prediction
自动驾驶之心· 2025-09-24 23:33
Core Insights
- The article discusses the application of large language models (LLMs) and multimodal large language models (MLLMs) in the paradigm shift for autonomous driving trajectory prediction, enhancing the understanding of complex traffic scenarios to improve safety and efficiency [1][20].

Summary by Sections

Introduction and Overview
- The integration of LLMs into autonomous driving systems allows for a deeper understanding of traffic scenarios, transitioning from traditional methods to approaches based on large foundation models (LFMs) [1].
- Trajectory prediction is identified as a core technology in autonomous driving, utilizing historical data and contextual information to infer future movements of traffic participants [5].

Traditional Methods and Challenges
- Traditional vehicle trajectory prediction methods include physics-based approaches (e.g., Kalman filters) and machine learning methods (e.g., Gaussian processes), which struggle with complex interactions [8].
- Deep learning methods improve long-term prediction accuracy but face challenges such as high computational demands and poor interpretability [9].
- Reinforcement learning methods excel in interactive scene modeling but are complex and unstable [9].

LLM-Based Vehicle Trajectory Prediction
- LFMs introduce a paradigm shift by discretizing continuous motion states into symbolic sequences, leveraging LLMs' semantic modeling capabilities [11].
- Key applications of LLMs include trajectory-language mapping, multimodal fusion, and constraint-based reasoning, enhancing interpretability and robustness in long-tail scenarios [11][13].

Evaluation Metrics and Datasets
- The article categorizes datasets for pedestrian and vehicle trajectory prediction, highlighting the importance of datasets like Waymo and ETH/UCY for evaluating model performance [16].
- Evaluation metrics for vehicles include L2 distance and collision rates, while pedestrian metrics focus on minADE and minFDE [17].
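The minADE/minFDE metrics cited above have a standard form: the best of K predicted trajectory modes is scored against the ground truth. The sketch below follows one common convention; benchmarks differ in details such as the number of modes and the prediction horizon, and the sample numbers are invented:

```python
import math

def min_ade_fde(predictions, ground_truth):
    """minADE / minFDE over K predicted trajectory modes.

    predictions: K trajectories, each a list of (x, y) points
    ground_truth: one trajectory of the same length
    minADE: smallest mean pointwise L2 error across the K modes
    minFDE: smallest final-point L2 error across the K modes
    """
    def l2(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    ades = [sum(l2(p, q) for p, q in zip(traj, ground_truth)) / len(ground_truth)
            for traj in predictions]
    fdes = [l2(traj[-1], ground_truth[-1]) for traj in predictions]
    return min(ades), min(fdes)

gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # offset by 1 m everywhere
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)],  # only the endpoint misses
]
min_ade, min_fde = min_ade_fde(preds, gt)
```

Note that minADE and minFDE can be achieved by different modes; each metric independently takes the best mode for that criterion.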
Performance Comparison
- A performance comparison of various models on the NuScenes dataset shows that LLM-based methods significantly reduce collision rates and improve long-term prediction accuracy [18].

Discussion and Future Directions
- The widespread application of LFMs indicates a shift from local pattern matching to global semantic understanding, enhancing safety and compliance in trajectory generation [20].
- Future research should focus on developing low-latency inference techniques, constructing motion-oriented foundational models, and advancing world perception and causal reasoning models [21].
A New Paradigm for Robotic Manipulation: A Systematic Survey of VLA Models | Jinqiu Select
锦秋集· 2025-09-02 13:41
Core Insights
- The article discusses the emergence of Vision-Language-Action (VLA) models based on large Vision-Language Models (VLMs) as a transformative paradigm in robotic manipulation, addressing the limitations of traditional methods in unstructured environments [1][4][5].
- It highlights the need for a structured classification framework to mitigate research fragmentation in the rapidly evolving VLA field [2].

Group 1: New Paradigm in Robotic Manipulation
- Robotic manipulation is a core challenge at the intersection of robotics and embodied AI, requiring deep understanding of visual and semantic cues in complex environments [4].
- Traditional methods rely on predefined control strategies, which struggle in unstructured real-world scenarios, revealing limitations in scalability and generalization [4][5].
- The advent of large VLMs has provided a revolutionary approach, enabling robots to interpret high-level human instructions and generalize to unseen objects and scenes [5][10].

Group 2: VLA Model Definition and Classification
- VLA models are defined as systems that utilize a large VLM to understand visual observations and natural language instructions, followed by a reasoning process that generates robotic actions [6][7].
- VLA models are categorized into two main types, Monolithic Models and Hierarchical Models, each with distinct architectures and functionalities [7][8].

Group 3: Monolithic Models
- Monolithic VLA models can be implemented in single-system or dual-system architectures, integrating perception and action generation into a unified framework [14][15].
- Single-system models process all modalities together, while dual-system models separate reflective reasoning from reactive behavior, enhancing efficiency [15][16].

Group 4: Hierarchical Models
- Hierarchical models consist of a planner and a policy, allowing for independent operation and modular design, which enhances flexibility in task execution [43].
- These models can be further divided into Planner-Only and Planner+Policy categories, with the former focusing solely on planning and the latter integrating action execution [43][44].

Group 5: Advancements in VLA Models
- Recent advancements in VLA models include enhancements in perception modalities, such as 3D and 4D perception, as well as the integration of tactile and auditory information [22][23][24].
- Efforts to improve reasoning capabilities and generalization abilities are crucial for enabling VLA models to perform complex tasks in diverse environments [25][26].

Group 6: Performance Optimization
- Performance optimization in VLA models focuses on enhancing inference efficiency through architectural adjustments, parameter optimization, and inference acceleration techniques [28][29][30].
- Dual-system models have emerged to balance deep reasoning with real-time action generation, facilitating smoother deployment in real-world scenarios [35].

Group 7: Future Directions
- Future research directions include the integration of memory mechanisms, 4D perception, efficient adaptation, and multi-agent collaboration to further enhance VLA model capabilities [1][6].
A Crash Course on Planning for Autonomous Driving Perception Engineers
自动驾驶之心· 2025-08-08 16:04
Core Insights
- The article discusses the evolution and importance of planning modules in autonomous driving, emphasizing the need for engineers to understand both traditional and machine learning-based approaches to effectively address challenges in the field [5][8][10].

Group 1: Importance of Planning
- Understanding planning is crucial for engineers, especially in the context of autonomous driving, as it allows for better service to downstream customers and enhances problem-solving capabilities [8][10].
- The transition from rule-based systems to machine learning systems in planning will likely see a coexistence of both methods for an extended period, with a gradual shift in their usage ratio from 8:2 to 2:8 [8][10].

Group 2: Planning System Overview
- The planning system in autonomous vehicles is essential for generating safe, comfortable, and efficient driving trajectories, relying on inputs from perception outputs [11][12].
- Traditional planning modules consist of global path planning, behavior planning, and trajectory planning, with behavior and trajectory planning often working in tandem [12].

Group 3: Challenges in Planning
- A significant challenge in the planning technology stack is the lack of standardized terminology, leading to confusion in both academic and industrial contexts [15].
- The article highlights the need for a unified approach to behavior planning, as the current lack of consensus on semantic actions limits the effectiveness of planning systems [18].

Group 4: Planning Techniques
- The article outlines three primary tools used in planning: search, sampling, and optimization, each with its own methodologies and applications in autonomous driving [24][41].
- Search methods, such as Dijkstra and A* algorithms, are popular for path planning, while sampling methods like Monte Carlo are used for evaluating numerous options quickly [25][32].
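As a concrete illustration of the search methods named above, here is a minimal A* on a toy occupancy grid. The grid, unit step cost, and Manhattan heuristic are illustrative assumptions; production planners search lattices of kinematically feasible motion primitives rather than plain grid cells:

```python
import heapq

def astar(grid, start, goal):
    """A* shortest path on a 4-connected occupancy grid (1 = obstacle).

    Maintains a priority queue ordered by f = g + h, where g is the cost
    so far and h is an admissible Manhattan-distance heuristic, so the
    first time the goal is popped the path is optimal.
    """
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_set = [(h(start), 0, start, [start])]
    closed = set()
    while open_set:
        _, g, cell, path = heapq.heappop(open_set)
        if cell == goal:
            return path
        if cell in closed:
            continue
        closed.add(cell)
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in closed):
                heapq.heappush(open_set, (g + 1 + h((nr, nc)), g + 1,
                                          (nr, nc), path + [(nr, nc)]))
    return None  # goal unreachable

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = astar(grid, (0, 0), (2, 0))  # forced to detour around the wall
```

With an all-zero heuristic this degenerates into Dijkstra's algorithm, which is exactly the relationship between the two search methods the article names.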
Group 5: Industrial Practices
- The article discusses the distinction between decoupled and joint spatiotemporal planning methods, with decoupled solutions being easier to implement but potentially less optimal in complex scenarios [52][54].
- The Apollo EM planner is presented as an example of a decoupled planning approach, which simplifies the problem by breaking it into two two-dimensional subproblems [56][58].

Group 6: Decision-Making in Autonomous Driving
- Decision-making in autonomous driving focuses on interactions with other road users, addressing uncertainties and dynamic behaviors that complicate planning [68][69].
- The use of Markov Decision Process (MDP) and Partially Observable Markov Decision Process (POMDP) frameworks is essential for handling the probabilistic nature of interactions in driving scenarios [70][74].
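The MDP framework mentioned above can be made concrete with tabular value iteration on a toy lane-change example. The states, rewards, and transition probabilities below are invented purely for illustration and bear no relation to any production planner:

```python
def value_iteration(states, actions, transitions, rewards,
                    gamma=0.9, iters=200):
    """Tabular value iteration for a small MDP.

    transitions[s][a] -> list of (probability, next_state)
    rewards[s][a]     -> immediate reward for taking a in s
    Repeatedly applies the Bellman optimality backup until (approximate)
    convergence and returns the state-value table V.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(rewards[s][a]
                    + gamma * sum(p * V[s2] for p, s2 in transitions[s][a])
                    for a in actions)
             for s in states}
    return V

# Toy MDP: from "lane", a lane change succeeds with probability 0.8.
states = ["lane", "goal"]
actions = ["keep", "change"]
transitions = {
    "lane": {"keep": [(1.0, "lane")],
             "change": [(0.8, "goal"), (0.2, "lane")]},
    "goal": {"keep": [(1.0, "goal")], "change": [(1.0, "goal")]},
}
rewards = {
    "lane": {"keep": 0.0, "change": 1.0},
    "goal": {"keep": 0.0, "change": 0.0},
}
V = value_iteration(states, actions, transitions, rewards)
```

A POMDP planner replaces the known state with a belief distribution over states and runs an analogous backup over beliefs, which is what makes the partially observable case so much harder computationally.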
Getting Absolutely Crushed in a VLM Job Interview...
自动驾驶之心· 2025-07-12 12:00
Core Viewpoint
- The article discusses the advancements and applications of large models in autonomous driving, particularly focusing on the integration of multi-modal large models in the industry and their potential for future development [2][4][17].

Group 1: Interview Insights
- The interview process for a position at Li Auto involved extensive discussions on large models, including their foundational concepts and practical applications in autonomous driving [2][4].
- The interviewer emphasized the importance of private dataset construction and data collection methods, highlighting that data remains the core of business models [4][6].

Group 2: Course Overview
- A course on multi-modal large models is introduced, covering topics from general multi-modal models to fine-tuning techniques, ultimately focusing on end-to-end autonomous driving applications [5][9][11].
- The course structure includes chapters on the introduction to multi-modal large models, foundational modules, general models, fine-tuning techniques, and specific applications in autonomous driving [9][11][17].

Group 3: Technical Focus
- The article outlines the technical aspects of multi-modal large models, including architecture, training paradigms, and the significance of fine-tuning techniques such as Adapter and LoRA [11][15].
- It highlights the application of these models in autonomous driving, referencing algorithms like DriveVLM, which is pivotal for Li Auto's end-to-end driving solutions [17][19].

Group 4: Career Development
- The course also addresses career opportunities in the field, discussing potential employers, job directions, and the skills required for success in the industry [19][26].
- It emphasizes the importance of having a solid foundation in deep learning and model deployment, along with practical coding skills [27].
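The LoRA fine-tuning technique named above reduces to a frozen base weight plus a trainable low-rank update. The stdlib-only toy below shows that core idea on nested lists; the `A`/`B` shapes and the `alpha` scaling follow one common convention (as in PEFT-style implementations), and all values are illustrative:

```python
def matmul(a, b):
    """Naive matrix multiply on nested lists (stdlib-only sketch)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_forward(x, W, A, B, alpha=4.0):
    """y = x @ W + (alpha / r) * x @ A @ B.

    W (d_in x d_out) stays frozen; only the low-rank factors
    A (d_in x r) and B (r x d_out) are trained, so the number of
    trainable parameters scales with r rather than d_in * d_out.
    """
    r = len(A[0])
    scale = alpha / r
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    return [[base[i][j] + scale * delta[i][j]
             for j in range(len(base[0]))]
            for i in range(len(base))]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity for clarity)
A = [[1.0], [1.0]]             # rank r = 1
x = [[2.0, 3.0]]
y_init = lora_forward(x, W, A, [[0.0, 0.0]])  # B = 0: matches base layer
```

Initializing `B` to zero is the standard trick that makes the adapted layer start out exactly equal to the frozen pretrained layer, so fine-tuning begins from the base model's behavior.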
VLM-Based Fast-Slow Dual-System Autonomous Driving: A DriveVLM Breakdown
自动驾驶之心· 2025-06-27 09:15
Core Viewpoint
- The article discusses the rapid advancements in large models and their applications in the autonomous driving sector, particularly focusing on the DriveVLM algorithm developed by Tsinghua University and Li Auto to address long-tail problems in real-world driving scenarios [2].

Group 1: DriveVLM Overview
- DriveVLM aims to tackle the challenges faced in the transition from Level 2 (L2) to Level 4 (L4) autonomous driving, particularly the infinite long-tail problems that arise in real-world scenarios [2].
- The industry has recognized that data-driven approaches alone may not suffice to evolve towards true L4 autonomous driving, necessitating further exploration of next-generation solutions [2].

Group 2: Innovations of DriveVLM
- Chain-of-Thought (CoT) reasoning for scene description, scene analysis, and hierarchical planning [4].
- DriveVLM-Dual, which integrates DriveVLM with traditional modules for real-time planning and enhanced spatial reasoning capabilities [4].
- A comprehensive data mining and annotation process to construct the corner-case dataset SUP-AD [4].

Group 3: Course Structure and Content
- Introduction to multi-modal large models, including foundational concepts and applications [21].
- Basic modules of multi-modal large models, explaining components like modality encoders and projectors [23].
- General multi-modal large models, focusing on algorithms for various tasks [25].
- Fine-tuning and reinforcement learning techniques essential for model development [28].
- Applications of multi-modal large models in autonomous driving, highlighting DriveVLM as a key algorithm [30].
- Job preparation related to multi-modal large models, addressing industry needs and interview preparation [32].
Experience Up, Prices Down: End-to-End Accelerates Toward Deployment
HTSC· 2025-03-02 07:30
Investment Rating
- The report maintains a "Buy" rating for several companies in the automotive sector, including XPeng Motors, Li Auto, BYD, SAIC Motor, Great Wall Motors, and Leap Motor [10].

Core Viewpoints
- The report emphasizes that by 2025, advanced intelligent driving (high-level AD) will see improved user experience and reduced prices, transitioning from a trial phase to widespread adoption among consumers [14][20].
- The penetration rates for L2.5 and L2.9 intelligent driving are projected to reach 3.5% and 10.1% respectively by November 2024, with expectations of further growth to 16% for highway NOA and 14% for urban NOA by 2025 [14][24].
- The report highlights the shift towards end-to-end architecture in intelligent driving systems, which allows for higher performance limits and seamless data transmission, enhancing the overall driving experience [30][31].

Summary by Sections

Investment Recommendations
- The report suggests focusing on companies with strong engineering capabilities and advantages in data, computing power, and funding, such as XPeng Motors, Li Auto, and BYD, as well as third-party suppliers like Desay SV and Kobot [5][10].

Market Trends
- The intelligent driving market is evolving, with a focus on enhancing user experience through features like "human-like" driving capabilities and the implementation of end-to-end architectures [14][20].
- The price of high-level intelligent driving systems is expected to decrease significantly, with current models priced below 100,000 and 150,000 yuan for highway and urban NOA respectively [24][28].

Technological Developments
- The report discusses the advancements in end-to-end architecture, which is gaining traction among automotive manufacturers, allowing for improved data processing and decision-making capabilities [30][31].
- It also mentions the importance of AI-driven models and the need for automotive companies to adapt their organizational structures to support these technological shifts [15][41].

Competitive Landscape
- The report outlines the competitive dynamics among leading automotive companies, highlighting their respective advancements in intelligent driving technologies and the rapid iteration of their systems [41][45].
- Companies like Tesla, Li Auto, and XPeng Motors are noted for their significant investments in R&D and their ability to push updates and improvements quickly [42][46].