自动驾驶之心

Latest Survey! Beijing Jiaotong University and Collaborators Systematically Review Core Advances in LLM Agent Reasoning Frameworks
自动驾驶之心· 2025-08-31 23:33
Paper authors | BingXi Zhao et al.
Editor | 大模型之心Tech
This article is shared for academic purposes only; if there is any infringement, contact us for removal.

Today 大模型之心Tech shares a systematic survey of reasoning frameworks for agents built on large language models (LLMs). Addressing the problems of "blurred boundaries" and "undervalued contributions" in the current LLM agent field, it is the first work to take "framework-level reasoning methods" as its core perspective, filling the gap in systematic surveys of this direction and providing the research community with a unified analytical baseline.

Preface
From Microsoft's AutoGen to the "AI programmer" Devin, LLM-based agents are reshaping the boundaries of artificial intelligence at unprecedented speed. They decompose tasks, work out plans, invoke tools, and collaborate with one another, seemingly bringing "machine reasoning" into a new era. Beneath this wave, however, a core problem of "dual ambiguity" has become increasingly prominent: is an agent's strong performance attributable to the stronger model behind it, or does it come from its "framework-level" ...
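To make the "framework-level" distinction concrete, here is a minimal, hypothetical sketch of an agent reasoning loop in the ReAct style: the framework, not the model, decides how reasoning steps, tool calls, and observations are sequenced. `fake_llm`, the tool registry, and the `Action:`/`Final Answer:` protocol are illustrative stand-ins, not the survey's own code.

```python
# Minimal framework-level agent loop: the loop structure (plan -> act -> observe)
# lives in the framework, independent of which LLM sits behind `fake_llm`.

from typing import Callable

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned tool invocation."""
    if "Observation: 4" in prompt:
        return "Final Answer: 4"
    return "Action: calculator[2 + 2]"

TOOLS: dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(eval(expr)),  # toy tool; never eval untrusted input
}

def run_agent(task: str, max_steps: int = 5) -> str:
    prompt = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = fake_llm(prompt)
        if reply.startswith("Final Answer:"):
            return reply.removeprefix("Final Answer:").strip()
        # Parse "Action: tool[args]" and execute the named tool.
        name, args = reply.removeprefix("Action: ").rstrip("]").split("[", 1)
        observation = TOOLS[name](args)
        prompt += f"{reply}\nObservation: {observation}\n"
    return "gave up"

print(run_agent("What is 2 + 2?"))  # -> 4
```

Swapping the model while keeping this loop fixed is exactly the kind of controlled comparison the survey argues the field needs in order to separate model strength from framework contribution.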
Peking University Upgrades DrivingGaussian++: Training-Free, Free-Form Editing of Intelligent Driving Scenes!
自动驾驶之心· 2025-08-31 23:33
Core Viewpoint
- The article presents DrivingGaussian++, a framework developed by researchers from Peking University and Google DeepMind that enables realistic reconstruction and editable simulation of dynamic driving scenes without additional training [4][18].

Group 1: Importance of Data in Autonomous Driving
- Data diversity and quality are crucial for the performance and potential of autonomous driving models, with particular attention to long-tail scenarios that are underrepresented in datasets [2][3].
- 3D scene editing has emerged as a specialized field aimed at improving the robustness and safety of autonomous driving systems by simulating varied real-world driving conditions [2].

Group 2: Challenges in 3D Scene Editing
- Existing editing tools typically specialize in a single aspect of 3D scene editing, which makes them inefficient for large-scale autonomous driving simulation [3].
- Accurate 3D reconstruction is difficult due to limited sensor data, high-speed vehicle motion, and varying lighting conditions, making a complete and realistic 3D environment hard to obtain [3][13].

Group 3: DrivingGaussian++ Framework
- DrivingGaussian++ uses a composite Gaussian splatting approach to model complex driving scenes in layers, separating static backgrounds from dynamic targets for more precise reconstruction [4][6].
- The framework introduces two novel modules, Incremental Static 3D Gaussians and Composite Dynamic Gaussian Graphs, to enhance the modeling of static and dynamic elements in driving scenes [6][31].

Group 4: Editing Capabilities
- Reconstructed scenes can be edited in a controlled, efficient way without additional training, covering tasks such as texture modification, weather simulation, and target manipulation [20][41].
- By integrating 3D geometric priors and leveraging large language models for dynamic predictions, the framework keeps the editing process coherent and realistic [41][51].

Group 5: Performance Comparison
- DrivingGaussian++ outperforms existing methods in visual realism and quantitative consistency across various editing tasks, demonstrating superior performance in dynamic driving scenarios [62][70].
- Its editing time, typically 3 to 10 minutes, is significantly lower than that of other models, highlighting its efficiency [70].
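As a rough illustration of the composite idea (static background Gaussians kept in world coordinates, each dynamic object's Gaussians placed by a per-timestep pose), here is a minimal sketch; the `GaussianSet` structure, field names, and pose interface are assumptions for illustration, not the paper's actual API or a full splatting pipeline.

```python
# Hypothetical sketch: compose a driving scene from static and dynamic Gaussians.

import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianSet:
    means: np.ndarray    # (N, 3) Gaussian centers
    colors: np.ndarray   # (N, 3) RGB in [0, 1]

def compose_scene(static: GaussianSet,
                  dynamic: dict[str, GaussianSet],
                  poses_at_t: dict[str, tuple[np.ndarray, np.ndarray]]) -> GaussianSet:
    """Merge static Gaussians with dynamic ones transformed by (R, t) per object."""
    means = [static.means]
    colors = [static.colors]
    for obj_id, gs in dynamic.items():
        R, t = poses_at_t[obj_id]            # 3x3 rotation, 3-vector translation
        means.append(gs.means @ R.T + t)     # object frame -> world frame
        colors.append(gs.colors)
    return GaussianSet(np.concatenate(means), np.concatenate(colors))

# Toy usage: one background blob plus one moving "vehicle".
static = GaussianSet(np.random.rand(100, 3) * 50, np.random.rand(100, 3))
car = {"car_0": GaussianSet(np.random.rand(20, 3), np.random.rand(20, 3))}
pose = {"car_0": (np.eye(3), np.array([10.0, 2.0, 0.0]))}
scene = compose_scene(static, car, pose)
print(scene.means.shape)  # (120, 3)
```

Keeping dynamic objects in their own frames is what makes training-free editing tractable: moving, removing, or swapping a target only changes its pose entry or Gaussian set, never the background.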
Latest VLA Survey | CAS Deep Dive: Model Architectures and Evolution for Embodied Manipulation
自动驾驶之心· 2025-08-30 16:03
Core Insights
- The article surveys the emergence and development of Vision-Language-Action (VLA) models, which integrate visual perception, natural language understanding, and action control, marking a significant milestone in the pursuit of general robotic intelligence [3][5].

Development Stages
The development of VLA models is categorized into three stages:
1. **Emergence Stage**: Initial attempts to connect vision, language, and actions without a formal VLA concept, focusing on visual imitation learning and language annotation [7].
2. **Exploration Stage**: By mid-2023, the VLA concept was formally introduced, with the Transformer architecture becoming mainstream and enhancing model generalization in open scenarios [8].
3. **Rapid Development Stage**: Since late 2024, VLA models have iterated rapidly, addressing generalization and inference-efficiency issues and evolving from single-layer to multi-layer architectures [9].

Core Dimensions of VLA Models
VLA models consist of three main components:
1. **Observation Encoding**: Transitioning from CNN and RNN structures to unified architectures such as ViT and cross-modal Transformers, incorporating multi-modal information for enhanced environmental perception [12].
2. **Feature Inference**: The Transformer architecture has become the backbone, with newer designs such as the Diffusion Transformer and Mixture of Experts enhancing inference capability [14].
3. **Action Decoding**: Evolving from discrete token representations to continuous control prediction, improving operational precision in real environments [15].

Training Data for VLA Models
VLA training data is categorized into four types:
1. **Internet Image-Text Data**: Provides rich visual and linguistic priors but lacks dynamic environment understanding [17].
2. **Video Data**: Contains temporal features of human activities that aid in learning complex manipulation skills, though it often lacks precise action annotations [17].
3. **Simulation Data**: Offers low-cost, scalable, well-annotated data for pre-training and strategy exploration, but requires adaptation for real-world applications [19].
4. **Real Robot Collected Data**: Directly reflects sensor noise and environmental complexity, which is crucial for enhancing VLA generalization and reliability, albeit with high collection costs [19].

Pre-training and Post-training Methods
Common pre-training strategies include:
1. **Single-Domain Data Training**: Early methods focused on single-modal data, providing initial perception and action-representation capabilities [21].
2. **Cross-Domain Staged Training**: Models are pre-trained on large datasets before fine-tuning on robot manipulation data, effectively utilizing large-scale data priors [21].
3. **Cross-Domain Joint Training**: Simultaneously utilizes multiple data types to learn the relationships between perception, language, and action [21].
4. **Chain-of-Thought Enhancement**: Introduces reasoning chains to enable task decomposition and logical reasoning [21].

Post-training methods optimize pre-trained VLA models for specific tasks:
1. **Supervised Fine-tuning**: Uses labeled trajectory data for end-to-end training, strengthening the perception-to-action mapping [22].
2. **Reinforcement Fine-tuning**: Optimizes the model's policy through interaction data, improving adaptability and performance [22].
3. **Inference Expansion**: Enhances performance through improved inference processes without modifying model parameters [22].
Evaluation of VLA Models
The evaluation framework for VLA models includes:
1. **Real-world Evaluation**: Tests model performance in real environments, providing reliable results but at high cost and with low repeatability [24].
2. **Simulator Evaluation**: Uses high-fidelity simulation platforms for testing, enabling large-scale experiments but with potential discrepancies from real-world performance [24].
3. **World Model Evaluation**: Employs learned environment models for virtual assessment, reducing cost but relying on the accuracy of the world model [24].

Future Directions for VLA Models
Future research on VLA models will focus on:
1. **Generalization Reasoning**: Improving the model's ability to adapt to unknown tasks and environments, integrating logical reasoning with robotic manipulation [26].
2. **Fine-grained Operations**: Improving the model's capability on complex tasks by integrating multi-modal sensory information for precise interaction modeling [26].
3. **Real-time Inference**: Addressing the need for efficient architectures and model compression to meet high-frequency control demands [27].
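A minimal sketch of the three components listed under "Core Dimensions" (ViT-style observation encoding, a shared Transformer for feature inference, and a continuous action decoder) might look like the following; all dimensions, layer choices, and the `TinyVLA` name are illustrative assumptions rather than any specific model from the survey.

```python
# Hypothetical VLA-style forward pass: image + language tokens in, action out.

import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    def __init__(self, d=128, vocab=1000, action_dim=7):
        super().__init__()
        # Observation encoding: patchify the image into tokens (ViT-style).
        self.patch = nn.Conv2d(3, d, kernel_size=16, stride=16)
        self.text_emb = nn.Embedding(vocab, d)
        # Feature inference: a shared Transformer over image + language tokens.
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Action decoding: regress a continuous control vector (e.g. a 7-DoF arm).
        self.action_head = nn.Linear(d, action_dim)

    def forward(self, image, text_ids):
        img_tok = self.patch(image).flatten(2).transpose(1, 2)  # (B, P, d)
        txt_tok = self.text_emb(text_ids)                       # (B, T, d)
        fused = self.backbone(torch.cat([img_tok, txt_tok], dim=1))
        return self.action_head(fused.mean(dim=1))              # (B, action_dim)

model = TinyVLA()
action = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 8)))
print(action.shape)  # torch.Size([1, 7])
```

The continuous regression head here reflects the "Action Decoding" trend the survey describes: earlier VLA models instead emitted discrete action tokens from the same Transformer vocabulary.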
Land a Job in Autonomous Driving Perception! Only One Spot Left in the 1-on-6 Trajectory Prediction Small-Group Course~
自动驾驶之心· 2025-08-30 16:03
Group 1
- The article emphasizes the importance of trajectory prediction in autonomous driving and related fields, noting that end-to-end methods are not yet widely adopted and trajectory prediction remains a key research area [1][3].
- Diffusion models are being integrated into trajectory prediction, significantly enhancing multi-modal modeling capability; models such as the Leapfrog Diffusion Model (LED) achieve real-time prediction, reportedly accelerating inference by roughly 19-30x on various datasets [2][3].
- The course aims to provide a systematic understanding of trajectory prediction, combining theoretical knowledge with practical coding skills and assisting students in developing their own models and writing research papers [6][8].

Group 2
- The target audience includes graduate students and professionals in trajectory prediction and autonomous driving who seek to enhance their research capabilities and follow cutting-edge developments in the field [8][10].
- The curriculum covers classic and cutting-edge papers, baseline code, and methodology for selecting research topics, conducting experiments, and writing papers [20][30].
- The course structure includes 12 weeks of online group research followed by 2 weeks of paper guidance, ensuring participants gain practical experience and produce a research paper draft by the end of the program [31][35].
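To give a feel for diffusion-style trajectory prediction, the sketch below starts from Gaussian noise over future waypoints and iteratively denoises toward a plausible path. The `denoiser` is a hand-written stub standing in for a trained network, and LED's leapfrog trick (a learned initializer that skips most denoising steps to reach real-time speeds) is only described in the comment, not implemented.

```python
# Hypothetical reverse-diffusion sampling loop for a 2D trajectory.

import numpy as np

def denoiser(noisy_traj: np.ndarray, t: int, history: np.ndarray) -> np.ndarray:
    """Stub for a learned model that predicts a cleaner trajectory; here it just
    pulls waypoints toward a constant-velocity extrapolation of the history."""
    v = history[-1] - history[-2]
    target = history[-1] + v * np.arange(1, len(noisy_traj) + 1)[:, None]
    return noisy_traj + 0.3 * (target - noisy_traj)

def sample_trajectory(history: np.ndarray, horizon=12, steps=20, seed=0):
    rng = np.random.default_rng(seed)
    traj = rng.standard_normal((horizon, 2))   # pure noise over (x, y) waypoints
    for t in reversed(range(steps)):
        traj = denoiser(traj, t, history)      # one reverse-diffusion step
    return traj

past = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])  # agent moving along +x
print(sample_trajectory(past)[:3])  # first waypoints near (3,0), (4,0), (5,0)
```

Multi-modality comes from the random initialization: sampling with different seeds yields different but individually coherent futures, which is exactly what deterministic regressors struggle to represent.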
Tier-1 Leader Bosch's End-to-End Finally Reaches Mass Production, and It's One-Stage!
自动驾驶之心· 2025-08-30 16:03
Core Viewpoint
- The article discusses advances in autonomous driving technology, focusing on WePilot AiDrive, a new end-to-end ADAS solution developed by WeRide that aims to enhance the driving experience and safety through advanced AI capabilities [5][9][10].

Group 1: WeRide's New Technology
- WeRide has launched a new end-to-end ADAS solution named WePilot AiDrive, which is set to be mass-produced within the year [5].
- The system integrates sensor-data input and vehicle-trajectory output into a single model, enhancing the efficiency and responsiveness of the driving stack [10][24].
- The new system demonstrates improved performance in complex driving scenarios, such as navigating through urban villages and recognizing pedestrians in challenging lighting conditions [12][14][24].

Group 2: Comparison with Previous Systems
- The previous two-stage design used separate perception and control models, which often led to information loss at the interface and a limited understanding of driving environments [25][30].
- The new one-stage model learns the relationship between input data and output trajectories directly, significantly improving system performance [33].
- The transition from a rule-based approach to a more integrated model aims to overcome the limitations of earlier systems, which struggled with generalization and adaptability [32][35].

Group 3: Market Implications
- The collaboration between WeRide and Bosch aims to make advanced driving capabilities accessible across vehicle price segments, not just high-end models [41][44].
- Fewer than 20% of vehicles in the Chinese market are currently equipped with advanced intelligent-driving features, indicating significant growth potential for WeRide's technology [42].
- The goal is to push L2+ capability beyond the "value inflection point," making advanced driving technology mainstream [44].
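The architectural contrast can be sketched as two interfaces: in the two-stage design the planner only sees a hand-defined object list, while the one-stage design maps sensor features straight to a trajectory. All functions below are toy stubs under assumed shapes, not WeRide or Bosch code.

```python
# Schematic contrast of two-stage vs one-stage end-to-end driving stacks.

import numpy as np

def two_stage(sensor_feat: np.ndarray) -> np.ndarray:
    boxes = perceive(sensor_feat)       # stage 1: explicit object list
    return plan_from_objects(boxes)     # stage 2: planner sees only the boxes

def one_stage(sensor_feat: np.ndarray) -> np.ndarray:
    return policy(sensor_feat)          # single model: features -> trajectory

# Stubs so the sketch runs end to end.
def perceive(f):          return f.reshape(-1, 8)[:4]          # 4 fake boxes
def plan_from_objects(b): return np.zeros((12, 2)) + b.mean()  # toy planner
def policy(f):            return np.zeros((12, 2)) + f.mean()  # toy policy

feat = np.random.rand(4, 8)
print(two_stage(feat).shape, one_stage(feat).shape)  # (12, 2) (12, 2)
```

The point of the sketch is the bottleneck: whatever `perceive` does not encode in its box list (unusual obstacles, lighting cues, scene context) is unavailable to the downstream planner, whereas the one-stage `policy` can in principle learn to exploit all of it.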
Closed-Loop End-to-End Surges 20%! HUST & Xiaomi Build the Open-Source Framework ORION
自动驾驶之心· 2025-08-30 16:03
Core Viewpoint
- The article discusses advances in end-to-end (E2E) autonomous driving technology, focusing on the introduction of the ORION framework, which integrates vision-language models (VLMs) for improved decision-making in complex environments [3][30].

Introduction
- Recent E2E autonomous driving systems face challenges in complex closed-loop interactions due to limited causal reasoning capability [3][12].
- VLMs offer new hope for E2E autonomous driving, but a significant gap remains between a VLM's semantic reasoning space and the numerical action space required for driving [3][17].

ORION Framework
- ORION is proposed as an end-to-end autonomous driving framework that utilizes visual-language instructions for trajectory generation [3][18].
- The framework incorporates QT-Former for aggregating long-term historical context, a VLM for scene understanding and reasoning, and a generative model that aligns the reasoning and action spaces [3][16][18].

Performance Evaluation
- ORION achieved a driving score of 77.74 and a success rate of 54.62% on the challenging Bench2Drive benchmark, outperforming previous state-of-the-art (SOTA) methods by 14.28 points and 19.61 percentage points in success rate [5][24].
- The framework demonstrated superior performance in specific driving scenarios such as overtaking (71.11%), emergency braking (78.33%), and traffic-sign recognition (69.15%) [26].

Key Contributions
1. QT-Former enhances the model's understanding of historical scenes by effectively aggregating long-term visual context [20].
2. The VLM enables multi-dimensional analysis of driving scenes, integrating user instructions and historical information for action reasoning [21].
3. The generative model aligns the VLM's reasoning space with the action space for trajectory prediction, ensuring sound driving decisions in complex scenarios [22].

Conclusion
- ORION provides a novel solution for E2E autonomous driving by achieving semantic and action-space alignment, integrating long-term context aggregation, and jointly optimizing visual understanding and path planning [30].
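A minimal sketch of long-horizon context aggregation in the spirit of QT-Former follows: a small set of learnable queries cross-attends over tokens from many past frames, producing a fixed-size summary that a VLM could consume regardless of history length. Layer sizes and the `HistoryAggregator` name are assumptions, not the ORION implementation.

```python
# Hypothetical query-based temporal aggregation over past-frame features.

import torch
import torch.nn as nn

class HistoryAggregator(nn.Module):
    def __init__(self, d=128, n_queries=8):
        super().__init__()
        # Learnable queries that will soak up long-horizon context.
        self.queries = nn.Parameter(torch.randn(1, n_queries, d))
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, history_feats):             # (B, T*N, d) past-frame tokens
        q = self.queries.expand(history_feats.size(0), -1, -1)
        summary, _ = self.attn(q, history_feats, history_feats)
        return summary                            # (B, n_queries, d), fixed size

agg = HistoryAggregator()
past = torch.randn(2, 10 * 32, 128)  # 10 frames x 32 tokens each
print(agg(past).shape)               # torch.Size([2, 8, 128])
```

The fixed-size output is the design point: however long the driving history grows, the downstream VLM always receives the same small token budget, which keeps inference cost bounded.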
It's Decided! Going for Autonomous Driving Algorithms After All
自动驾驶之心· 2025-08-30 04:03
Core Viewpoint
- The article emphasizes the growing interest and opportunity in the autonomous driving sector, particularly in roles related to end-to-end systems, VLA (Vision-Language-Action), and reinforcement learning, which are among the highest-paying positions in the AI industry [1][2].

Community and Learning Resources
- The "Autonomous Driving Heart Knowledge Planet" community has over 4,000 members and aims to grow to nearly 10,000 in the next two years, providing a platform for technical sharing and job-related discussion [1].
- The community offers a collection of over 40 technical roadmaps, including learning paths for end-to-end autonomous driving, VLA benchmarks, and practical engineering [2][5].
- Members can access a variety of resources, including video content, Q&A sessions, and practical problem-solving related to autonomous driving technologies [1][2].

Technical Learning and Career Development
- The community provides structured learning paths for beginners, including full-stack courses suitable for those with no prior experience [7][9].
- Job-referral mechanisms within the community connect members with openings at various autonomous driving companies [9][11].
- The community regularly engages industry experts to discuss trends, technological advances, and mass-production challenges [4][62].

Industry Insights and Trends
- The article highlights the industry's need for talent, particularly for tackling challenges related to L3/L4 mass production [1].
- It stresses the importance of dataset iteration speed relative to technological progress, especially as AI enters the era of large models [63].
- The community aims to foster a complete autonomous driving ecosystem, bringing together academic and industrial insights [12][64].
Business Partner Recruitment Is Open! Model Deployment / VLA / End-to-End Directions~
自动驾驶之心· 2025-08-29 16:03
Group 1
- The article announces the recruitment of 10 partners for the autonomous driving business, focusing on course development, paper guidance, and hardware research [2][5].
- Recruitment targets individuals with expertise in advanced fields such as large models, multimodal models, and 3D object detection [3][4].
- Benefits of joining include resource sharing for job seeking, PhD application recommendations, and substantial cash incentives [5][6].
Explaining End-to-End Deployment via Q&A: [UniAD/PARA-Drive/SparseDrive/VADv2]
自动驾驶之心· 2025-08-29 16:03
Author | Qian Hongzhong (钱红中)
Editor | 自动驾驶之心
Original link: https://zhuanlan.zhihu.com/p/12088531309
This article is shared for academic purposes only; if there is any infringement, contact us for removal.

Q: Roughly how many kinds of end-to-end models are there?
Two. The first is the fully black-box OneNet, where the model directly optimizes the planner. The second is modular end-to-end, i.e., modules cascaded or in parallel, where feat-level/query-level interaction among the perception, prediction, and planning modules reduces the error accumulation of staged autonomous driving pipelines.

Q: [UniAD] The overall framework is divided into four parts: given multi-view camera imgs as input, the Backbone module extracts the BEV feat; the Perception module handles scene-level perception, covering agents + ego as well as the map; the Prediction module completes multi-mode trajectory prediction for agents + ego based on temporal interaction and agents-scene interaction; and the Planner module, based on the predicted trajectories and the BEV feat, completes the path ...
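The four-part UniAD-style cascade described in the answer can be sketched as a pipeline of stubs; every function body below is a placeholder with assumed tensor shapes, meant only to show how BEV features flow from the backbone through perception and prediction into the planner.

```python
# Hypothetical modular end-to-end cascade in the UniAD spirit (all stubs).

import numpy as np

def backbone(multi_view_imgs):
    """Stub: extract a BEV feature map from multi-view camera images."""
    return np.random.rand(200, 200, 64)

def perception(bev):
    """Stub: scene-level perception producing agent/map queries from BEV feat."""
    return np.random.rand(32, 64)          # 32 agent queries

def prediction(bev, agent_queries):
    """Stub: multi-mode trajectory prediction via agents-scene interaction."""
    return np.random.rand(32, 6, 12, 2)    # 32 agents x 6 modes x 12 steps x (x, y)

def planner(bev, agent_trajs):
    """Stub: plan the ego path from BEV feat and the predicted trajectories."""
    return np.zeros((12, 2))

bev = backbone(None)          # placeholder for multi-view camera imgs
queries = perception(bev)
trajs = prediction(bev, queries)
ego_path = planner(bev, trajs)
print(ego_path.shape)         # (12, 2): 12 future ego waypoints
```

Note that the modules hand each other queries and features rather than hard-thresholded detections, which is the feat-level/query-level interaction the answer credits with reducing error accumulation.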
Huawei Firmly Refuses the VLA Route: Is WA the Ultimate Solution for Autonomous Driving?
自动驾驶之心· 2025-08-29 16:03
Core Viewpoint
- Huawei's automotive business has achieved significant milestones, including 1 million vehicles equipped with its driving technology and over 100 million units of lidar shipped, showcasing its long-term strategic vision in the automotive sector [3][4].

Group 1: Achievements and Strategy
- As of July, 1 million vehicles have been equipped with Huawei's QianKun intelligent driving system, and cumulative assisted-driving mileage has reached 4 billion kilometers [3].
- Huawei's automotive business has been investing since 2014, prioritizing R&D over immediate commercialization, which has now led to profitability [4][5].
- The company has launched 28 models in collaboration with various brands, indicating a strong market presence [3].

Group 2: Technology Approach
- Huawei prefers the World Action (WA) model over the Vision-Language-Action (VLA) model for achieving true autonomous driving, believing WA is the more direct and effective approach [5][13].
- The WA model processes inputs such as vision, sound, and touch directly, bypassing the intermediate step of converting data into language [5][14].
- Huawei has developed the WEWA model on the WA architecture, which will be deployed in ADS 4.0 [6].

Group 3: Business Model and Pricing
- Huawei's CEO emphasizes that there is no such thing as a free service in the automotive industry; costs are merely hidden or transferred [7][17].
- The company believes charging for assisted-driving systems is justified by the ongoing cost of updates and maintenance throughout the vehicle's lifecycle [8][18].
- Huawei's lifecycle-management approach ensures that users receive continuous upgrades, enhancing their experience over time [18].

Group 4: Future Plans
- Huawei aims to achieve L3 capability for highway driving and L4 pilot capability in urban areas by 2026, with plans for large-scale commercial use by 2028 [11].
- The company is also working on transforming the intelligent cockpit into a "digital nanny," integrating AI to enhance the user experience [11].

Group 5: Safety and Technology Enhancements
- Huawei's richer sensor configurations, such as additional lidars, are driven by a commitment to safety rather than by a desire to raise product prices [19][20].
- The company focuses on enhancing system precision to prevent accidents and improve user safety across driving scenarios [20][22].