自动驾驶之心
A Reinforcement Learning Training Framework for Multimodal Large Models: EasyR1 Code Walkthrough (GRPO)
自动驾驶之心· 2025-07-15 12:30
Core Insights
- The article walks through the EasyR1 framework for multimodal reinforcement learning, focusing on its implementation and configuration for training models such as Qwen2.5-VL [1][4][6].
Group 1: Framework Overview
- EasyR1 is derived from verl, a framework originally designed for reinforcement learning on language models [1][6].
- The walkthrough is based on the code as of roughly June 10; the framework is still being actively updated [1].
Group 2: Configuration Details
- The configuration file is organized into four main categories: data, algorithm, worker, and trainer, each with its own parameters [6][11].
- Data configuration covers the paths to the training and validation files, the maximum prompt and response lengths, and the batch sizes used per training iteration [9][10].
- Algorithm configuration covers the advantage estimator, the discount factors, and the KL divergence settings [11][13].
Group 3: Training Workflow
- Training starts from a main script that builds the data loaders and enters the training loop [42][43].
- Each step prepares data, generates sequences, and computes rewards, with particular attention to balancing batch sizes across distributed processes [46][50][64].
- The article stresses the handling of multimodal data so that the training process accommodates different input types [65][66].
Group 4: Data Handling
- The dataset must contain specific keys such as problem, answer, and images, formatted as JSON so that the loading functions can read it [40][41].
- The data loading process supports multiple file formats and feeds directly into the training pipeline [41][32].
Group 5: Model Update Mechanism
- The article explains how the actor model is updated, including how the policy loss is computed and how gradients are managed during training [82][86].
- It highlights the role of KL divergence in training, particularly in relation to the reference model [71][80].
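The advantage estimation, policy loss, and KL terms mentioned above can be sketched as follows. This is a minimal, generic GRPO-style illustration in plain Python, not EasyR1's actual code: the function names, the sequence-level log-probabilities, the population standard deviation, the clip ratio of 0.2, and the choice of the low-variance "k3" KL estimator are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the responses sampled from one prompt.

    Each reward is normalized by the mean and (population) standard
    deviation of its own group; eps avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_policy_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
    """PPO-style clipped surrogate loss, averaged over samples (minimized)."""
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new / pi_old
        clipped = max(min(ratio, 1.0 + clip_ratio), 1.0 - clip_ratio)
        losses.append(-min(ratio * adv, clipped * adv))
    return mean(losses)


def kl_penalty(logp_new, logp_ref):
    """Average 'k3' estimate of KL(pi_new || pi_ref); always non-negative."""
    terms = []
    for ln, lr in zip(logp_new, logp_ref):
        log_ratio = lr - ln  # log(pi_ref / pi_new)
        terms.append(math.exp(log_ratio) - log_ratio - 1.0)
    return mean(terms)
```

In a setup like this, the total actor loss would typically be `clipped_policy_loss(...) + kl_coef * kl_penalty(...)`, where `kl_coef` is a hypothetical name for the KL weight taken from the algorithm configuration.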
NIO Reveals a New Trump Card...
自动驾驶之心· 2025-07-15 12:30
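Author: 有车有据 | Source: 智能车参考
RMB 193,900: Li Bin has revealed NIO's key trump card.
The Onvo L90 (乐道L90) has opened for pre-sale, and its full-vehicle purchase price of RMB 279,900 was already a surprise. When the RMB 193,900 BaaS (Battery as a Service) plan was announced, it set the venue alight.
Why can the Chinese build cars this good-looking, compliant, and affordable?
"This is really intriguing. How do they build cars that look this good, meet every requirement, and are priced so affordably? Apart from low labor costs, what do the Chinese have that Americans don't? That is a serious question."
The announcement also ignited the capital markets: after the launch event, NIO's US-listed shares jumped more than 6%.
[Stock quote screenshot: USD 3.690, +0.210 (+6.03%); after-hours 3.72 ...]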
All in One Article: A Roundup of Outstanding Autonomous Driving VLA Work from the Past Year
自动驾驶之心· 2025-07-15 12:30
Core Insights
- The article discusses advances in Vision-Language-Action (VLA) models for autonomous driving, highlighting the integration of navigation and reinforcement learning to extend reasoning beyond visual range [2][3][6].
Group 1: NavigScene
- NavigScene is introduced as a novel auxiliary dataset that pairs local multi-view sensor inputs with global natural language navigation guidance, addressing the critical gap between local perception and global navigation context in autonomous driving [6].
- Three complementary paradigms are implemented in NavigScene: navigation-guided reasoning, navigation-guided preference optimization, and navigation-guided VLA models, enhancing the reasoning and generalization capabilities of autonomous driving systems [6].
- Comprehensive experiments demonstrate significant performance improvements in perception, prediction, and planning tasks by integrating global navigation knowledge into autonomous driving systems [6].
Group 2: AutoVLA
- AutoVLA is proposed as an end-to-end autonomous driving framework that integrates physical action tokens with a pre-trained VLM backbone, enabling direct policy learning and semantic reasoning from raw visual observations and language instructions [12].
- A reinforcement learning-based post-training method using Group Relative Policy Optimization (GRPO) is introduced to achieve adaptive reasoning and further enhance model performance in end-to-end driving tasks [12].
- AutoVLA achieves competitive performance across multiple autonomous driving benchmarks, including open-loop and closed-loop tests [12].
Group 3: ReCogDrive
- ReCogDrive is presented as an end-to-end autonomous driving system that integrates a VLM with a diffusion planner, employing a three-stage training paradigm to address performance drops in rare and long-tail scenarios [13][16].
- The first stage fine-tunes the VLM on a large-scale driving Q&A dataset to mitigate the domain gap between general content and real-world driving scenarios [16].
- The method achieves a state-of-the-art PDMS score of 89.6 on the NAVSIM benchmark, highlighting its effectiveness and feasibility [16].
Group 4: Impromptu VLA
- Impromptu VLA introduces a large-scale, richly annotated dataset aimed at addressing the limitations of existing benchmarks for autonomous driving VLA models [22].
- The dataset is designed to enhance the performance of VLA models in unstructured extreme scenarios, demonstrating significant improvements on established benchmarks [22].
- Experiments show that training with the Impromptu VLA dataset leads to notable improvements in closed-loop NeuroNCAP scores and collision rates [22].
Group 5: DriveMoE
- DriveMoE is a novel end-to-end autonomous driving framework that incorporates a mixture-of-experts (MoE) architecture to handle multi-view sensor data and complex driving scenarios [28].
- The framework features a scene-specific visual MoE and a skill-specific action MoE, addressing the challenges of multi-view redundancy and skill specialization [28].
- DriveMoE achieves state-of-the-art performance in closed-loop evaluations on the Bench2Drive benchmark, demonstrating the effectiveness of combining visual and action MoE in autonomous driving tasks [28].
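To make the mixture-of-experts idea in Group 5 concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not DriveMoE's implementation: the function names, the softmax router, the choice of k = 2, and the representation of experts as plain callables on lists of floats are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts.

    Returns (expert_index, weight) pairs whose weights are renormalized
    to sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs for one input vector."""
    out = None
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)  # only the selected experts are evaluated
        if out is None:
            out = [weight * v for v in y]
        else:
            out = [o + weight * v for o, v in zip(out, y)]
    return out
```

In a design like the one summarized above, the router input would presumably depend on the driving scene (for a visual MoE) or on the required skill (for an action MoE); here the router logits are simply passed in by the caller.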
A Graduate Student from a Non-Elite ("Double-Non") University, Feeling Lost in This Year's Job Hunt...
自动驾驶之心· 2025-07-14 14:04
Core Viewpoint
- The article emphasizes the importance of staying current with cutting-edge technologies in autonomous driving and embodied intelligence, highlighting the need for strong technical skills in advanced areas such as large models, reinforcement learning, and 3D graphics [4][5].
Group 1: Industry Trends
- There is growing demand for talent in robotics and embodied intelligence, with many startups receiving significant funding and showing rapid growth potential [4][5].
- Major companies are shifting from traditional methods to end-to-end solutions and large models, indicating a technological evolution in the industry [4][5].
- The community aims to build a comprehensive ecosystem that connects academia, products, and recruitment, fostering a collaborative environment for knowledge sharing and job opportunities [6].
Group 2: Technical Directions
- The article outlines four key technical directions in the industry: visual large language models, world models, diffusion models, and end-to-end autonomous driving [9].
- It provides resources and summaries of research papers and datasets related to these technologies, indicating a strong emphasis on research and development [10][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32][35][36][38].
Group 3: Community and Learning Resources
- The community offers a variety of learning materials, including video courses, hardware, and coding resources, aimed at equipping individuals with the skills needed for the evolving job market [6].
- It also supports discussion of the latest industry trends, technical challenges, and job opportunities, which matters for professionals looking to advance their careers [6].
In the Era of VLA, Why Does This Company Stick with Non-End-to-End Solutions for Mass Production?
自动驾驶之心· 2025-07-14 11:54
Core Viewpoint
- Company A has adopted a low-cost, incremental strategy, avoiding large-scale investment in end-to-end solutions and meeting requirements with its existing modular solutions [1].
Group 1: Company Strategy
- Company A's main strategy is a two-stage approach built on perception and prediction (map, OD, OCC, etc.), which lets it stay competitive as a mid-to-low-cost supplier [1].
- The company accumulated a large amount of relevant training data early on, which makes modular solutions look like the most cost-effective option [1].
- Many Tier 1 suppliers are likely to stay with their existing production solutions unless end-to-end solutions can demonstrably outperform modular ones across multiple areas [1].
Group 2: Industry Context
- A common challenge across the industry, Company A included, is the high cost of advanced research and development, which limits the ability to scale production [1].
- The reliance on modular solutions is partly due to the importance of supply chain stability, as manufacturers prefer solutions that are compatible with existing ecosystems [1].
What Is the State of the Autonomous Driving Field Today, and What Solutions Are Out There?
自动驾驶之心· 2025-07-14 11:30
Core Insights
- The article discusses the current state of the autonomous driving industry, including job opportunities and technological advancements [1][10].
Autonomous Driving Levels and Applications
- The main functions of autonomous driving include driving, parking, cabin interaction, and V2X (Vehicle-to-Everything) communication [1].
Key System Components
- The core systems consist of chips, software, and sensors [3].
Technology Trends Overview
- The traditional autonomous driving pipeline is being complemented by end-to-end autonomous driving and by algorithmic solutions such as VLM (Vision-Language Model) and VLA (Vision-Language-Action) [5][6][7].
Major Players in the Industry
- New entrants in the market include companies like Xpeng, Li Auto, NIO, Huawei, and others, while established manufacturers include BYD, Geely, and international brands like Mercedes and Volkswagen [7].
- Suppliers in the industry include listed companies such as Horizon Robotics and Momenta, as well as major tech firms like Baidu and Didi [7].
Job Positions and Directions
- Traditional roles focus on localization and mapping, perception layers, and post-fusion techniques, while new roles emphasize end-to-end algorithms, reinforcement learning, and data loop experience [8].
Community and Resources
- The AutoRobo Knowledge Planet serves as a hub for job seekers in autonomous driving and related fields, offering interview questions, industry reports, and resume optimization services [10][11][22].
Autonomous Driving Roundtable | What Happened in Autonomous Driving in the First Half of the Year?
自动驾驶之心· 2025-07-14 11:30
Core Viewpoint
- The article discusses the current state and future directions of autonomous driving technology, highlighting which technologies have matured, which challenges remain, and which trends are emerging in the industry.
Group 1: Current Technology Maturity
- BEV (Bird's Eye View) and OCC (Occupancy) perception methods have matured, and no major player claims that BEV is unusable [2][13].
- The main challenge remains corner cases: 99% of scenarios are manageable, but complex situations such as rural roads and large intersections still pose difficulties [13].
- E2E (End-to-End) models have not yet demonstrated clear advantages over two-stage models in practical applications, despite their theoretical appeal [4][5].
Group 2: Emerging Technologies
- VLA (Vision-Language-Action) is gaining attention because it simplifies tasks and may address corner cases more effectively than traditional methods [5][6].
- Model efficiency is a critical issue, with discussion of using smaller models to achieve performance close to that of larger ones [6][30].
- Reinforcement learning has not yet proven significantly impactful in autonomous driving, and better simulation environments are needed to validate its effectiveness [7][51].
Group 3: Future Directions
- There is a consensus that VLA and VLM (Vision-Language Model) will be key areas for future development, with a focus on enhancing reasoning capabilities and safety [45][48].
- The industry is moving towards a more data-driven approach, where the efficiency of data collection, cleaning, and training will determine competitive advantage [28][40].
- The integration of world models and closed-loop simulation is seen as essential for advancing autonomous driving technologies [47][50].
Group 4: Industry Perspectives
- The shift towards VLA/VLM is viewed as a necessary evolution, with the potential to improve user experience and safety in autonomous vehicles [28][45].
- The debate between deepening expertise in autonomous driving and transitioning to embodied intelligence reflects the industry's evolving landscape and personal career choices [22][27].
- The current focus on safety and robustness in L4 (Level 4) autonomous driving indicates a divergence in technical approaches between L2+ and L4 players [25][36].
Latest from XPeng! NavigScene: Global Navigation Enables Beyond-Visual-Range Autonomous Driving VLA (ACMMM'25)
自动驾驶之心· 2025-07-14 11:30
Core Insights
- The article discusses the development of NavigScene, a novel dataset aimed at bridging the gap between local perception and global navigation in autonomous driving systems, enhancing their reasoning and planning capabilities [2][12][14].
Group 1: Overview of NavigScene
- NavigScene is designed to integrate local sensor data with global navigation context, addressing the limitations of existing autonomous driving models that rely primarily on immediate visual information [5][9].
- The dataset includes two subsets, NavigScene-nuScenes and NavigScene-NAVSIM, which provide paired data to support comprehensive scene understanding and decision-making [9][14].
Group 2: Methodologies
- Three complementary paradigms are proposed to leverage NavigScene:
  1. Navigation-guided reasoning (NSFT) enhances vision-language models by incorporating navigation context [10][19].
  2. Navigation-guided preference optimization (NPO) improves generalization to new scenarios through reinforcement learning [24][26].
  3. The navigation-guided vision-language-action (NVLA) model integrates navigation guidance with traditional driving models for better performance [27][28].
Group 3: Experimental Results
- Experiments demonstrate that integrating global navigation knowledge significantly improves the performance of autonomous driving systems in tasks such as perception, prediction, and planning [12][34][39].
- Models trained with NavigScene outperform baseline models across metrics including BLEU-4, METEOR, and CIDEr, showing enhanced reasoning capabilities [32][34].
Group 4: Practical Implications
- Integrating NavigScene allows autonomous systems to make more informed decisions in complex driving environments, leading to improved safety and reliability [12][42].
- The findings highlight the importance of incorporating beyond-visual-range (BVR) knowledge for effective navigation and planning in autonomous driving applications [8][12].
ICCV25! Baidu's U-ViLAR: Multi-Task SOTA in Visual Localization, Painlessly Compatible with End-to-End Frameworks
自动驾驶之心· 2025-07-14 11:30
Core Insights
- The article discusses the U-ViLAR framework developed by Baidu, which focuses on uncertainty-aware visual localization for autonomous driving, addressing the challenges posed by GNSS signal interference in urban environments [2][26].
Group 1: Importance of Visual Localization
- In urban settings, GNSS signals can be unreliable due to obstructions like buildings and tunnels, making visual localization technology crucial [2].
- Traditional methods rely on feature matching between images and 3D maps; they are sensitive to changes in perspective and lighting, and the maps are costly to construct at large scale [2].
Group 2: U-ViLAR Framework
- U-ViLAR models perception and localization uncertainties separately, improving performance in both large-scale re-localization and fine localization tasks [2][26].
- The framework consists of two key modules: PU-Guided Association, which uses perception uncertainty to guide the association of visual and map features, and LU-Guided Registration, which uses localization uncertainty for precise registration [4].
Group 3: Technical Implementation
- The framework employs a shared backbone network (such as ResNet) to extract features from multi-view images and projects them into BEV (Bird's Eye View) space [6].
- It supports HD maps and navigation maps, extracting BEV features from map elements using a U-Net structure [7].
- Cross-modal fusion is achieved through alternating self-attention and cross-attention mechanisms that enhance the visual and map BEV features [8].
Group 4: Experimental Results
- U-ViLAR demonstrated superior performance in fine-grained localization tasks on the nuScenes and SRoad datasets, significantly reducing localization errors [20].
- In large-scale re-localization tasks, it outperformed existing methods on datasets like KITTI, nuScenes, and SRoad, showing robustness in both coarse and fine localization [20].
- The framework achieves a processing speed of 28 frames per second on NVIDIA V100 GPUs and 15 frames per second on optimized NVIDIA Orin platforms [20].
Group 5: Ablation Studies
- Ablation studies confirmed the effectiveness of key components such as perception uncertainty-guided association and localization uncertainty-guided registration, indicating that removing any component leads to performance degradation [21].
Group 6: Future Directions
- Future work will focus on improving localization accuracy in challenging scenarios and enhancing the model's generalization so that it supports various datasets and map types [26].
Beyond VLA: A Roundup of Embodied Intelligence + VA Work
自动驾驶之心· 2025-07-14 10:36
Core Insights
- The article focuses on advancements in embodied intelligence and robotic manipulation, highlighting various research projects and methodologies aimed at improving robot learning and performance in real-world tasks [2][3][4].
Group 1: 2025 Research Highlights
- Numerous projects are set for 2025, including "Steering Your Diffusion Policy with Latent Space Reinforcement Learning" and "Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation," which aim to enhance robotic capabilities in manipulation and interaction [2].
- The "BEHAVIOR Robot Suite" aims to streamline real-world whole-body manipulation for everyday household activities, indicating a focus on practical applications of robotic technology [2].
- "You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations" emphasizes the potential for robots to learn complex tasks from minimal demonstrations, showcasing advancements in imitation learning [2].
Group 2: Methodological Innovations
- The article discusses innovative methodologies such as "Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning," which aims to improve the adaptability of robots in different environments [2].
- "Learning Dexterous In-Hand Manipulation with Multifingered Hands via Visuomotor Diffusion" highlights the focus on enhancing dexterity in robotic hands, which is crucial for complex manipulation tasks [4].
- "Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation" indicates a trend towards using synthetic data to train robots, which can significantly reduce the need for real-world data collection [7].
Group 3: Future Directions
- The research agenda for 2024 and beyond includes projects like "Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching," which suggests a shift towards utilizing advanced data representations for improved learning outcomes [9].
- "Zero-Shot Framework from Image Generation World Model to Robotic Manipulation" indicates a future direction where robots can generalize from visual data without prior specific training, enhancing their versatility [9].
- The emphasis on "Human-to-Robot Data Augmentation for Robot Pre-training from Videos" reflects a growing interest in leveraging human demonstrations to improve robotic learning efficiency [7].