自动驾驶之心
A Deep Dive into the Origins of π*0.6 Iterative Reinforcement Learning: VLA + Online RL for Embodied Evolution
自动驾驶之心· 2025-12-13 02:04
Core Insights
- The article discusses the significance of the π*0.6 iterative reinforcement learning approach in the context of VLA (Vision-Language-Action) models, highlighting its potential for self-improvement in robotics [2][3]
- It emphasizes the limitations of imitation learning and the necessity of reinforcement learning for robust and persistent robot performance [8][11]

Group 1: Importance of VLA+RL
- VLA+RL is crucial as it allows robots to learn from real-world interactions, overcoming the limitations of offline reinforcement learning, which is constrained by the quality of demonstration data [4][8]
- The article notes that while imitation learning can enable robots to perform actions, it does not guarantee consistent success in novel situations [8][11]

Group 2: Challenges in Applying Reinforcement Learning to VLA
- The article identifies three main challenges in applying reinforcement learning to VLA: environmental differences, model instability, and computational demands [21][22]
- It discusses the risk of catastrophic forgetting and model collapse when RL algorithms are applied directly to large VLA models [22][24]

Group 3: iRe-VLA Model and Its Architecture
- The iRe-VLA model features a two-phase iterative learning process, combining exploration through online reinforcement learning with consolidation through supervised learning [17][24]
- The architecture consists of a VLM (Vision-Language Model) for understanding and an Action Head for executing actions, using techniques like LoRA for efficient training [19][20]

Group 4: Experimental Results and Analysis
- Experiments in both simulated and real-world environments demonstrate the effectiveness of the iRe-VLA approach, showing significant improvements in task success rates [44][48]
- The iRe-VLA model outperformed traditional methods, raising the success rate from 43% to 83% on benchmark tasks [48][50]

Group 5: Conclusion and Future Directions
- The article concludes that the iRe-VLA framework provides a viable solution for deploying large models in robotic control, addressing challenges in stability, efficiency, and continuous learning [60][62]
- It suggests numerous research opportunities ahead, particularly in efficient exploration and scalable RL algorithms for VLA [62][63]
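The two-phase loop summarized above — online RL exploration followed by supervised consolidation — can be illustrated with a toy Python sketch. All names here (`ToyPolicy`, `ire_vla_loop`, the scalar "skill") are invented for illustration and are not from the paper; the structure only mirrors the description: RL updates touch just the action head while the backbone stays frozen, and a periodic supervised pass over demonstrations plus newly collected successes consolidates behavior and guards against catastrophic forgetting.

```python
import random

class ToyPolicy:
    """Stand-in for a VLA policy: a single scalar 'skill' parameter."""
    def __init__(self):
        self.skill = 0.2          # initial task success probability
        self.consolidations = 0   # how many supervised passes ran

    def rl_update_action_head(self, reward):
        # Stage-1 update: small and local, mimicking "only the action
        # head moves during online RL" for stability.
        self.skill = min(1.0, self.skill + 0.05 * reward)

    def supervised_update(self, dataset):
        # Stage-2 update: a full supervised pass (e.g. via LoRA) over
        # the aggregated dataset of demos + successful rollouts.
        self.consolidations += 1
        self.skill = min(1.0, self.skill + 0.01 * len(dataset))

def ire_vla_loop(policy, n_iterations=3, episodes_per_iter=10, seed=0):
    rng = random.Random(seed)
    dataset = ["demo"] * 5        # start from expert demonstrations
    for _ in range(n_iterations):
        # Stage 1: online RL exploration; keep only successful rollouts.
        successes = []
        for _ in range(episodes_per_iter):
            succeeded = rng.random() < policy.skill
            policy.rl_update_action_head(1.0 if succeeded else 0.0)
            if succeeded:
                successes.append("rollout")
        # Stage 2: supervised consolidation on demos + new successes.
        dataset.extend(successes)
        policy.supervised_update(dataset)
    return policy, dataset

policy, dataset = ire_vla_loop(ToyPolicy())
print(policy.consolidations)  # → 3 (one consolidation per iteration)
```

The key design point the sketch preserves is that exploration and consolidation alternate: successes discovered by RL are folded back into the supervised dataset, so the model improves iteratively without drifting away from the demonstration distribution.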
Hands-On Tests of Qwen3-VL in Autonomous Driving Scenarios...
自动驾驶之心· 2025-12-12 07:35
In recent years, the potential of multimodal large models in autonomous driving has become increasingly apparent. Whether they can truly "read" road conditions, understand traffic behavior, and even predict risk has become a focal point inside and outside the industry. The author ran a series of autonomous-driving scenario tests on Alibaba Tongyi's latest Qwen3-VL model, covering scene understanding, spatial reasoning, behavior judgment, risk prediction, and other dimensions. In the author's view, Qwen3-VL is not only solid on basic perception tasks but also shows surprising "experienced driver" instincts in open-ended reasoning and dynamic scene understanding. More importantly, it has undergone no autonomous-driving-specific instruction fine-tuning (SFT), yet it makes reasonable, coherent, even "safety-aware" judgments about complex traffic scenes — which points to broader possibilities for deploying general vision-language models in vertical domains. The tests use images from the CoVLA benchmark, some of the benchmark's questions in Chinese translation, and several open-ended questions devised by the author. Let's take a look!

Scene Understanding and Spatial Reasoning
Example 1: Briefly describe this image. / What is the weather in the image? / Where is the vehicle driving ...
自动驾驶之心 Launches Paper Coaching (End-to-End / OCC / BEV / VLA and Other Directions)
自动驾驶之心· 2025-12-12 07:35
If you need to publish a paper, we support consultation on your own topic or research direction; contact us on WeChat: paperguidance. Our services have a high acceptance rate! 自动驾驶之心 paper coaching is officially launched, with highly professional domestic mentors. Covered directions include end-to-end, VLA, world models, reinforcement learning, 3D object detection, multi-sensor fusion, 3DGS, BEV perception, Occupancy Network, multi-task learning, semantic segmentation, trajectory prediction, motion planning, diffusion models, flow matching, point-cloud perception, millimeter-wave radar, monocular perception, and lane-line / online HD-map construction.

Services offered:
- Topic selection
- Full-process paper guidance
- Experiment guidance
- PhD application guidance

Mentored papers have been accepted at top venues including CVPR, AAAI, ECCV, CoRL, ICLR, IROS, ICRA, and ACL. Pricing depends on the target venue level:
- Top autonomous-driving conferences/journals; CCF-A, CCF-B, CCF-C
- SCI Q1-Q4
- Chinese Academy of Sciences Zone 1-4
- EI / Chinese core journals
- Graduation thesis / PhD application / competitions

For more details, contact our research assistant on WeChat: paperguidance ...
Classes Officially Begin! 7 Projects to Understand the State of End-to-End Deployment
自动驾驶之心· 2025-12-12 03:02
Core Insights
- The article discusses the evolving recruitment landscape in the autonomous driving industry, highlighting a shift in demand from perception roles to end-to-end, VLA, and world-model positions [2]
- A new advanced course focused on end-to-end production in autonomous driving has been designed, emphasizing practical applications and real-world experience [2][4]

Course Overview
- The course is structured into eight chapters, covering end-to-end algorithms from task overview through two-stage and one-stage frameworks, navigation-information applications, reinforcement learning, trajectory optimization, and production experience sharing [5][7][8][9][10][11][12][13][14]
- Chapter 1 introduces the integration of perception tasks and learning-based control algorithms, essential skills for companies in the end-to-end era [7]
- Chapter 2 focuses on the two-stage end-to-end algorithm framework, discussing its modeling and the information transfer between perception and planning [8]
- Chapter 3 covers one-stage end-to-end algorithms, emphasizing their performance advantages and the main frameworks [9]
- Chapter 4 highlights the critical role of navigation information in autonomous driving and its integration into end-to-end models [10]
- Chapter 5 introduces reinforcement learning algorithms, addressing the limitations of imitation learning and the need for generalization [11]
- Chapter 6 involves practical projects on trajectory-output optimization, combining imitation and reinforcement learning [12]
- Chapter 7 discusses post-processing logic for trajectory smoothing and reliability in production [13]
- The final chapter shares production experience from multiple perspectives, focusing on tools and strategies for real-world applications [14]

Target Audience
- The course is aimed at advanced learners with a foundational understanding of autonomous driving algorithms, reinforcement learning, and programming skills [15][17]
In 2025, Bosch Is Reinventing Itself...
自动驾驶之心· 2025-12-12 03:02
Core Viewpoint
- Bosch, a leading international Tier 1 supplier, is investing heavily in both research and production lines for autonomous driving, aiming to keep pace with the rapid development of the domestic smart-driving sector [2]

Group 1: Bosch's Investment and Research Directions
- Bosch is focusing on two main areas, production and research, with significant resources allocated to end-to-end solutions [2]
- The company has recently recruited numerous technical experts to strengthen its autonomous-driving capabilities [2]
- Bosch's research includes notable algorithms and projects, such as DGS and FlowDrive, aimed at improving autonomous driving technologies [2][4]

Group 2: Key Research Contributions
- DGS (Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction) achieves high-quality geometric reconstruction and depth estimation without LiDAR, significantly reducing complexity and cost [5]
- FlowDrive introduces an energy flow field for end-to-end autonomous driving, enhancing safety and interpretability in trajectory generation [9]
- AnchDrive combines dynamic and static trajectory anchors to improve the efficiency and performance of trajectory generation [13]

Group 3: Advanced Techniques and Frameworks
- DiffSemanticFusion enhances the stability and semantic richness of online HD maps, improving trajectory prediction and planning tasks [16]
- IRL-VLA presents a closed-loop reinforcement learning framework that optimizes driving strategies without relying on high-fidelity simulators, achieving advanced performance metrics [19]
- SparseMeXT redefines online HD map construction using sparse representations, achieving superior accuracy and efficiency compared to existing methods [21]

Group 4: Data and Model Innovations
- The Impromptu VLA dataset addresses performance issues in unstructured driving scenarios, providing a large-scale, high-quality resource for training VLA models [23]
- DiffVLA integrates vision-language models for enhanced decision-making in complex driving environments, demonstrating improved robustness and generalization [25]
- DINO-R1 introduces a novel training strategy for visual foundation models, significantly enhancing reasoning capabilities in visual prompt detection [27]
The Diffusion Language Model from Renmin University May Rewrite History...
自动驾驶之心· 2025-12-12 03:02
Core Viewpoint
- The article discusses the development and future of diffusion language models, highlighting two main phases: foundational research (2022-2024) and scaling (2024-2025) [3][14]

Phase 1: Foundational Research (2022-2024)
- Diffusion language models were initially niche, with work split between continuous and discrete models [4][5]
- Continuous diffusion models have been applied to discrete data, with notable works including those by Percy Liang and Alex Graves [6]
- A method proposed at ICML 2024 unifies Bayesian flow networks and diffusion models without requiring the data to be made continuous [7]
- Discrete diffusion models have evolved since their introduction in 2015, with modern iterations like D3PM and SEDD improving the optimization loss functions [8]
- The relationship between MDM (Masked Diffusion Model) and BERT is explored, emphasizing the technical distinctions and the generative nature of diffusion models [11][12]

Phase 2: Scaling (2024-2025)
- The research group aims to focus on MDM projects, ensuring each member makes a significant contribution [15]
- The first scaling law for MDM is set to be presented at ICLR 2025, demonstrating that MDM can match autoregressive models in performance [16]
- The LLaDA model, capable of multi-turn dialogue, shows promising scalability and instruction-following ability, comparable to LLaMA 3 [16]
- The industrial response to LLaDA includes rapid developments like Mercury Coder and Gemini Diffusion, although these products were not directly influenced by the academic work [19]
- LLaDA is recognized as a significant contribution to the field, enhancing understanding of generative models despite criticisms of its novelty [21]
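The distinction drawn above between MDM and BERT can be made concrete with a minimal sketch of the discrete masked-diffusion forward (noising) process: at noise level t in [0, 1] each token is independently replaced by a mask token with probability t, so training spans every masking ratio from clean data (t=0) to fully masked (t=1) — unlike BERT's single fixed ratio, which is what makes the diffusion formulation generative. This is an illustrative toy, not code from D3PM, SEDD, or LLaDA.

```python
import random

MASK = "[MASK]"

def mdm_forward(tokens, t, rng):
    """Forward process of a masked diffusion model: each token is
    independently replaced by [MASK] with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(42)
sent = ["the", "car", "merges", "onto", "the", "highway"]
clean = mdm_forward(sent, 0.0, rng)    # t=0: no tokens masked
masked = mdm_forward(sent, 1.0, rng)   # t=1: every token masked
```

The reverse (generation) process would start from the fully masked sequence and iteratively predict and unmask tokens, which is the part a trained MDM supplies.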
Surpassing π0 and π0.5 Across the Board! Lumo-1, an End-to-End Whole-Body VLA Model
自动驾驶之心· 2025-12-12 03:02
Core Insights
- The article discusses advances in robotics, focusing on the Lumo-1 model developed by Stardust Intelligence, which aims to enhance robots' reasoning and action capabilities, allowing them to perform complex tasks without explicit programming [9][11][12]

Group 1: Lumo-1 Model Overview
- Lumo-1 is an end-to-end VLA model designed to let robots understand and execute tasks through reasoning rather than merely mimicking actions [9]
- The model demonstrates superior operational intelligence and generalization, outperforming previous models like π0 and π0.5 on multi-step tasks and on unseen objects and instructions [11][13]

Group 2: Training Phases
- The training of Lumo-1 consists of three stages:
  1. Embodied VLM pre-training on visual-language data to develop spatial understanding and trajectory inference [17]
  2. Cross-domain joint training to enhance instruction following and spatial reasoning [18]
  3. Real-world reasoning-action training using the Astribot S1 robot to learn executable action patterns [18][20]

Group 3: Technical Innovations
- Lumo-1 employs a Spatial Action Tokenizer (SAT) to model the action space, allowing actions to be combined and reused in a structured manner [21]
- The model integrates structured reasoning to form a chain of explanations for its actions, understanding the "why" behind a task before executing the "how" [25]

Group 4: Performance and Validation
- Lumo-1 shows significant improvements on various multimodal benchmarks, outperforming specialized models like RoboBrain-7B and Robix-7B [31]
- The model adapts to different environments and instructions, demonstrating robust generalization such as adjusting arm positions for containers of varying heights [31]

Group 5: Implications for the Industry
- The findings suggest that data diversity in training matters more for generalization than sheer data volume, indicating a shift in focus toward data quality [30]
- The advances in Lumo-1 highlight the potential for robots to perform complex tasks autonomously, which could transform industries reliant on automation and robotics [9][11]
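The internal design of the Spatial Action Tokenizer is not detailed here, but the general idea behind action tokenization — mapping continuous control values to discrete tokens a language model can emit, and back — can be sketched with simple uniform binning. The ranges, bin count, and function names below are illustrative assumptions, not the SAT's actual scheme.

```python
def tokenize_action(action, low=-1.0, high=1.0, n_bins=256):
    """Quantize each continuous action dimension into one of n_bins
    discrete tokens, so actions can be emitted like language tokens."""
    span = high - low
    return [round((min(max(a, low), high) - low) / span * (n_bins - 1))
            for a in action]

def detokenize_action(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map discrete tokens back to continuous action values."""
    span = high - low
    return [low + t * span / (n_bins - 1) for t in tokens]

# Round-trip demo: quantization error is bounded by half a bin width.
act = [0.30, -0.75, 0.0]
toks = tokenize_action(act)
rec = detokenize_action(toks)
```

With 256 bins over [-1, 1] the maximum round-trip error is 1/255 ≈ 0.004, so the discretization is nearly lossless for typical control precision while keeping the vocabulary small enough to compose and reuse, as the article says SAT enables.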
An 84-Page Survey from Nanjing University on Unified Multimodal Understanding and Generation...
自动驾驶之心· 2025-12-11 03:35
Core Insights
- The article discusses the evolution and significance of Unified Foundation Models (UFM) in AI, focusing on the integration of understanding and generation capabilities across multiple modalities [1][3][41]
- A comprehensive survey titled "A Survey of Unified Multimodal Understanding and Generation: Advances and Challenges" has been published, providing a systematic framework for UFM research, including architecture classification, technical details, training processes, and practical applications [1][4][41]

Group 1: Importance of Unified Multimodal Models
- Combining understanding and generation in a single model is necessary because it allows more complex and coherent task execution [3][4]
- Current open-source UFMs, while competitive on some tasks, still lag behind proprietary models like GPT-4o and Gemini 2.0 Flash, highlighting the need for a unified approach to overcome fragmentation in the open-source community [4][6]

Group 2: Evolution of Unified Foundation Models
- The evolution of UFM is categorized into three stages:
  1. Isolation Stage: understanding and generation are handled by separate models [6]
  2. Combination Stage: understanding and generation modules are integrated within a single framework [7]
  3. Emergent Stage: the ultimate goal, where models switch seamlessly between understanding and generation, akin to human cognition [8][9]

Group 3: Architectural Framework of UFM
- UFM architectures fall into three main types based on how tightly the understanding and generation modules are coupled:
  1. External Service Integration: LLMs act as task coordinators, calling external models for specific tasks [12][13]
  2. Modular Joint Modeling: LLMs connect understanding and generation tasks through intermediary layers [14][15]
  3. End-to-End Unified Modeling: a single architecture handles both understanding and generation, representing the highest level of integration [20][21]

Group 4: Technical Details of UFM
- The technical aspects of UFM are broken down into encoding, decoding, and training processes, with detailed methodologies for each [22][32]
- Encoding strategies include continuous, discrete, and hybrid approaches for converting multimodal data into a format suitable for model processing [27][30]
- Decoding transforms model outputs back into human-readable formats, using various techniques to improve quality and efficiency [28][31]

Group 5: Applications and Future Directions
- UFM applications span robotics, autonomous driving, world modeling, and medical imaging, with specific use cases outlined for each domain [39][42]
- Future research directions focus on improved modeling architectures, unified tokenizers, refined training strategies, and benchmarks that evaluate understanding-generation synergy [40][42]
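Of the encoding strategies the survey lists, the discrete approach is commonly realized with a VQ-style tokenizer: each continuous feature vector is replaced by the index of its nearest codebook entry, turning images or other modalities into token sequences the model can process alongside text. The codebook and feature values below are made-up toy numbers for illustration.

```python
def vq_encode(vectors, codebook):
    """Discrete encoding: map each continuous feature vector to the
    index of its nearest codebook entry (VQ-style tokenization)."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda i: dist2(v, codebook[i]))
            for v in vectors]

# Toy 2-D codebook with three entries and two feature vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
feats = [[0.9, 0.1], [0.1, 0.8]]
print(vq_encode(feats, codebook))  # → [1, 2]
```

Continuous encodings would instead pass the raw feature vectors through (no quantization), and hybrid schemes combine both, trading the reconstruction fidelity of continuous features against the sequence-modeling convenience of discrete tokens.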
A Year On, DiffusionDrive Upgrades to v2 and Sets a New Record!
自动驾驶之心· 2025-12-11 03:35
Core Insights
- The article discusses the upgrade of DiffusionDrive to version 2, highlighting its advances in end-to-end autonomous-driving trajectory planning through reinforcement learning, which addresses the challenge of achieving both diversity and sustained high quality in trajectory generation [1][3][10]

Background Review
- The shift toward end-to-end autonomous driving (E2E-AD) has accelerated as traditional tasks like 3D object detection and motion prediction have matured; early methods faced modeling limitations, often generating a single trajectory with no alternatives in complex driving scenarios [5][10]
- Previous diffusion models applied to trajectory generation suffered from mode collapse, yielding little diversity in generated behavior; DiffusionDrive introduced a Gaussian Mixture Model (GMM) to define the prior distribution of the initial noise, promoting diverse behavior generation [5][13]

Methodology
- DiffusionDriveV2 introduces a framework that uses reinforcement learning to overcome the limitations of imitation learning, which previously forced a trade-off between diversity and sustained high quality in trajectory generation [10][12]
- The framework incorporates intra-anchor GRPO and inter-anchor truncated GRPO to confine advantage estimation within a given driving intention, preventing mode collapse by avoiding inappropriate comparisons across different intentions [9][12][28]
- The method employs scale-adaptive multiplicative noise to enhance exploration while maintaining trajectory smoothness, addressing the inherent scale inconsistency between the proximal and distal segments of a trajectory [24][39]

Experimental Results
- Evaluations on the NAVSIM v1 and NAVSIM v2 datasets show that DiffusionDriveV2 achieves state-of-the-art performance, with a PDMS of 91.2 on NAVSIM v1 and 85.5 on NAVSIM v2, significantly outperforming previous models [10][33]
- The results indicate that DiffusionDriveV2 effectively balances trajectory diversity and sustained quality, achieving optimal performance in closed-loop evaluation [38][39]

Conclusion
- The article concludes that DiffusionDriveV2 addresses the inherent challenges of imitation learning in trajectory generation, achieving an optimal trade-off between planning quality and diversity through its reinforcement-learning techniques [47]
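The intra-anchor idea can be made concrete with GRPO's group-relative advantage: each rollout's reward is normalized by the mean and standard deviation of its own group, and "intra-anchor" restricts the group to rollouts sharing one driving intention (anchor). The reward values and anchor names below are invented for illustration; this sketches the grouping principle, not DiffusionDriveV2's full objective.

```python
from statistics import mean, pstdev

def intra_anchor_grpo_advantages(rewards_by_anchor, eps=1e-8):
    """GRPO-style advantages computed separately per anchor group:
    a trajectory is normalized only against rollouts that share its
    driving intention, never against other intentions."""
    advantages = {}
    for anchor, rewards in rewards_by_anchor.items():
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages[anchor] = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages

adv = intra_anchor_grpo_advantages({
    "left_turn":   [0.2, 0.4, 0.6],   # toy rollout rewards per anchor
    "go_straight": [0.8, 0.9, 1.0],
})
```

Note that the best left-turn rollout receives a positive advantage even though its raw reward (0.6) is below every go-straight reward; a single global baseline would have penalized it and pushed the policy toward the dominant intention — exactly the cross-intention comparison, and resulting mode collapse, that the intra-anchor grouping avoids.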
World Models and VLA Are Gradually Converging Toward Unification
自动驾驶之心· 2025-12-11 03:35
Core Viewpoint
- The integration of Vision-Language-Action (VLA) and World Model (WM) technologies is becoming increasingly evident, suggesting a trend toward unification rather than opposition in autonomous driving [3][5][7]

Group 1: Technology Trends
- VLA and WM are complementary technologies: VLA focuses on abstract reasoning and WM on physical perception, and both are essential for advanced general artificial intelligence (AGI) [4]
- Recent academic work has demonstrated the feasibility of combining VLA and WM, with projects like DriveVLA-W0 showcasing successful joint training [4]
- The future training pipeline for Level 4 (L4) autonomous systems is expected to incorporate VLA, reinforcement learning (RL), and WM, indicating that all three components will be necessary [5]

Group 2: Community and Learning Resources
- The "Autonomous Driving Heart Knowledge Planet" community provides a platform for learning and sharing knowledge in the autonomous-driving sector, with over 4,000 members and plans to grow to nearly 10,000 [10][28]
- The community offers video content, learning routes, and Q&A sessions for both beginners and advanced practitioners [10][12]
- A compilation of over 40 technical routes and numerous autonomous-driving datasets is available, giving newcomers and experienced professionals quicker access to essential information [29][48]

Group 3: Job Opportunities and Networking
- The community runs a job-referral mechanism with various autonomous-driving companies, allowing members to connect with potential employers [22]
- Regular discussions and insights from industry leaders give members valuable perspectives on career development and industry trends [14][107]