自动驾驶之心
A review of the latest RL advances for VLA~
自动驾驶之心· 2025-07-03 12:41
Core Viewpoint
- The article surveys recent advances in Vision-Language-Action (VLA) models, focusing on how Reinforcement Learning (RL) techniques are being integrated to improve their performance and stability across tasks [1].

Group 1: Early Exploration with iRe-VLA
- The core algorithm of iRe-VLA is PPO; it introduces a two-stage training paradigm to address the instability of online reinforcement learning [2].
- The implementation uses BLIP-2 3B as the VLM backbone, replacing the final fully connected layer with an action head consisting of a token learner and an MLP [2].
- Experiments run in simulation environments such as Meta-World and Franka Kitchen, with tasks divided into three categories for evaluation [2].

Group 2: Preference Alignment with GRAPE
- GRAPE introduces preference alignment into VLA training, designed specifically around VLA characteristics [6].
- The reward for each trajectory combines three parts: a success reward, a self-reward, and an external reward based on a custom cost function [8] (see the first sketch after this summary).
- The external reward is computed by decomposing trajectories into stages and evaluating them with a VLM task decomposer [9].

Group 3: LOOP and RIPT-VLA
- LOOP combines RLOO and PPO to handle sparse rewards and long action sequences in multi-task settings [11] (a leave-one-out advantage sketch follows below).
- RIPT-VLA employs the LOOP algorithm for online RL and provides open-source code [13].
- The approach adds several tricks to improve training efficiency, such as dynamic rejection mechanisms and multi-task sampling [15].

Group 4: System and Algorithm Innovations in RL4VLA
- RL4VLA models action generation as a multi-modal dialogue and trains with PPO, using dense pseudo-rewards to guide training [18].
- Training involves a Robotic Process Reward Model that predicts the likelihood of action sequences, strengthening the reward signal [20].
- The article emphasizes adaptive curriculum selection strategies to improve sample efficiency and generalization [21][23].

Group 5: Engineering Challenges and Future Directions
- New RL algorithms suited to VLA-RL are needed, particularly ones that address sparse rewards and improve sample efficiency [30].
- Improving sampling efficiency and managing memory costs are key engineering challenges in VLA scenarios [30].
- Effective reward design and applying RL to non-autoregressive VLA architectures are identified as critical directions for future research [30].
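To make GRAPE's three-part trajectory reward (Group 2) concrete, here is a minimal sketch. Only the success + self + external decomposition comes from the article; the weights `w_self`/`w_ext`, the binary success term, and the exponential folding of stage costs are invented here for illustration, not GRAPE's actual formulation.

```python
import math

def trajectory_reward(success: bool,
                      self_log_prob: float,
                      stage_costs: list,
                      w_self: float = 0.1,
                      w_ext: float = 0.5) -> float:
    # Success reward: binary task outcome (weight assumed to be 1).
    r_success = 1.0 if success else 0.0
    # Self-reward: the policy's own log-likelihood of the trajectory
    # (more likely under the current policy -> higher reward).
    r_self = w_self * self_log_prob
    # External reward: per-stage costs from a VLM task decomposer,
    # folded into one scalar where lower total cost means higher reward.
    r_ext = w_ext * math.exp(-sum(stage_costs))
    return r_success + r_self + r_ext

# Example: a successful rollout with three evaluated stages.
print(trajectory_reward(True, -12.3, [0.2, 0.5, 0.1]))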
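For Group 3, the RLOO half of LOOP reduces to a leave-one-out baseline over K rollouts of the same task. The sketch below is the standard RLOO advantage computation; how LOOP wires these advantages into PPO-style clipped updates is not shown.

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages: each rollout's baseline is the mean
    reward of the other K-1 rollouts sampled for the same task."""
    k = rewards.shape[0]
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines

# Example: 4 rollouts of one task with sparse terminal rewards.
adv = rloo_advantages(np.array([1.0, 0.0, 0.0, 1.0]))
# -> [ 0.667, -0.667, -0.667, 0.667 ]
```

Because the baseline is built from sibling rollouts rather than a learned critic, this estimator is well suited to the sparse-reward, long-sequence setting the article describes.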
Have you ever been stuck for more than a week by a bug you only later realized was fatal?
自动驾驶之心· 2025-07-03 12:41
Core Insights
- The article recounts the challenges of training AI models with reinforcement learning, highlighting the importance of reward design and the pitfalls that arise along the way [1][2].

Group 1: Reinforcement Learning Challenges
- The author shares a project in which a robot was trained to run; different reward structures led to unexpected behaviors, such as the robot jumping too far and falling [1] (see the toy reward sketch below).
- The design of learning objectives is crucial: poorly defined goals produce models that do not behave as intended, for example generating repetitive outputs or failing to learn at all [2].

Group 2: AI Model Training Insights
- The robustness of neural networks lets them keep improving despite bugs in the code, which can lead to surprising jumps in performance once the bugs are finally removed [2].
- The article emphasizes the collaborative nature of deep learning projects, where bugs, once surfaced, can spur creative solutions from team members [2].

Group 3: Community and Learning Resources
- The article mentions a community of nearly 4,000 members, including over 300 companies and research institutions in the autonomous driving sector, as a platform for learning and knowledge sharing [3].
- It covers a broad range of technical areas in autonomous driving, including perception, mapping, and control [3].
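The running-robot anecdote is a classic reward-shaping failure: rewarding only forward progress invites jump-and-crash exploits. Below is a purely hypothetical toy reward illustrating the usual fix; every term and threshold is invented for illustration and is not from the project described.

```python
def running_reward(forward_velocity: float,
                   torso_height: float,
                   fell: bool) -> float:
    reward = forward_velocity        # progress alone is exploitable: the
                                     # agent learns to lunge or jump far
    if torso_height > 0.8:           # bonus for staying upright
        reward += 0.5
    if fell:                         # the fall penalty must outweigh what a
        reward -= 10.0               # single jump can earn, or jumps persist
    return reward
```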
Autonomous Driving Paper Express | Latest ICCV papers, end-to-end, HD maps, world models, and more~
自动驾驶之心· 2025-07-03 11:53
Core Insights
- The article covers recent advances in autonomous driving frameworks, highlighting the World4Drive, SafeMap, TopoStreamer, and BEV-VAE models, each of which improves a different part of the stack.

Group 1: World4Drive Framework
- World4Drive, developed by CASIA and Li Auto, integrates spatial semantic priors with multimodal driving-intention modeling, cutting L2 error by 18.1% (0.61 m to 0.50 m) and collision rate by 46.7% (0.30% to 0.16%) [2][3].
- It introduces an intention-aware latent world model that simulates how the physical world evolves under different driving intentions, closely mirroring human decision-making [3].
- The framework reaches state-of-the-art planning performance without perception annotations and converges 3.75x faster in training [3].

Group 2: SafeMap Framework
- SafeMap, proposed by Tsinghua University and collaborators, uses dynamic Gaussian sampling and panoramic feature distillation to build robust high-definition maps from incomplete observations, improving mAP by 11.1% when key views are missing [9][10].
- It features two novel modules: G-PVR for perspective-view reconstruction and D-BEVC for correcting bird's-eye-view features, keeping map construction accurate even with missing camera views [10].
- Experiments show SafeMap significantly outperforms existing methods and offers a plug-and-play way to harden map robustness [10].

Group 3: TopoStreamer Model
- TopoStreamer, developed by CUHK and Tencent, tackles temporal-consistency challenges in lane-topology reasoning, improving lane-segment perception mAP by 3.4% (to 36.6%) and centerline perception OLS by 2.1% (to 44.4%) [18][21].
- It introduces three modules that enforce temporal consistency of lane attributes and improve feature-representation learning [21].
- TopoStreamer achieves state-of-the-art lane-segment topology reasoning on the OpenLane-V2 benchmark [21].

Group 4: BEV-VAE Framework
- BEV-VAE, proposed by the Shanghai Qi Zhi Institute and Tsinghua University, builds a bird's-eye-view latent space for multi-view image generation with precise 3D layout control, reaching a multi-view spatial consistency (MVSC) score of 0.9505 on the Argoverse 2 dataset [29][31] (a minimal VAE sketch follows this summary).
- It supports novel-view synthesis by adjusting camera poses and shows strong cross-view consistency [34].
- The framework enables controllable synthesis conditioned on 3D object layouts, extending scene-understanding capabilities for autonomous driving [34].
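To ground the BEV-VAE entry, here is a minimal PyTorch sketch of a VAE over BEV feature maps. This is only the generic VAE skeleton implied by the name; the layer sizes, the plain conv stacks, and how multi-view images would be lifted into `bev_feats` are all assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BevVAE(nn.Module):
    """Toy VAE over bird's-eye-view feature maps (illustrative only)."""
    def __init__(self, in_ch: int = 64, latent_ch: int = 32):
        super().__init__()
        # Encoder outputs 2*latent_ch channels: mean and log-variance.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 2 * latent_ch, 3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, in_ch, 4, stride=2, padding=1),
        )

    def forward(self, bev_feats: torch.Tensor):
        mu, log_var = self.encoder(bev_feats).chunk(2, dim=1)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
        recon = self.decoder(z)
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).mean()
        return recon, kl

# Example: a batch of 2 BEV feature maps at 128x128 resolution.
recon, kl = BevVAE()(torch.randn(2, 64, 128, 128))
```

Editing the latent `z` (or the 3D layout conditioning the decoder, in the real model) is what would enable the controllable multi-view synthesis the summary describes.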
Tsinghua's latest RoboScape: a physics-informed embodied world model~
自动驾驶之心· 2025-07-03 06:34
Source: 具身智能之心, by Yu Shang et al. This article is shared for academic purposes only.

Research Background and Core Problem
In embodied intelligence, world models serve as powerful simulators: they can generate realistic robot videos and ease data scarcity, yet existing models have significant limitations in physical awareness. In contact-rich robotic scenarios especially, the lack of 3D-geometry and motion-dynamics modeling causes generated videos to show unrealistic object deformation or discontinuous motion, a problem most acute in manipulation tasks with deformable objects such as cloth.

The root cause is that existing models over-rely on fitting visual tokens and lack awareness of physical knowledge. Prior attempts to integrate physics fall into three categories: physics-prior regularization (confined to narrow domains such as human motion or rigid-body dynamics), knowledge distillation from physics simulators (computationally heavy cascaded pipelines), and material-field modeling (limited to object-level modeling, hard to use for scene-level generation). How to integrate physical knowledge within a unified, efficient framework has therefore become a pressing ...
PhD graduation now starts at five top-conference papers...
自动驾驶之心· 2025-07-03 06:34
Core Viewpoint
- The article stresses the importance of timely submission and high-quality research papers for academic success, particularly in autonomous driving and AI research, and introduces a structured 1v1 guidance program to help researchers navigate the research and publication process [2][3].

Group 1: Pain Points Addressed
- The program addresses the lack of guidance and structured support for researchers who would otherwise have to navigate their work alone [6].
- It helps students build a clear research framework and improve practical skills by pairing theoretical models with coding practice [6][13].
- The service targets computer science students at various academic levels who want to strengthen their research capabilities and academic output [6][13].

Group 2: Course Content
- The 1v1 guidance covers every stage: topic selection, experimental design, writing, and submission [5][9][11][12].
- During topic selection, mentors help students brainstorm ideas or suggest topics directly based on their needs [7].
- During experiments, mentors guide the entire process, ensuring feasibility and quality [9][14].
- The writing phase focuses on crafting papers that meet the standards of the target venue [11][15].
- At submission time, mentors recommend suitable journals and assist with the process [12][16].

Group 3: Course Outcomes
- Participants can expect to produce high-quality papers tailored to their target venues [23].
- The program deepens participants' understanding of the research process, writing techniques, and publication strategy [23][24].
- Students gain exposure to cutting-edge technologies and research trends in their fields [23][24].

Group 4: Course Structure and Duration
- Total guidance runs from 3 to 18 months, depending on the target publication tier [24].
- The core period includes weekly 1-on-1 sessions, while a maintenance period provides ongoing support after submission [26].
- Specific session counts are allocated by publication tier [24].
After gritting it out for six months, I'm content to have landed at a small company...
自动驾驶之心· 2025-07-02 13:54
Core Viewpoint
- The article discusses advances in AI technology, particularly autonomous driving and embodied intelligence, noting that the autonomous driving job market is saturating and job seekers face growing challenges [2].

Group 1: Industry Overview
- Autonomous driving has seen major breakthroughs, with L2 through L4 functionality reaching mass production alongside progress in humanoid and quadruped robots [2].
- Demand for technology and talent remains high, reflected in the launch of AutoRobo, a job-seeking community focused on autonomous driving, embodied intelligence, and robotics [2][3].

Group 2: Community and Resources
- The AutoRobo community has nearly 1,000 members, including professionals from companies such as Horizon Robotics, Li Auto, Huawei, and Xiaomi, as well as students preparing for upcoming recruiting seasons [2][4].
- It provides interview question banks, industry reports, salary-negotiation tips, and job referrals to help members run an effective search [3][4].

Group 3: Interview Preparation
- The community has compiled interview questions spanning autonomous driving and embodied intelligence, covering algorithm, development, and product roles [9][10][11].
- Specific areas include multi-sensor fusion, perception algorithms, and decision-making, giving members practical material for applications [10][14].

Group 4: Industry Reports and Insights
- Members get access to reports on the state, trends, and market opportunities of the autonomous driving and embodied intelligence sectors [15][19].
- Reports cover trajectory prediction, occupancy perception, and the humanoid-robotics market landscape, helping members understand industry dynamics [15][19].
How can traditional planning-and-control engineers find jobs this year?
自动驾驶之心· 2025-07-02 13:54
Core Viewpoint
- The article describes the evolving autonomous driving landscape, highlighting the integration of traditional planning and control with end-to-end systems, and the importance of adapting to industry trends for job seekers in this field [2][4][29].

Group 1: Industry Trends
- The shift toward end-to-end and VLA (Vision-Language-Action) systems is reshaping traditional planning-and-control roles, which remain essential for safety-critical applications such as L4 autonomous driving [2][4][29].
- Interviews increasingly probe the combination of rule-based algorithms with end-to-end approaches, so candidates need to be proficient in both [3][4].

Group 2: Educational Offerings
- The company has launched courses targeting real-world challenges in autonomous driving planning and control, focused on practical applications and interview preparation [5][7][10].
- The courses provide hands-on experience with industry-relevant projects, strengthening participants' resumes and job prospects [8][10][12].

Group 3: Course Structure
- The curriculum covers foundational algorithms, decision-making frameworks, and advanced topics such as contingency planning and interactive planning, giving a comprehensive view of the field [20][21][24][26][29].
- It also includes interview coaching, resume polishing, and personalized guidance from industry experts, aimed at improving employability [31][34][36].

Group 4: Target Audience
- The courses suit people with backgrounds in vehicle engineering, automation, computer science, and related fields, as well as those transitioning into autonomous driving roles [37][39].
- Participants should have basic programming skills and the relevant mathematical foundations to benefit fully [38][39].
World's first autonomous driving VLA survey released: a comprehensive breakdown of VLA driving models (McGill & Tsinghua et al.)
自动驾驶之心· 2025-07-02 13:54
Paper authors | Sicong Jiang et al.
Editor | 自动驾驶之心

Today 自动驾驶之心 shares the latest work from a research team spanning McGill University, Tsinghua University, Xiaomi, and the University of Wisconsin-Madison: a survey of Vision-Language-Action models for autonomous driving!

"Has the future of autonomous driving already arrived?" When the three capabilities of vision, language, and action are fused in a single model, where is autonomous driving headed? The team recently released the world's first comprehensive survey of Vision-Language-Action (VLA) models for the autonomous driving domain. The paper, titled "A Survey on Vision-Language-Action Models for Autonomous Driving", systematically ...
Ten papers from the lab accepted to ICCV 2025
自动驾驶之心· 2025-07-02 13:54
Core Insights
- Ten papers from the laboratory were accepted to the 20th ICCV International Conference on Computer Vision, reflecting advances in 3D vision and related technologies [25].

Paper Summaries

Paper 1: Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds
- Addresses domain generalization in 3D scene segmentation with a framework that couples geometric embedding with semantic learning to improve model generalization [1].

Paper 2: Hierarchical Variational Test-Time Prompt Generation for Zero-Shot Generalization
- Introduces a hierarchical variational method for dynamic prompt generation at inference time, significantly improving the zero-shot generalization of vision-language models [3].

Paper 3: Knowledge-Guided Part Segmentation
- Proposes a framework that uses structural knowledge to improve segmentation of fine-grained object parts and understanding of complex structures [5][6].

Paper 4: TopicGeo: An Efficient Unified Framework for Geolocation
- Presents a unified geolocation framework that improves computational efficiency and accuracy by matching query images directly against reference images [9].

Paper 5: Vision-Language Interactive Relation Mining for Open-Vocabulary Scene Graph Generation
- Explores a model that improves relationship understanding in open-vocabulary scene graph generation through multimodal interaction learning [11].

Paper 6: VGMamba: Attribute-to-Location Clue Reasoning for Quantity-Agnostic 3D Visual Grounding
- Proposes a mechanism that combines attribute and spatial information to improve accuracy on 3D visual grounding tasks [13].

Paper 7: Meta-Learning Dynamic Center Distance: Hard Sample Mining for Learning with Noisy Labels
- Introduces a metric called Dynamic Center Distance that focuses learning on hard samples in the presence of noisy labels [15].

Paper 8: Learning Separable Fine-Grained Representation via Dendrogram Construction from Coarse Labels for Fine-grained Visual Recognition
- Presents a method for learning fine-grained representations from coarse labels without a predefined number of categories, improving adaptability to dynamic semantic structures [17].

Paper 9: Category-Specific Selective Feature Enhancement for Long-Tailed Multi-Label Image Classification
- Addresses label imbalance in multi-label image classification by enhancing feature sensitivity for underrepresented categories [19].

Paper 10: Partially Matching Submap Helps: Uncertainty Modeling and Propagation for Text to Point Cloud Localization
- Redefines text-to-point-cloud localization to allow partial spatial matches, improving the model's handling of real-world ambiguity [21].
Autonomous Driving Paper Express | World models, VLA survey, end-to-end, and more
自动驾驶之心· 2025-07-02 07:34
World Model Epona
An autoregressive diffusion world model from Horizon Robotics, Tsinghua, Peking University, and other teams, accepted at ICCV'25; it can also output trajectory plans independently, without relying on video prediction.

Paper title: Epona: Autoregressive Diffusion World Model for Autonomous Driving
Paper link: https://arxiv.org/abs/2506.24113
Project page: https://kevin-thu.github.io/Epona/

Key contributions:
- Long-horizon generation: Epona sustains generation for up to 2 minutes, significantly outperforming existing world models.
- Real-time trajectory planning: the decoupled multimodal generation architecture outputs trajectory plans even when video prediction is unavailable, sharply reducing inference FLOPs and enabling high-quality, even real-time, planning at frame rates up to 20 Hz (see the sketch below).
- Preservation of visual detail: Epona's autoregressive formulation uses a continuous visual tokenizer rather than a discrete one, retaining rich scene detail.

(The original article's visualizations, framework diagram, and results table, comparing against baselines including DriveGAN [30] and DriveDreamer, did not survive extraction.)
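Under heavy assumptions, here is a sketch of the decoupling idea only: a shared backbone over continuous visual tokens feeds a lightweight trajectory head that runs without invoking the expensive video-generation path. The module choices (a GRU backbone, a linear waypoint head, a 10-step horizon) are placeholders, not Epona's actual architecture.

```python
import torch
import torch.nn as nn

class DecoupledPlanner(nn.Module):
    """Illustrative decoupled trajectory head over a shared backbone."""
    def __init__(self, d_model: int = 512, horizon: int = 10):
        super().__init__()
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.traj_head = nn.Linear(d_model, horizon * 2)  # (x, y) waypoints
        self.horizon = horizon

    @torch.no_grad()
    def plan(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, T, d_model) continuous tokens of past frames.
        _, h = self.backbone(visual_tokens)
        # The trajectory is predicted directly from the latent state, so no
        # future frames need to be generated -- this is what would keep
        # inference FLOPs low enough for ~20 Hz planning.
        return self.traj_head(h[-1]).view(-1, self.horizon, 2)

# Example: plan from 8 past-frame tokens for a batch of 2.
waypoints = DecoupledPlanner().plan(torch.randn(2, 8, 512))  # (2, 10, 2)
```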