World's First Survey of VLA for Autonomous Driving Released: A Complete Breakdown of VLA Driving Models (McGill, Tsinghua, et al.)
Autonomous Driving Heart · 2025-07-02 13:54
Click the card below to follow the "Autonomous Driving Heart" WeChat account, or tap the link to receive learning roadmaps for nearly 15 autonomous driving directions. Today, Autonomous Driving Heart shares the latest work from a research team spanning McGill University, Tsinghua University, Xiaomi, and the University of Wisconsin-Madison: a survey of Vision-Language-Action models for autonomous driving! If you have related work to share, please contact us at the end of the article. For questions about autonomous driving courses and technical exchange groups, you are also welcome to add the assistant on WeChat: AIDriver004. >> For frontier autonomous driving information, see the Autonomous Driving Heart Knowledge Planet. Paper authors | Sicong Jiang et al. Editor | Autonomous Driving Heart

"Is the future of autonomous driving already here?" When the three capabilities of vision, language, and action converge in a single model, where is autonomous driving headed? Recently, a joint team from McGill University, Tsinghua University, Xiaomi, and the University of Wisconsin-Madison released the world's first comprehensive survey of Vision-Language-Action (VLA) models for autonomous driving. The paper, titled "A Survey on Vision-Language-Action Models for Autonomous Driving," systematically ...
10 Papers from the Laboratory Accepted to ICCV 2025
Autonomous Driving Heart · 2025-07-02 13:54
Core Insights
- The article reports the acceptance of 10 papers from the laboratory at the 20th International Conference on Computer Vision (ICCV 2025), highlighting advances in 3D vision and related technologies [25].

Paper Summaries
- Paper 1: Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds - addresses domain generalization in 3D scene segmentation, proposing a framework that couples geometric embedding with semantic learning to enhance model generalization [1].
- Paper 2: Hierarchical Variational Test-Time Prompt Generation for Zero-Shot Generalization - introduces a hierarchical variational method for dynamic prompt generation during inference, significantly improving the zero-shot generalization of vision-language models [3].
- Paper 3: Knowledge-Guided Part Segmentation - proposes a framework that uses structural knowledge to improve segmentation of fine-grained object parts and the understanding of complex structures [5][6].
- Paper 4: TopicGeo: An Efficient Unified Framework for Geolocation - presents a unified geolocation framework that improves computational efficiency and accuracy by directly matching query images with reference images [9].
- Paper 5: Vision-Language Interactive Relation Mining for Open-Vocabulary Scene Graph Generation - explores a model that strengthens relationship understanding in open-vocabulary scene graph generation through multimodal interaction learning [11].
- Paper 6: VGMamba: Attribute-to-Location Clue Reasoning for Quantity-Agnostic 3D Visual Grounding - proposes a mechanism combining attribute and spatial information to improve the accuracy of 3D visual grounding [13].
- Paper 7: Meta-Learning Dynamic Center Distance: Hard Sample Mining for Learning with Noisy Labels - introduces a new metric, Dynamic Center Distance, that focuses learning on hard samples in the presence of noisy labels [15].
- Paper 8: Learning Separable Fine-Grained Representation via Dendrogram Construction from Coarse Labels for Fine-grained Visual Recognition - presents a method for learning fine-grained representations from coarse labels without predefined category counts, improving adaptability to dynamic semantic structures [17].
- Paper 9: Category-Specific Selective Feature Enhancement for Long-Tailed Multi-Label Image Classification - addresses label imbalance in multi-label image classification by enhancing feature sensitivity for underrepresented categories [19].
- Paper 10: Partially Matching Submap Helps: Uncertainty Modeling and Propagation for Text to Point Cloud Localization - redefines text-to-point-cloud localization to allow partial spatial matches, improving robustness to real-world ambiguities [21].
Autonomous Driving Paper Express | World Models, a VLA Survey, End-to-End, and More
Autonomous Driving Heart · 2025-07-02 07:34
Core Insights
- The article covers recent advances in autonomous driving, particularly the Epona model, which uses autoregressive diffusion for trajectory planning and long-horizon generation [6][5].

Group 1: Epona Model
- Epona can generate sequences lasting up to 2 minutes, significantly outperforming existing world models [6].
- It offers real-time trajectory planning that operates independently of video prediction, reaching frame rates of up to 20 Hz [6].
- The model employs continuous visual tokens in its autoregressive formulation, preserving rich scene detail [6].

Group 2: Experimental Results
- The article compares Epona with other models on several metrics, highlighting its superior performance on FID and FVD [5].
- Epona achieved an FID of 7.5 and an FVD of 82.8, indicating its effectiveness at generating high-quality driving scenarios [5].

Group 3: Vision-Language-Action Models
- A survey on Vision-Language-Action models for autonomous driving is also discussed, covering various models and their capabilities [15][18].
- The models listed include DriveGPT-4, ADriver-I, and RAG-Driver, each with distinct features and datasets [18].

Group 4: StyleDrive Benchmarking
- The article introduces StyleDrive, which benchmarks end-to-end autonomous driving with a focus on driving-style awareness [21].
- It defines rule-based heuristic criteria for classifying driving style across a range of traffic scenarios [22].

Group 5: Community Engagement
- The article invites readers to join a knowledge-sharing community focused on autonomous driving, offering resources and networking opportunities [9][25].
- The community aims to build a comprehensive platform for learning and sharing the latest industry trends and job openings [25].
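For context on the metrics above: FID (Fréchet Inception Distance) measures the Fréchet distance between Gaussian fits of real and generated feature distributions, with lower being better. A minimal sketch, assuming diagonal covariances for simplicity (the full metric uses a matrix square root of the covariance product; the function name here is illustrative):

```python
import math

def fid_diagonal(mu1, var1, mu2, var2):
    """FID between Gaussians N(mu1, diag(var1)) and N(mu2, diag(var2)):
    ||mu1 - mu2||^2 + sum(v1 + v2 - 2*sqrt(v1*v2))."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2)
                   for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

# Identical feature distributions score 0:
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # 0.0
# A mean shift of 1 in each of two dimensions adds 2.0:
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]))  # 2.0
```

FVD applies the same Fréchet-distance idea to video features, which is why both numbers are reported together for world models like Epona.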
From Improved Perception to Lightweight Deployment, Embodied Intelligence Still Has a Long Way to Go
Autonomous Driving Heart · 2025-07-02 02:05
Core Viewpoint
- The embodied intelligence industry is expected to see explosive growth by 2025, driven by technological advances and application pull, shaping both the technical roadmap and commercialization pathways [1].

Group 1: Technological Developments
- Upgrades in perception capabilities and multimodal integration are crucial for embodied technologies, with a focus on tactile perception, particularly in dexterous hands, enhancing precision and feedback [1].
- Multimodal sensor fusion lets robots process multiple kinds of information simultaneously, significantly improving the accuracy and completeness of environmental perception [1].
- Large-model-driven algorithms are improving robots' understanding of the world, particularly for humanoid robots, through better perception, autonomous learning, and decision-making [1].
- Lightweight model design is becoming a pressing need for industry deployment, requiring low-compute, multimodal, cross-platform models [1].

Group 2: Simulation and Data Ecosystem
- Continuous improvement of simulation environments and data ecosystems is vital for embodied intelligence, providing efficient training platforms for robots [1].
- Simulations grounded in physical-world principles help model and analyze various phenomena, aiding robots in understanding physical interactions and operations [1].
- Aligning simulation with real-world environments remains a key challenge that researchers are working to overcome [1].

Group 3: Community and Resources
- The "Embodied Intelligence Heart Knowledge Planet" serves as a technical exchange platform for stakeholders in the field, including members from renowned universities and leading robotics companies [6].
- The community has compiled over 40 open-source projects and nearly 60 datasets related to embodied intelligence, along with mainstream simulation platforms and various learning pathways [6][12].
- Members can access a wealth of resources, including research reports, technical learning routes, and job opportunities in the embodied intelligence sector [11][14].
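To make the sensor-fusion point concrete, a standard textbook building block (not from the article) is inverse-variance weighting, which combines two noisy measurements of the same quantity so the more reliable sensor dominates:

```python
def fuse(z1, var1, z2, var2):
    """Inverse-variance weighted fusion of two noisy measurements
    of the same quantity; the fused variance is lower than either input."""
    w1 = 1.0 / var1
    w2 = 1.0 / var2
    fused = (w1 * z1 + w2 * z2) / (w1 + w2)
    fused_var = 1.0 / (w1 + w2)
    return fused, fused_var

# Hypothetical example: a low-noise lidar range and a noisier camera range.
est, var = fuse(10.0, 0.04, 10.6, 0.36)
print(round(est, 2), round(var, 3))  # 10.06 0.036
```

The fused estimate sits much closer to the lidar reading because its variance is nine times smaller; Kalman filtering generalizes this same rule over time.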
The Same Idea Got Someone Else into CVPR, While Yours Was Instantly Rejected?
Autonomous Driving Heart · 2025-07-02 02:05
Group 1
- The article stresses the importance of not being a "point solution": impactful research should have broader applications beyond specific metrics [1].
- It discusses the challenges of implementing research ideas, noting that even complex ideas can yield significant results if executed properly [2].
- It argues that experienced mentors in cutting-edge fields such as autonomous driving and robotics are necessary for publishing in top conferences [2][4].

Group 2
- The company offers comprehensive, hands-on support through to publication, focusing on building a solid framework for academic writing and targeted experimental guidance [3][5].
- It targets students and researchers who lack experience with publication standards, particularly those aiming for top conferences or transitioning into autonomous driving [5].
- The company claims a 96% acceptance rate for its students over the past three years, supported by a team of over 300 dedicated instructors from top global universities [8].

Group 3
- The article outlines the fields and topics the company can assist with, including advanced AI applications and various conference categories [9][10].
- It highlights the fast pace of AI research, where new state-of-the-art (SOTA) results can emerge within months, stressing the importance of timely support [11].
- The company encourages potential clients not to waste time and to seek assistance to speed up their research and publication [12].
Is Temporal Fusion Equivalent to Gradient Descent? GDFusion Sets a New Occupancy (OCC) SOTA, with GPU Memory Down ~70%
Autonomous Driving Heart · 2025-07-01 12:58
Today, Autonomous Driving Heart shares the latest work from the University of Macau and Wuhan University: is temporal fusion equivalent to gradient descent? GDFusion sets a new performance SOTA on occupancy (OCC) prediction while cutting GPU memory by up to 72%! If you have related work to share, please contact us at the end of the article. Paper authors | Dubing Chen et al. Editor | Autonomous Driving Heart

One-sentence summary: researchers from the University of Macau and other institutions propose GDFusion, a new temporal fusion framework. Through a strikingly elegant perspective, reinterpreting the conventional RNN update as gradient descent in feature space, it unifies the fusion of multiple heterogeneous sources of temporal information. GDFusion not only improves mIoU by 1.4%-4.8% on 3D occupancy prediction, but also cuts inference GPU memory by 27%-72%, a win for both performance and efficiency.

Paper title: Rethinking Temporal Fusion with a Unified Gradient Descent View for ...
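GDFusion's key observation, that a recurrent temporal-fusion update can be read as a gradient-descent step in feature space, can be illustrated with a toy example (illustrative only; the function names and the quadratic objective are mine, not the paper's actual formulation). One gradient step on L(h) = 0.5 * ||h - x_t||^2 reproduces the classic exponential-moving-average recurrent update:

```python
def gd_fusion_step(h_prev, x_t, lr):
    """One gradient-descent step on L(h) = 0.5 * ||h - x_t||^2,
    whose gradient at h_prev is (h_prev - x_t)."""
    return [h - lr * (h - x) for h, x in zip(h_prev, x_t)]

def ema_step(h_prev, x_t, alpha):
    """Classic exponential-moving-average (RNN-style) temporal update."""
    return [(1 - alpha) * h + alpha * x for h, x in zip(h_prev, x_t)]

h, x = [0.0, 2.0], [1.0, 1.0]
print(gd_fusion_step(h, x, 0.3))  # both give ~[0.3, 1.7]
print(ema_step(h, x, 0.3))
```

With the learning rate equal to the EMA coefficient, the two updates are algebraically identical, which is the sense in which temporal fusion "equals" gradient descent; the paper's contribution is to generalize this view so that heterogeneous temporal cues can all be folded into one framework.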
Black Warrior! A Research- and Teaching-Grade Full-Stack Autonomous Driving Vehicle Arrives
Autonomous Driving Heart · 2025-07-01 12:58
Core Viewpoint
- The article announces the launch of the "Black Warrior Series 001," a lightweight autonomous driving solution aimed at research and education, with a promotional price of 34,999 yuan and a deposit scheme for early orders [1].

Group 1: Product Overview
- The "Black Warrior 001" is developed by the Autonomous Driving Heart team and offers a comprehensive solution supporting perception, localization, fusion, navigation, and planning, built on an Ackermann chassis [2].
- It is designed for a range of educational and research applications, including undergraduate learning, graduate research, and teaching in laboratories and vocational schools [5].

Group 2: Performance and Testing
- The product has been tested in multiple environments, including indoor, outdoor, and parking scenarios, demonstrating its perception, localization, fusion, navigation, and planning capabilities [3].
- Specific tests include 3D point-cloud object detection, 2D and 3D laser mapping in indoor parking, and outdoor scene mapping, including night driving [7][9][11][15][17].

Group 3: Hardware Specifications
- Key hardware components:
  - 3D LiDAR: Mid 360
  - 2D LiDAR: lidar from Raysun
  - Depth camera: Orbbec with IMU
  - Main control chip: Nvidia Orin NX 16G
  - Display: 1080p [19]
- Vehicle specifications: weight 30 kg, battery power 50 W, voltage 24 V, maximum speed 2 m/s [21].

Group 4: Software and Functionality
- The software stack includes ROS, C++, and Python, supports one-click startup, and provides a development environment [23].
- The system supports 2D and 3D SLAM, vehicle navigation, and obstacle avoidance [24].

Group 5: After-Sales and Support
- The company offers one year of after-sales support for non-human damage, with free repairs during the warranty period for damage caused by operational errors or code modifications [46].
Xiaomi Hiring (Experienced & Campus) | Autonomous Driving and Embodied Intelligence Algorithm Researcher (VLA / Embodied Direction)
Autonomous Driving Heart · 2025-07-01 12:58
Core Insights
- The article describes a job opening for a researcher/scientist in autonomous driving and robotics, focused on developing an Embodied Foundation Model that integrates vision-language-action capabilities with advanced spatial perception and reasoning [2][3].

Group 1: Job Responsibilities
- Core responsibilities include leading the research and construction of cutting-edge multimodal large models, exploring the understanding of complex 3D worlds, and planning long-horizon, multi-step tasks through a World Model [3][4].
- Key capabilities to be developed include multimodal scene understanding, complex semantic reasoning and decision-making, and learning and adaptation mechanisms via reinforcement learning, imitation learning, and self-supervised learning [4].
- The role also involves establishing the technical vision and roadmap for a generalized, efficient embodied intelligence model, supporting technological evolution over the next 1-3 years, and exploring applications in autonomous driving and general robotics [4].

Group 2: Job Requirements
- Candidates should hold a PhD or have equivalent research experience in computer science, artificial intelligence, robotics, autonomous driving, or a related field [5].
- In-depth research and practical experience in one or more areas such as multimodal large models, embodied AI, reinforcement learning, and 3D vision and spatial intelligence is required [6][7].
- The ideal candidate has a strong academic record, including publications at top conferences, and a cross-disciplinary perspective that integrates knowledge from multiple fields to solve complex real-world problems [8].

Group 3: Additional Qualifications
- Experience with World Model theory and large-scale pre-training of models with billions of parameters, as well as deployment and validation of algorithms on real or high-fidelity robotic platforms, is considered a plus [11].
- Contributions to open-source projects and active participation in relevant communities are also valued [11].

Group 4: Application Details
- The position is primarily based in Beijing, with some openings in Shanghai; interested candidates can submit their resumes to the specified email address [10].
Major Livestream! Tsinghua & Bosch Open-Source a SOTA-Performance Pure VLA: Impromptu-VLA Bids Farewell to Dual Systems
Autonomous Driving Heart · 2025-07-01 12:58
Core Viewpoint
- The article discusses the advances and challenges of autonomous driving systems in unstructured environments, and introduces the Impromptu VLA framework, developed by Tsinghua AIR and the Bosch Research Institute to close the data gap in these scenarios [1].

Group 1: Advances in Autonomous Driving
- Current autonomous driving systems have made significant progress in structured environments such as cities and highways, but struggle in unstructured scenarios such as rural roads and construction zones [1].
- Existing large-scale autonomous driving datasets focus primarily on conventional traffic conditions, leaving a shortage of specialized, large-scale, finely annotated data for complex unstructured environments [1].

Group 2: Impromptu VLA Framework
- The Impromptu VLA framework provides an open-weight, open-data driving vision-language-action model: a fully end-to-end system that extracts multimodal features directly from driving video segments [1].
- Impromptu VLA generates driving commands in natural language, without hand-designed perception modules or intermediate representations [1].
- In the NeuroNCAP closed-loop safety evaluation, Impromptu VLA demonstrates strong decision robustness and generalization, significantly outperforming the BridgeAD system presented at CVPR 2025 (2.15 vs. 1.60) [1].
What Exactly Is Goal-Oriented Navigation? Are There Deployment Opportunities in Autonomous Driving?
Autonomous Driving Heart · 2025-07-01 12:24
Core Viewpoint
- Goal-oriented navigation enables robots to complete navigation tasks autonomously from goal descriptions alone, marking a significant shift from traditional visual-language navigation systems [2][3].

Group 1: Technology Overview
- Embodied navigation is a core area of embodied intelligence, resting on three technical pillars: language understanding, environmental perception, and path planning [2].
- Goal-oriented navigation lets robots explore unfamiliar 3D environments and plan paths using only a goal description such as coordinates, an image, or natural language [2].
- The technology has been industrialized across verticals including delivery, healthcare, hospitality, and industrial logistics, showcasing its adaptability and effectiveness [3].

Group 2: Technological Evolution
- The evolution of goal-oriented navigation falls into three generations:
  1. First generation: end-to-end methods based on reinforcement learning and imitation learning, achieving breakthroughs in point navigation and closed-set image navigation [5].
  2. Second generation: modular methods that explicitly construct semantic maps, splitting the task into exploration and goal-localization phases, with significant advantages in zero-shot object navigation [5].
  3. Third generation: integration of large language models (LLMs) and vision-language models (VLMs) to enhance knowledge reasoning and open-vocabulary target-matching accuracy [7].

Group 3: Challenges and Learning Path
- The complexity of embodied navigation, particularly goal-oriented navigation, requires knowledge from multiple fields, including natural language processing, computer vision, and reinforcement learning [9].
- Fragmented knowledge and an abundance of literature make it hard for newcomers to extract frameworks and follow development trends [9].
- A new course has been developed to address these challenges, focusing on practical applications and theoretical foundations to facilitate learning [10][11][12].

Group 4: Course Structure
- The course covers the following aspects of goal-oriented navigation:
  1. Semantic navigation framework: theoretical foundations and technical lineage [14].
  2. Habitat simulation ecosystem: the technical architecture of the Habitat platform [15].
  3. End-to-end navigation methodology: core algorithms and their performance differences [16].
  4. Modular navigation architecture: semantic-map construction and task-decomposition strategies [17].
  5. LLM/VLM-driven navigation systems: integration paradigms and algorithm design [18].

Group 5: Practical Application
- The course includes a capstone project on reproducing the VLFM algorithm and deploying it in the real world, giving participants hands-on experience [18][22].
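The modular (second-generation) split into exploration and goal localization ultimately bottoms out in classic path planning on a map. A toy sketch, purely illustrative and not taken from the course material: once the goal cell has been localized in a 2D occupancy grid, a breadth-first search yields a shortest collision-free path.

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path on a 2D occupancy grid (0 = free, 1 = blocked),
    returned as a list of (row, col) cells, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
# The wall in row 1 forces a detour around the right side:
print(bfs_path(grid, (0, 0), (2, 0)))
# [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]
```

Real systems like VLFM replace the known grid with a semantic map built online and pick frontiers with a vision-language value function, but the planning core remains a graph search of this kind.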