自动驾驶之心
NUS × SJTU release RoboCerebra: a new benchmark for long-horizon robotic manipulation reasoning
自动驾驶之心· 2025-06-29 11:33
Core Insights
- The article discusses the development of RoboCerebra, a new benchmark designed to evaluate long-horizon robotic manipulation tasks, emphasizing the need for collaboration between high-level planning (VLM) and low-level control (VLA) models [6][8][10].

Group 1: Background and Motivation
- Recent advancements in visual-language models (VLM) have enabled robots to execute commands based on visual inputs, but challenges arise when tasks become more complex, requiring long-term planning and memory management [6][7].
- Existing benchmarks often fail to assess the collaborative capabilities of VLM and VLA, leading to performance issues in dynamic environments [8].

Group 2: RoboCerebra Contributions
- RoboCerebra includes a large-scale dataset and a systematic benchmark for evaluating cognitive challenges related to planning, memory, and reflection in robotic tasks [10].
- The dataset construction process integrates automated generation and manual annotation to ensure high quality and scalability [10].

Group 3: Task Setting
- The benchmark features long task sequences averaging 2,972 steps, with dynamic disturbances introduced to challenge the models' planning and recovery abilities [11].
- A top-down data generation pipeline utilizes GPT to create high-level tasks, which are then broken down into sub-goals and validated for logical consistency and physical feasibility [11][13].

Group 4: Evaluation Protocol and Metrics
- RoboCerebra employs a four-dimensional evaluation framework assessing success rate, plan match accuracy, plan efficiency, and action completion accuracy to measure the collaboration between VLM and VLA (a metric sketch follows after this summary) [15][21].
- The framework includes anchor points to synchronize evaluations across different models, ensuring consistency in task execution [21].

Group 5: Experimental Results
- The hierarchical planning and execution framework significantly improves task success rates, particularly in memory execution scenarios, demonstrating the necessity of collaboration between VLM and VLA [27].
- The results indicate that using either the VLA or the VLM alone is insufficient for stable performance in complex tasks, highlighting the importance of their integration [27][28].

Group 6: Memory Task Evaluation
- The evaluation of memory tasks shows that the VLM's reasoning capabilities are crucial for both memory exploration and execution, with GPT-4o outperforming other models in exploration success rates and decision accuracy [31][32].
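As a rough illustration of how such a four-axis protocol could be aggregated, the sketch below computes the four quantities named above from hypothetical per-episode rollout logs. The field names and the exact metric definitions are assumptions made for illustration; they are not taken from the RoboCerebra paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EpisodeLog:
    """Hypothetical record of one long-horizon rollout (VLM planner + VLA controller)."""
    succeeded: bool           # whether the full task was completed
    ref_plan: List[str]       # reference sub-goal sequence provided by the benchmark
    exec_plan: List[str]      # sub-goals the planner actually dispatched at each anchor
    completed_subgoals: int   # sub-goals the low-level controller finished correctly

def evaluate(logs: List[EpisodeLog]) -> Dict[str, float]:
    """Aggregate the four axes named in the summary; definitions are illustrative only."""
    n = len(logs)
    success_rate = sum(e.succeeded for e in logs) / n
    # Plan match: agreement between dispatched and reference sub-goals at shared anchors.
    plan_match = sum(
        sum(a == b for a, b in zip(e.exec_plan, e.ref_plan)) / max(len(e.ref_plan), 1)
        for e in logs
    ) / n
    # Plan efficiency: penalize redundant re-planning (longer executed plans score lower).
    plan_efficiency = sum(
        min(len(e.ref_plan) / max(len(e.exec_plan), 1), 1.0) for e in logs
    ) / n
    # Action completion: fraction of dispatched sub-goals the controller actually finished.
    action_completion = sum(
        e.completed_subgoals / max(len(e.exec_plan), 1) for e in logs
    ) / n
    return {
        "success_rate": success_rate,
        "plan_match_accuracy": plan_match,
        "plan_efficiency": plan_efficiency,
        "action_completion_accuracy": action_completion,
    }
```

A perfect rollout, e.g. `evaluate([EpisodeLog(True, ["pick", "place"], ["pick", "place"], 2)])`, would score 1.0 on all four axes under these illustrative definitions.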
Conference preview! The Conference on Technology and Industry Development of Dedicated Driverless Vehicles
自动驾驶之心· 2025-06-29 11:33
Group 1
- The conference on the development of dedicated vehicles for autonomous driving will be held from October 23 to 24, 2025, in Chongqing [1]
- The event is guided by the China Society of Automotive Engineering and organized by the Automotive Intelligent Transportation Subcommittee [1][3][5]
- Co-organizers include the Shanghai Research Institute for Intelligent Autonomous Systems and Tongji University School of Automotive Engineering [1][7][8]

Group 2
- The conference aims to facilitate communication and collaboration among industry professionals [1]
- Interested parties are encouraged to reach out for cooperation or participation [1]
Given where autonomous driving technology stands today, what applications remain for reconstruction?
自动驾驶之心· 2025-06-29 08:19
Core Viewpoint
- The article discusses the evolving landscape of 4D annotation in autonomous driving, emphasizing the shift from traditional SLAM techniques to more advanced methods for static element reconstruction and automatic labeling [1][4].

Group 1: Purpose and Applications of Reconstruction
- The primary purposes of reconstruction are to create 3D maps from lidar or multiple cameras and to output vector lane lines and categories [5][6].
- The application of 4D annotation to static elements remains broad, with a focus on lane markings and static obstacles, which require 2D spatial annotations at each timestamp [1][6].

Group 2: Challenges in Automatic Annotation
- The challenges in 4D automatic annotation include high temporal-consistency requirements, complex multi-modal data fusion, difficulty generalizing to dynamic scenes, conflicts between annotation efficiency and cost, and high demands for scene generalization in production [8][9].
- These challenges hinder the iterative efficiency of data loops in autonomous driving, impacting the system's generalization capabilities and safety [8].

Group 3: Course Structure and Content
- The course on 4D automatic annotation covers a comprehensive curriculum, including dynamic obstacle detection, SLAM reconstruction principles, reconstruction-based static element annotation (see the projection sketch after this summary), and the end-to-end ground-truth generation process [9][10][17].
- Each chapter includes practical exercises to enhance understanding and application of the algorithms discussed [9][10].

Group 4: Instructor and Target Audience
- The course is led by an industry expert with extensive experience in multi-modal 3D perception and data loop algorithms, who has participated in multiple production delivery projects [21].
- The target audience includes researchers, students, and professionals looking to transition into the data loop field, and requires a foundational understanding of deep learning and autonomous driving perception algorithms [24][25].
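One common pattern behind reconstruction-based static annotation is to label an element once in the reconstructed world frame and then reproject it into every timestamp using the per-frame poses, which is what gives the labels their temporal consistency. The sketch below shows only that reprojection step under assumed conventions (a world-frame polyline, per-frame world-to-camera extrinsics, pinhole intrinsics, no distortion); it is an illustration, not the course's actual pipeline.

```python
import numpy as np

def project_static_element(points_world: np.ndarray,   # (N, 3) element labeled once in the map frame
                           T_cam_from_world: list,     # per-timestamp 4x4 world-to-camera extrinsics
                           K: np.ndarray) -> list:     # (3, 3) pinhole intrinsics
    """Reproject a single world-frame annotation into every frame, yielding
    temporally consistent 2D labels (no distortion model, illustrative only)."""
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])  # (N, 4) homogeneous
    per_frame_labels = []
    for T in T_cam_from_world:
        cam = (T @ pts_h.T)[:3]            # (3, N) points in the camera frame
        visible = cam[2] > 0.1             # keep points in front of the camera
        uv = K @ cam[:, visible]           # pinhole projection
        uv = (uv[:2] / uv[2]).T            # (M, 2) pixel coordinates
        per_frame_labels.append(uv)
    return per_frame_labels
```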
Let's talk: what autonomous-driving paper last took you several days to work through?
自动驾驶之心· 2025-06-29 07:36
The community was created to give the autonomous driving industry a platform for technical exchange on both academic and engineering problems. Its members are mainly undergraduate, master's, and PhD students, along with algorithm engineers looking to enter or advance in the field, from institutions including but not limited to Tsinghua University, Peking University, Fudan University, Texas A&M, Westlake University, Shanghai Jiao Tong University, Shanghai AI Laboratory, HKUST, HKU, CUHK, NTU, NUS, ETH Zurich, and Nanjing University. We have also set up campus- and experienced-hire referral channels with many well-known companies, including Xiaomi Auto, Horizon Robotics, Li Auto, XPeng, NVIDIA, BYD, Huawei, DJI, Bosch, Banma, Momenta, NIO, and Baidu. Founders, executives, product managers, and operations staff at autonomous driving and AI companies, as well as teams working on data or HD maps, are also very welcome to join; connecting and bringing in resources is something we keep pushing forward. We firmly believe autonomous driving will change how people travel, and for those who want to join the industry and help push society forward, the community provides modules from fundamentals to advanced topics, with algorithm explanations plus code implementations, to make learning straightforward.
2025: feeling a bit lost in the job search...
自动驾驶之心· 2025-06-28 13:34
Core Insights
- The article highlights the rapid advancements in AI technologies, particularly in autonomous driving and embodied intelligence, which have significantly influenced the industry and attracted substantial investment [2]
- A new platform, the AutoRobo Knowledge Community, has been launched to assist job seekers in the fields of robotics, autonomous driving, and embodied intelligence, facilitating connections and providing resources [2][3]

Group 1: Community and Resources
- The AutoRobo Knowledge Community has nearly 1,000 members, including professionals from companies like Horizon Robotics, Li Auto, Huawei, and Xiaomi, as well as students preparing for upcoming job fairs [2]
- The community offers a variety of resources, including interview questions, industry reports, salary negotiation tips, and resume optimization services [3][4]

Group 2: Interview Preparation
- The community has compiled a list of 100 common interview questions related to autonomous driving and embodied intelligence, covering various technical aspects [6][7]
- Specific topics include sensor fusion, lane detection algorithms, and multi-modal 3D object detection, providing comprehensive preparation materials for job seekers [7][11]

Group 3: Industry Insights
- The community provides access to industry reports that detail the current state, development trends, and market opportunities within the autonomous driving and embodied intelligence sectors [12][15]
- Reports include insights into the Chinese humanoid robot market and the overall landscape of embodied intelligence, helping members understand the industry's dynamics [15]
Decisions are out! The latest ICCV 2025 roundup (autonomous driving / embodied AI / 3D vision / LLM / CV, and more)
自动驾驶之心· 2025-06-28 13:34
Core Insights
- The article discusses the recent ICCV conference, highlighting the excitement around the release of various works related to autonomous driving and the advancements in the field [2].

Group 1: Autonomous Driving Innovations
- DriveArena is introduced as a controllable generative simulation platform aimed at enhancing autonomous driving capabilities [4].
- Epona presents an autoregressive diffusion world model specifically designed for autonomous driving applications [4].
- SynthDrive offers a scalable Real2Sim2Real sensor simulation pipeline for high-fidelity asset generation and driving data synthesis [4].
- StableDepth focuses on scene-consistent and scale-invariant monocular depth estimation, which is crucial for improving perception in autonomous vehicles [4].
- CoopTrack explores end-to-end learning for efficient cooperative sequential perception, enhancing the collaborative capabilities of autonomous systems [4].

Group 2: Image and Vision Technologies
- CycleVAR repurposes autoregressive models for unsupervised one-step image translation, which can be beneficial for visual recognition tasks in autonomous driving [5].
- CoST emphasizes efficient collaborative perception from a unified spatiotemporal perspective, which is essential for real-time decision-making in autonomous vehicles [5].
- Hi3DGen generates high-fidelity 3D geometry from images via normal bridging, improving the spatial understanding of environments for autonomous systems [5].
- GS-Occ3D focuses on scaling vision-only occupancy reconstruction for autonomous driving using Gaussian splatting techniques [5].

Group 3: Large Model Applications
- ETA introduces a dual approach to self-driving with large models, enhancing the efficiency and effectiveness of autonomous driving systems [5].
- Taming the Untamed discusses graph-based knowledge retrieval and reasoning for multimodal large language models (MLLMs), which can significantly improve decision-making processes in autonomous driving [7].
CAS & ByteDance propose BridgeVLA, winner of a CVPR 2025 workshop challenge
自动驾驶之心· 2025-06-28 13:34
Core Viewpoint
- The article discusses the introduction of BridgeVLA, a new paradigm for 3D Vision-Language-Action (VLA) models that enhances data efficiency and operational effectiveness in robotic manipulation tasks [3][21].

Group 1: Introduction of BridgeVLA
- BridgeVLA integrates the strengths of existing 2D and 3D models by aligning inputs and outputs in a unified 2D space, thereby bridging the gap between Vision-Language Models (VLM) and VLA [5][21].
- The model achieves a success rate of 88.2% on the RLBench benchmark, outperforming all existing baseline methods [14][19].

Group 2: Pre-training and Fine-tuning
- The pre-training phase equips the VLM with the ability to predict 2D heatmaps from image-target text pairs, enhancing its target detection capabilities [8][10].
- During fine-tuning, BridgeVLA predicts actions from point clouds and instruction text, aligning the input with the pre-training phase to ensure consistency (a heatmap-decoding sketch follows after this summary) [11][12].

Group 3: Experimental Results
- On RLBench, BridgeVLA improved the average success rate from 81.4% to 88.2%, excelling particularly in high-precision tasks [14][15].
- The model demonstrated robust performance on the COLOSSEUM benchmark, increasing the average success rate from 56.7% to 64.0% across various perturbations [16][19].

Group 4: Real-World Testing
- In real-world evaluations, BridgeVLA outperformed the leading baseline RVT-2 in six out of seven settings, showcasing its robustness against visual disturbances [18][19].
- The model's ability to retain pre-training knowledge after fine-tuning indicates effective learning and generalization [19].

Group 5: Future Directions
- Future research will explore more diverse pre-training tasks to enhance the model's general visual understanding and consider integrating more expressive action decoding methods to improve policy performance [21].
- There is a plan to address the challenges of long-horizon tasks by using large language models (LLMs) for task decomposition [21].
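The "unified 2D space" idea can be pictured as the model always reading and writing image-shaped tensors: during pre-training the output heatmaps are supervised on 2D grounding targets, and during fine-tuning the same heads produce heatmaps over point-cloud renderings that are then lifted to 3D actions. The snippet below shows only the generic heatmap-to-keypoint decoding step (a soft-argmax); it is a sketch of a standard technique under that reading, not BridgeVLA's released code, and the lifting of multi-view keypoints back to a 3D gripper pose is omitted.

```python
import torch

def heatmap_keypoint(logits: torch.Tensor) -> torch.Tensor:
    """Soft-argmax over a predicted (H, W) heatmap: returns the expected (u, v)
    pixel location, a common way to decode 2D target/action heatmaps."""
    h, w = logits.shape
    probs = torch.softmax(logits.flatten(), dim=0).reshape(h, w)
    ys = torch.arange(h, dtype=probs.dtype)
    xs = torch.arange(w, dtype=probs.dtype)
    v = (probs.sum(dim=1) * ys).sum()   # expected row index
    u = (probs.sum(dim=0) * xs).sum()   # expected column index
    return torch.stack([u, v])

# In a BridgeVLA-style setup, keypoints like these could be supervised against
# 2D grounding targets during pre-training, while per-view heatmaps over
# point-cloud renderings could be lifted back to a 3D gripper position during
# fine-tuning (the multi-view lifting step is not shown here).
```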
An in-depth read of Kaiming He's CVPR 2025 talk: how do generative models move toward end-to-end?
自动驾驶之心· 2025-06-28 13:34
Core Viewpoint
- The article discusses the evolution of generative models in deep learning, drawing parallels to the revolutionary changes AlexNet brought to recognition models, and posits that generative models may be on the brink of a similar breakthrough with the introduction of MeanFlow, which simplifies generation from multiple steps to a single step [1][2][35].

Group 1: Evolution of Recognition Models
- Prior to AlexNet, layer-wise training was the dominant method for training recognition models, optimizing each layer individually and making training complex and cumbersome [2][3].
- The introduction of AlexNet in 2012 marked a significant shift to end-to-end training, allowing the entire network to be trained simultaneously, greatly simplifying model design and improving performance [3][7].

Group 2: Current State of Generative Models
- Generative models today resemble the pre-AlexNet era of recognition models, relying on multi-step inference processes such as diffusion models and autoregressive models, which raises the question of whether they are in a similar "pre-AlexNet" phase [7][9].
- The article emphasizes the need for generative models to transition from multi-step inference to end-to-end generation to achieve a revolutionary breakthrough [7][35].

Group 3: Relationship Between Recognition and Generation
- Recognition and generation can be viewed as two sides of the same coin: recognition is an abstraction process that extracts semantic information from data, while generation is a concretization process that turns abstract representations into realistic data samples [13][15][16].
- The fundamental difference lies in the nature of the mapping: recognition has a deterministic mapping from data to labels, while generation involves a highly nonlinear mapping from noise to complex data distributions, presenting both opportunities and challenges [18][20].

Group 4: Flow Matching and MeanFlow
- Flow matching is a key exploration direction for addressing the challenges faced by generative models, aiming to construct a flow field over data distributions to facilitate generation [20][22].
- MeanFlow, a recent method from Kaiming He's group, seeks to achieve one-step generation by replacing a complex integral with an average-velocity computation, significantly improving generation efficiency (the governing relations are sketched below) [24][27][29].
- In experiments, MeanFlow demonstrated impressive performance on ImageNet, achieving an FID of 3.43 with a single function evaluation and outperforming traditional multi-step models [31][32].

Group 5: Future Directions and Challenges
- The article outlines several future research directions, including consistency models, two-time-variable models, and revisiting normalizing flows, while questioning whether generative models are still in the "pre-AlexNet" era [33][34].
- Despite the advances made by MeanFlow, the challenge remains to identify a truly effective formulation for end-to-end generative modeling, which is an exciting and open research question [34][35].
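For readers who want the math behind the "average velocity" idea, the relations below paraphrase how MeanFlow is usually stated (data at t = 0, noise at t = 1; sign and time conventions may differ slightly from the talk):

```latex
% Average velocity of the flow over the interval [r, t]:
u(z_t, r, t) = \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, \mathrm{d}\tau

% Differentiating both sides with respect to t gives the MeanFlow identity
% used as the training target:
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t)

% One-step (single function evaluation) sampling with the learned network u_\theta:
x \approx z_1 - u_\theta(z_1,\, 0,\, 1), \qquad z_1 \sim \mathcal{N}(0, I)
```

Intuitively, a model that regresses the average velocity over an interval can traverse that whole interval in one jump, whereas a model that only knows the instantaneous velocity must integrate it over many small steps.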
We're hiring! Recruiting business partners for 2025, plenty of openings
自动驾驶之心· 2025-06-27 09:34
Group 1
- The article discusses the recruitment of 10 outstanding partners for the "Autonomous Driving Heart" team, focusing on the development of autonomous-driving-related courses, thesis guidance, and hardware development [2][3]
- The main areas of expertise sought include large models/multi-modal large models, diffusion models, VLA, end-to-end systems, embodied interaction, joint prediction, SLAM, 3D object detection, world models, closed-loop simulation with 3DGS, and large-model deployment and quantization-aware inference [3]
- Candidates should preferably hold a master's degree or higher from a university ranked within the QS top 200, with priority given to those with significant contributions at top conferences [4]

Group 2
- The company offers various benefits, including resource sharing for job seeking, doctoral studies, and study-abroad recommendations, along with substantial cash incentives and opportunities for entrepreneurial project collaboration [5][6]
- Interested parties are encouraged to contact the company via WeChat for consultation regarding institutional or corporate collaboration in autonomous driving [7]
The head of world models at SenseTime Jueying (绝影) has resigned...
自动驾驶之心· 2025-06-27 09:15
Core Viewpoint
- The article discusses the challenges and opportunities facing SenseTime's autonomous driving division, particularly in the context of market competition and technological advancements.

Group 1: Leadership Changes
- The head of world model development at SenseTime's autonomous driving division has left the company, raising concerns about the continuity of its cloud technology stack and R-UniAD development [2][3].

Group 2: Market Dynamics
- 2025 is expected to be a challenging year for SenseTime's autonomous driving division, with significant upgrades in the mid-tier market as it transitions from highway NOA to full-area NOA [4].
- The mid-tier market will see algorithm upgrades that could reshape the competitive landscape: companies that keep pace will remain viable while others may be pushed out [4].

Group 3: High-End Strategy
- The focus for high-end projects this year is a one-stage end-to-end solution, which has shown impressive performance and led manufacturers to prioritize it in their tenders [5].
- SenseTime's UniAD one-stage end-to-end solution has partnered with Dongfeng Motor, aiming for mass production by Q4 2025, which is crucial for establishing a foothold in the high-end market [5][6].

Group 4: Competitive Positioning
- SenseTime's ability to deliver a benchmark project in the high-end segment is critical for gaining credibility with manufacturers and securing further projects [6][7].
- The current window of opportunity for SenseTime in the high-end market is limited, as many manufacturers are testing high-end models this year, which could lead to a saturation of opportunities [6][8].