自动驾驶之心
Business Partner Recruitment! 4D Annotation / World Models / VLA / Model Deployment, and More
自动驾驶之心· 2025-10-02 03:04
Group 1
- The article announces the recruitment of 10 outstanding partners for the autonomous driving sector, focusing on course development, paper guidance, and hardware research [2]
- The main areas of expertise sought include large models, multimodal models, diffusion models, end-to-end systems, embodied interaction, joint prediction, SLAM, 3D object detection, world models, closed-loop simulation, and model deployment and quantization [3]
- Candidates are preferred from universities ranked within the QS top 200, holding a master's degree or higher, with priority given to those with significant conference contributions [4]

Group 2
- The compensation package includes resource sharing for job seeking, doctoral studies, and overseas study recommendations, along with substantial cash incentives and opportunities for entrepreneurial project collaboration [5]
- Interested parties are encouraged to add WeChat for consultation, specifying "organization/company + autonomous driving cooperation inquiry" [6]
BEVTraj: A New End-to-End Map-Free Trajectory Prediction Framework
自动驾驶之心· 2025-10-02 03:04
Core Viewpoint
- The article discusses the limitations of high-definition maps in autonomous driving and introduces BEVTraj, a new trajectory prediction framework that operates without relying on maps, achieving performance comparable to state-of-the-art (SOTA) models based on high-definition maps [1][3][26]

Group 1: Background and Challenges
- High-definition maps provide structured information that enhances prediction accuracy but have significant drawbacks, including high costs, limited coverage, and inability to adapt to dynamic changes like road construction or accidents [3]
- The reliance on high-definition maps is a major bottleneck for the large-scale deployment of autonomous driving technology [3]

Group 2: Solutions Explored
- Two main paths have been explored to address the challenges: online mapping, which still depends on a mapping module, and a map-free approach that utilizes raw sensor data for predictions [4][6]
- BEVTraj represents the latter approach, leveraging raw sensor data to extract sufficient geometric and semantic information for accurate trajectory predictions [4]

Group 3: BEVTraj Framework
- BEVTraj operates in a unified bird's-eye-view (BEV) space, consisting of a scene context encoder and an iterative deformable decoder [7]
- The scene context encoder extracts rich scene features from multi-modal sensor data and the vehicle's historical trajectory, generating a dense BEV feature map [11]
- A key innovation is the deformable attention mechanism, which attends to a small number of critical sampling points in the BEV feature map, improving computational efficiency [11]

Group 4: Iterative Refinement and Prediction
- The iterative deformable decoder generates the final multi-modal trajectory predictions using the deformable attention mechanism and a sparse goal candidate proposal module [13]
- The sparse goal candidate proposal (SGCP) module predicts a limited number of high-quality candidate goal points based on vehicle dynamics and scene context, streamlining the prediction process [13][14]

Group 5: Experimental Results
- BEVTraj's performance is competitive with SOTA models, demonstrating its effectiveness in generating reasonable trajectories even in complex scenarios like sharp turns and intersections [17][20]
- The results indicate that BEVTraj can learn implicit geometric constraints from raw sensor data, achieving a minimum Average Displacement Error (minADE) of 1.4556 and a minimum Final Displacement Error (minFDE) of 8.4384 [18]

Group 6: Summary and Value
- BEVTraj marks a milestone in the field of autonomous driving trajectory prediction by validating the feasibility of map-free solutions and enhancing system flexibility and scalability [21][26]
- The framework's efficient end-to-end architecture, utilizing deformable attention and sparse proposals, provides a valuable design paradigm for the industry [26]
- The open-source code will significantly promote research in map-free perception and prediction within the community [26]
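For readers unfamiliar with the minADE/minFDE figures quoted above, these are the standard multi-modal trajectory metrics: over K predicted modes, ADE averages the per-step displacement from the ground truth and FDE takes the endpoint displacement, each reported for the best mode. A minimal NumPy sketch (the function name and toy data are illustrative, not from the BEVTraj code):

```python
import numpy as np

def min_ade_fde(pred, gt):
    """pred: (K, T, 2) candidate trajectories; gt: (T, 2) ground truth.

    Returns (minADE, minFDE): the displacement errors of the best of
    the K predicted modes, as reported for multi-modal predictors.
    """
    # Pointwise Euclidean distance of every mode to the ground truth: (K, T)
    dists = np.linalg.norm(pred - gt[None], axis=-1)
    ade = dists.mean(axis=1)   # average displacement per mode
    fde = dists[:, -1]         # final-step displacement per mode
    return ade.min(), fde.min()

# Toy example: two modes, three future steps
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.array([
    [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]],   # offset by 1 m everywhere
    [[0.0, 0.0], [1.0, 0.0], [2.0, 2.0]],   # perfect until the endpoint
])
min_ade, min_fde = min_ade_fde(pred, gt)
```

Note that the minimum is taken independently per metric, which matches common benchmark reporting.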
A Proud Moment for Chinese Teams! CoRL 2025 Best Paper (Beijing Institute for General Artificial Intelligence, Unitree, et al.)
自动驾驶之心· 2025-09-30 16:04
Core Insights
- The article highlights significant advancements in robotics and autonomous driving from the CoRL 2025 conference held in Seoul, South Korea, where notable research papers were awarded for their contributions to the field [2][10]

Research Highlights
- The Best Paper award went to a collaboration between the Beijing Institute for General Artificial Intelligence, Unitree Robotics, and Beijing University of Posts and Telecommunications for "Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation," focusing on a hybrid force/position control model [2][10]
- The Best Student Paper award went to a team from the University of California, Berkeley for "Visual Imitation Enables Contextual Humanoid Control," which addresses motion control across embodied agents [5][10]

Finalist Overview
- "LocoFormer: Generalist Locomotion via Long-context Adaptation," on motion control for embodied agents [10]
- "Fabrica: Dual-Arm Assembly of General Multi-Part Objects via Integrated Planning and Learning," focusing on dual-arm planning and control strategies [10]
- "DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation," which explores human-robot interaction and data collection [10]
- "The Sound of Learning Simulation: Multimodal Sim-to-Real Robot Policies with Generative Audio," which combines generative models with multimodal approaches [10]
- "Pi 0.5: a Vision-Language-Action Model with Open-World Generalization," which presents a foundational model for vision-language-action tasks [10]
- "Steering Your Diffusion Policy with Latent Space Reinforcement Learning," which integrates generative models with reinforcement learning [10]

Community Engagement
- The article mentions the establishment of nearly 100 technical discussion groups related to various aspects of autonomous driving, including large models, end-to-end systems, and multi-modal perception [12][14]
- A community of approximately 4,000 members has been formed, with over 300 autonomous driving companies and research institutions participating, covering more than 30 learning pathways in autonomous driving technology [14]
A Pure VLA Survey Is Here! From VLMs to Diffusion to Reinforcement Learning Approaches
自动驾驶之心· 2025-09-30 16:04
Core Insights
- The article discusses the emergence and potential of Vision-Language-Action (VLA) models in robotics, emphasizing their ability to integrate perception, language understanding, and action execution into a unified framework [10][16]

Group 1: Introduction and Background
- Robotics has evolved from relying on pre-programmed instructions to utilizing deep learning for multi-modal data processing, enhancing capabilities in perception and action [1][10]
- The introduction of large language models (LLMs) and vision-language models (VLMs) has significantly improved the flexibility and precision of robotic operations [1][10]

Group 2: Current State of VLA Models
- VLA methods are categorized into four paradigms: autoregressive, diffusion, reinforcement learning, and hybrid/specialized methods, each with unique strategies and mechanisms [7][9]
- The development of VLA models is heavily dependent on high-quality datasets and realistic simulation platforms, which are crucial for training and evaluation [15][17]

Group 3: Challenges and Future Directions
- Key challenges in VLA research include data limitations, reasoning speed, and safety concerns, which need to be addressed to advance the field [7][9]
- Future research directions focus on enhancing generalization capabilities, improving interaction with dynamic environments, and ensuring robust performance in real-world applications [16][17]

Group 4: Methodological Innovations
- The article highlights the transition from traditional robotic systems to VLA models, which unify visual perception, language understanding, and executable control in a single framework [13][16]
- Innovations in VLA methodologies include autoregressive models for action generation, diffusion models for probabilistic action generation, and reinforcement learning for policy optimization [18][32]

Group 5: Applications and Impact
- VLA models have been applied across various robotic platforms, including robotic arms, quadrupeds, humanoid robots, and autonomous vehicles, showcasing their versatility [7][15]
- The integration of VLA models is seen as a significant step toward achieving general embodied intelligence, enabling robots to perform a wider range of tasks in diverse environments [16][17]
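To make the autoregressive paradigm concrete: such models typically discretize each continuous action dimension into a fixed number of bins so a language-model head can emit actions as tokens. A hedged sketch of uniform binning (the bin count, ranges, and function names are illustrative; the surveyed methods differ in details):

```python
import numpy as np

def actions_to_tokens(action, low=-1.0, high=1.0, n_bins=256):
    """Uniformly discretize each continuous action dimension into a token id."""
    a = np.clip(action, low, high)
    # Map [low, high] onto integer bins 0 .. n_bins-1
    ids = np.floor((a - low) / (high - low) * n_bins).astype(int)
    return np.minimum(ids, n_bins - 1)

def tokens_to_actions(ids, low=-1.0, high=1.0, n_bins=256):
    """Invert the binning by taking each bin's center value."""
    return low + (ids + 0.5) * (high - low) / n_bins

tok = actions_to_tokens(np.array([-1.0, 0.0, 1.0]))
rec = tokens_to_actions(tok)   # round-trip within half a bin width
```

The round-trip error is bounded by half a bin width, which is why fine-grained control usually demands a few hundred bins per dimension.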
NVIDIA Autonomous Driving Algorithm Engineer Interview
自动驾驶之心· 2025-09-29 23:33
Core Insights
- The article discusses the intricacies of job interviews in the autonomous driving sector, particularly the detailed role divisions within companies like NVIDIA and the technical expectations placed on candidates [3][4][5][8][11][12][14]

Group 1: Interview Process
- The interview process for autonomous driving positions involves multiple rounds, including technical assessments and coding challenges, with a focus on specific skills such as dynamic programming and algorithm optimization [4][5][8][11][12]
- Candidates are expected to demonstrate their understanding of advanced concepts like Model Predictive Control (MPC), Simultaneous Localization and Mapping (SLAM), and various optimization techniques [5][8][12][14]
- The coding challenges often include data structure manipulations, such as linked lists and dynamic programming problems, which are critical for assessing a candidate's problem-solving abilities [6][11][14]

Group 2: Technical Skills and Knowledge
- A strong grasp of algorithms, particularly in the context of planning and control for autonomous vehicles, is essential; candidates are often asked to explain complex algorithms like hybrid A* and kinodynamic RRT [12][14]
- Knowledge of deep learning, especially in image processing and object detection, is increasingly important in the autonomous driving field, reflecting the industry's shift toward integrating AI technologies [11][12][14]
- Candidates are also evaluated on their ability to communicate technical concepts clearly, indicating the importance of both technical and soft skills in the hiring process [8][11][12]

Group 3: Industry Trends
- The autonomous driving industry is experiencing a convergence of technology stacks, with a move toward unified models and higher technical barriers, which may reshape job roles and required skills [22]
- There is a growing community focused on sharing knowledge and resources related to job opportunities and industry developments, highlighting the collaborative nature of the field [19][22]
- The article emphasizes the importance of networking and community engagement for professionals seeking to advance their careers in autonomous driving [22]
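As a flavor of the data-structure coding challenges mentioned, in-place reversal of a singly linked list is a representative exercise (this specific problem is illustrative, not quoted from an actual NVIDIA interview):

```python
class ListNode:
    """Minimal singly-linked-list node."""
    def __init__(self, val, nxt=None):
        self.val = val
        self.next = nxt

def reverse_list(head):
    """Iteratively reverse a singly linked list in O(n) time, O(1) space."""
    prev = None
    while head:
        # Tuple assignment evaluates the right side first, so the old
        # head.next is saved before the pointer is redirected.
        head.next, prev, head = prev, head, head.next
    return prev

def to_list(node):
    """Collect node values into a Python list for inspection."""
    out = []
    while node:
        out.append(node.val)
        node = node.next
    return out

head = ListNode(1, ListNode(2, ListNode(3)))
reversed_vals = to_list(reverse_list(head))
```

Interviewers typically look for the O(1)-space pointer manipulation rather than a solution that copies values into an array.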
Some in Autonomous Driving Are Stuck in Blind Competition, While Others Are Building Real Moats...
自动驾驶之心· 2025-09-29 23:33
Core Viewpoint
- The automotive industry is undergoing a significant transformation, with numerous executive changes and a focus on advanced technologies such as autonomous driving and artificial intelligence [1][3]

Group 1: Industry Changes
- In September, 48 executives in the automotive sector underwent changes, indicating a shift in leadership and strategy [1]
- Companies like Li Auto and BYD are restructuring their teams to enhance their capabilities in autonomous driving and cockpit technology [1]
- The industry is witnessing a rapid evolution in algorithm development, moving from BEV to more complex models like VLA and world models [1][3]

Group 2: Autonomous Driving Focus
- The forefront of autonomous driving technology is centered on VLA/VLM, end-to-end driving, world models, and reinforcement learning [3]
- There is a notable gap in understanding of the industry's actual progress among students and mid-sized companies, highlighting the need for better communication between academia and industry [3]

Group 3: Community and Knowledge Sharing
- A community called "Autonomous Driving Heart Knowledge Planet" has been established to bridge the gap between academic and industrial knowledge, aiming to grow to nearly 10,000 members in two years [5]
- The community offers a comprehensive platform for learning, including video content, Q&A, and job exchange, catering to both beginners and advanced learners [6][10]
- Members can access over 40 technical routes and engage with industry leaders to discuss trends and challenges in autonomous driving [6][8]

Group 4: Learning Resources
- The community provides resources for practical questions related to autonomous driving, such as entry points for end-to-end systems and data annotation practices [6][11]
- A detailed curriculum is available for newcomers, covering essential topics in autonomous driving technology [20][21]
- The platform also includes job referral mechanisms to connect members with potential employers in the autonomous driving sector [13][14]
Fine-Tuning an Autonomous Driving VLM Based on the Open-Source Qwen2.5-VL
自动驾驶之心· 2025-09-29 23:33
Core Viewpoint
- The article discusses the development and application of LLaMA Factory, an open-source low-code framework for fine-tuning large models, particularly in the context of autonomous driving and vision-language models (VLMs) [1][2]

Group 1: LLaMA Factory Overview
- LLaMA Factory integrates widely used fine-tuning techniques and has become one of the most popular frameworks in the open-source community, with over 40,000 stars on GitHub [1]
- The framework is used here to train Qwen2.5-VL-7B-Instruct, which can provide traffic situation assessments through natural language interactions [1]

Group 2: Qwen2.5-VL Model
- Qwen2.5-VL is the flagship model in the Qwen vision-language series, achieving significant breakthroughs in visual recognition, object localization, document parsing, and long-video understanding [2]
- The model supports dynamic resolution processing and absolute time encoding, allowing it to handle images of various sizes and videos lasting several hours [2]
- It is offered in three model sizes, with the flagship Qwen2.5-VL-72B performing comparably to advanced models like GPT-4o and Claude 3.5 Sonnet [2]

Group 3: CoVLA Dataset
- CoVLA (Comprehensive Vision-Language-Action) is a dataset designed for autonomous driving, containing 10,000 real driving scenes and over 80 hours of video [3]
- The dataset utilizes scalable methods to generate precise driving trajectories from raw sensor data, accompanied by detailed natural language descriptions [3]
- CoVLA surpasses existing datasets in scale and annotation richness, providing a comprehensive platform for training and evaluating vision-language-action models [3]

Group 4: Model and Dataset Installation
- Instructions are provided for downloading and installing LLaMA Factory and the Qwen2.5-VL model, including commands for cloning the repository and installing the necessary packages [4][5][6]
- The article emphasizes the importance of configuring local paths for images and datasets to ensure proper functionality [7][13]

Group 5: Fine-Tuning Process
- The fine-tuning process is tracked using SwanLab, an open-source tool for visualizing AI model training [11]
- After fine-tuning, the model's performance is evaluated through a web UI, allowing users to interact with the model and assess its responses to various queries related to autonomous driving [20][21]
- The article notes that the fine-tuned model provides more relevant answers than the original model, which may produce less focused responses [22]
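For orientation, LLaMA Factory's multimodal demo data uses a ShareGPT-style layout in which each record pairs a `messages` list with an `images` list, with an `<image>` placeholder marking where the image is injected into the prompt. A sketch of building one CoVLA-style record in that shape (the file path and the question/answer text are invented for illustration):

```python
import json

def make_vlm_sample(image_path, question, answer):
    """One fine-tuning record in the ShareGPT-style multimodal layout
    used by LLaMA Factory's multimodal demo data."""
    return {
        "messages": [
            {"role": "user", "content": "<image>" + question},
            {"role": "assistant", "content": answer},
        ],
        "images": [image_path],
    }

sample = make_vlm_sample(
    "covla/scene_0001/frame_000.jpg",   # illustrative local path
    "Describe the traffic situation ahead.",
    "A pedestrian is crossing at the intersection; slow down and yield.",
)
# A dataset file is simply a JSON array of such records.
dataset_json = json.dumps([sample], ensure_ascii=False, indent=2)
```

The image paths must resolve on the training machine, which is why the article stresses configuring local paths before launching fine-tuning.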
Led by an Industry Veteran! Master End-to-End Autonomous Driving in Three Months
自动驾驶之心· 2025-09-29 08:45
Core Viewpoint
- 2023 is identified as the year end-to-end driving reached production, with 2024 expected to be a significant year for this development in the automotive industry, particularly in autonomous driving technology [1][3]

Group 1: End-to-End Production
- Leading new forces and manufacturers have already achieved end-to-end production [1]
- There are two main paradigms in the industry, one-stage and two-stage approaches, with UniAD being a representative of the one-stage method [1]

Group 2: Development Trends
- Since last year, the one-stage end-to-end approach has rapidly evolved, leading to derivatives such as perception-based, world-model-based, diffusion-model-based, and VLA-based one-stage methods [3]
- Major autonomous driving companies are focusing on self-research and mass production of end-to-end autonomous driving solutions [3]

Group 3: Course Offerings
- A course titled "End-to-End and VLA Autonomous Driving" has been launched, covering cutting-edge algorithms in both one-stage and two-stage end-to-end approaches [5]
- The course aims to provide insights into the latest technologies in the field, including BEV perception, vision-language models, diffusion models, and reinforcement learning [5]

Group 4: Course Structure
- The course consists of several chapters, starting with an introduction to end-to-end algorithms, followed by the background knowledge essential for understanding the technology stack [9][10]
- The second chapter focuses on the technical keywords most frequently asked about in job interviews over the next two years [10]
- Subsequent chapters delve into two-stage end-to-end methods, one-stage end-to-end methods, and practical assignments involving RLHF fine-tuning [12][13]

Group 5: Learning Outcomes
- Upon completion, participants are expected to reach a level equivalent to one year of experience as an end-to-end autonomous driving algorithm engineer [19]
- The course aims to deepen understanding of key technologies such as BEV perception, multimodal large models, and reinforcement learning, enabling participants to apply learned concepts to real projects [19]
Easy to Use and Affordable! A Lightweight Robotic Arm Built for Embodied AI Research
自动驾驶之心· 2025-09-28 23:33
Core Viewpoint
- The article introduces the Imeta-Y1, a lightweight and cost-effective robotic arm designed for the embodied research field, addressing the need for affordable yet high-quality hardware for researchers and practitioners [2][4]

Group 1: Product Features
- The Imeta-Y1 robotic arm is designed for education, research, and light industrial applications, featuring high-precision motion control, low power consumption, and an open software and hardware architecture [4][5]
- It supports seamless migration from simulation to the real machine, providing a complete open-source SDK and toolchain for users to quickly implement algorithm validation, data collection, model training, and application deployment [4][15]
- The arm has a compact structure and modular interfaces, making it particularly suitable for embedded AI and robot-learning platform development [5]

Group 2: Technical Specifications
- The Imeta-Y1 weighs 4.2 kg, has a rated load of 3 kg, and offers 6 degrees of freedom with a working radius of 612.5 mm and a repeat positioning accuracy of ±0.1 mm [8][17]
- It operates at a supply voltage of 24 V, is controlled from a PC, and communicates over CAN [8][17]
- The joint motion ranges are J1 from -165° to 165°, J2 from -180° to 0°, and J3 from 0° to 180°, with maximum joint speeds of 180°/s for J1–J3 and 220°/s for J4–J6 [8][17]

Group 3: Development and Support
- The product provides a comprehensive toolchain for data collection, model training, and inference deployment, supporting multi-modal data fusion and compatibility with mainstream frameworks like TensorFlow and PyTorch [15][31]
- It offers C++ and Python development interfaces, allowing developers to get started quickly regardless of their preferred language [16]
- The company ensures timely after-sales support, with a response time within 24 hours, and offers bulk purchase discounts and project development support [17][48]
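Given the documented joint ranges, a client-side sanity check before sending commands over CAN might look like the following sketch (only J1–J3 are validated because the summary does not list ranges for J4–J6; the function name is illustrative, not part of the Imeta-Y1 SDK):

```python
# Joint ranges for J1-J3 as listed in the specifications above, in degrees.
JOINT_LIMITS_DEG = {
    "J1": (-165.0, 165.0),
    "J2": (-180.0, 0.0),
    "J3": (0.0, 180.0),
}

def check_joint_targets(targets):
    """Return the joints whose commanded angle (in degrees) falls
    outside its documented motion range."""
    violations = []
    for joint, angle in targets.items():
        lo, hi = JOINT_LIMITS_DEG[joint]
        if not (lo <= angle <= hi):
            violations.append(joint)
    return violations

# J1 is commanded past its 165-degree limit; J2 and J3 are in range.
bad = check_joint_targets({"J1": 170.0, "J2": -90.0, "J3": 45.0})
```

Rejecting out-of-range targets on the host side avoids relying solely on the controller's own limit handling.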
MTRDrive: An Autonomous Driving VLA Framework with Dynamic Interactive Reasoning (Tsinghua & Xiaomi)
自动驾驶之心· 2025-09-28 23:33
Core Insights
- The article discusses the MTRDrive framework, which models autonomous driving as a dynamic interactive reasoning process, addressing the limitations of traditional static decision-making approaches [4][9][50]
- MTRDrive integrates a memory-tool synergistic mechanism to enhance perception accuracy and reasoning reliability, significantly improving the model's robustness in long-tail and out-of-distribution (OOD) scenarios [4][13][50]

Group 1: Challenges in Autonomous Driving
- Current vision-language-action (VLA) models face significant challenges in long-term reasoning and high-level decision-making, particularly in complex scenarios with few or no samples [3][5]
- Robust driving decisions rely heavily on the deep collaboration of perception accuracy and reasoning reliability, akin to human drivers who draw on accumulated experience for dynamic prediction and adaptive adjustment [3][8]

Group 2: MTRDrive Framework
- MTRDrive is a new framework proposed by teams from Tsinghua University, Xiaomi Auto, McGill University, and the University of Wisconsin-Madison, which breaks the limitations of traditional static decision-making [4][9]
- The framework includes a memory-tool collaborative mechanism that enhances the model's perception accuracy and supports robust decision-making in long-term and high-level tasks [4][15]

Group 3: Experimental Validation
- Systematic experiments demonstrate that MTRDrive significantly improves generalization and robustness in long-tail and OOD scenarios, providing a new technical pathway for deploying autonomous agents in complex real-world environments [4][34]
- In high-level planning tasks, MTRDrive achieved a planning accuracy of 82.6% on the NAVSIM dataset, more than double that of the Qwen2.5-VL-72B model [40]

Group 4: Memory and Tool Interaction
- MTRDrive incorporates a structured driving experience repository that allows the model to retrieve relevant past experiences, enhancing its decision-making capabilities [15][19]
- The framework employs a visual toolset that enables the model to actively probe the visual environment for high-fidelity information, improving its perception capabilities [21][28]

Group 5: Training Methodology
- MTRDrive utilizes a two-phase training process: supervised fine-tuning (SFT) to teach basic skills and reinforcement learning fine-tuning (RLFT) to optimize decision-making capabilities [24][29]
- The introduction of a memory retrieval mechanism significantly enhances the model's ability to generalize skills to new, unseen driving scenarios, as evidenced by improved performance metrics [44]
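The memory retrieval step can be pictured as a nearest-neighbor lookup over scene embeddings. The toy sketch below uses cosine similarity; MTRDrive's actual repository stores richer structured experiences, and every name and embedding here is illustrative:

```python
import numpy as np

def retrieve_experiences(query, memory_keys, memory_values, k=2):
    """Return the k stored experiences whose scene embeddings are most
    similar (by cosine similarity) to the current scene's embedding."""
    q = query / np.linalg.norm(query)
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    scores = keys @ q                       # cosine similarity per entry
    top = np.argsort(-scores)[:k]           # indices of the best matches
    return [memory_values[i] for i in top]

# Toy memory: three stored scenes with 3-d embeddings
memory_keys = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.9, 0.1, 0.0]])
memory_values = ["unprotected left turn", "highway merge", "left turn in rain"]
hits = retrieve_experiences(np.array([1.0, 0.05, 0.0]),
                            memory_keys, memory_values)
```

The retrieved experiences are then injected into the model's context, which is the mechanism the paper credits for the improved generalization to unseen scenarios.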