自动驾驶之心
China's First Hands-On Autonomous Driving VLA Course Is Here (Modular / Integrated / Reasoning-Enhanced VLA)
自动驾驶之心· 2025-09-16 10:49
Core Viewpoint
- The article discusses the transition in intelligent driving technology from rule-driven to data-driven approaches, highlighting the limitations of end-to-end models in complex scenarios and the potential of VLA (Vision-Language-Action) as a more streamlined solution [1][2].

Summary by Sections

Introduction to VLA
- The article emphasizes the ongoing challenges in the VLA technology stack, noting the proliferation of algorithms and the difficulties newcomers face in navigating this complex field [2].

Course Development
- A new course, "Practical Tutorial on Autonomous Driving VLA," has been developed in collaboration with academic teams to address the challenges of learning VLA technology, providing a comprehensive overview of the technical stack involved [2][3].

Course Features
- The course is designed to:
  - Address pain points and enable quick entry into the field through accessible language and case studies [3].
  - Build a framework for research capabilities by helping students categorize papers and extract innovative points [4].
  - Combine theory with practice, ensuring a complete learning loop [5].

Course Outline
- The course covers the origins of VLA, foundational algorithms, and the construction of datasets for VLA [6][15][19].

Chapter Breakdown
- **Chapter 1**: Overview of VLA algorithms and their historical development, including benchmarks and evaluation metrics [15].
- **Chapter 2**: Foundational algorithms for the Vision, Language, and Action modules, including deployment of large models [17].
- **Chapter 3**: VLM as an interpreter in autonomous driving, covering classic and cutting-edge algorithms [19].
- **Chapter 4**: Modular and integrated VLA, detailing the evolution of language models in planning and control [21].
- **Chapter 5**: Reasoning-enhanced VLA, emphasizing the integration of reasoning modules into decision-making [24].
- **Chapter 6**: A capstone project in which students build their own networks and datasets, focusing on practical application [26].

Instructor Background
- The course is led by experienced instructors with strong backgrounds in multimodal perception, autonomous driving VLA, and large-model frameworks [27].

Learning Outcomes
- Upon completion, students are expected to have a thorough understanding of current advancements in VLA, its core algorithms, and practical applications in projects [29][31].
BEVTraj: A New End-to-End Map-Free Trajectory Prediction Framework
自动驾驶之心· 2025-09-16 07:22
Core Viewpoint
- The article discusses the limitations of high-definition maps in autonomous driving and introduces BEVTraj, a new trajectory prediction framework that operates without maps yet achieves performance comparable to state-of-the-art (SOTA) models built on high-definition maps [1][3][26].

Group 1: Background and Challenges
- High-definition maps provide structured information that improves prediction accuracy, but they have significant drawbacks: high cost, limited coverage, and an inability to adapt to dynamic changes such as road construction or accidents [3].
- Reliance on high-definition maps is a major bottleneck for the large-scale deployment of autonomous driving technology [3].

Group 2: Proposed Solutions
- Two main approaches address the limitations of high-definition maps: online mapping and map-free methods. BEVTraj represents the latter, leveraging raw sensor data to support accurate trajectory prediction [4][6].

Group 3: BEVTraj Framework
- BEVTraj operates in a unified bird's-eye-view (BEV) space and consists of a scene context encoder and an iterative deformable decoder [7].
- The scene context encoder extracts rich scene features from multi-modal sensor data and vehicles' historical trajectories, generating a dense BEV feature map [11].
- Deformable attention lets the model focus on key sampling points within the BEV feature map, improving computational efficiency [11].

Group 4: Iterative Refinement and Prediction
- The iterative deformable decoder generates the final multi-modal trajectory predictions, using a sparse goal candidate proposal module that predicts a small number of high-quality candidate points to improve efficiency [13][14].
- The iterative refinement process adjusts the predicted trajectories based on the surrounding environment, ensuring they align with real road structures [14].

Group 5: Experimental Results
- BEVTraj rivals SOTA models built on high-definition maps, with competitive results on metrics such as minADE and minFDE [18][20].
- Even in complex scenarios such as sharp turns and intersections, BEVTraj generates reasonable, lane-aligned trajectories, indicating that it learns geometric constraints from raw sensor data [20].

Group 6: Summary and Value
- BEVTraj marks a milestone in autonomous driving trajectory prediction, validating the feasibility of map-free approaches [26].
- It improves system flexibility and scalability by eliminating the dependence on high-definition maps, facilitating broader deployment [26].
- Its efficient end-to-end architecture, built on deformable attention and sparse goal proposals, provides a valuable design paradigm for the industry [26].
- The open-source code should significantly advance research on map-free perception and prediction [26].
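The deformable-attention idea described in Group 3 — attending to a handful of learned sampling points on the dense BEV feature map instead of the full grid — can be sketched as follows. This is a minimal single-query, single-head illustration with bilinear sampling; all names, shapes, and values are illustrative, not BEVTraj's actual implementation.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample feat (H, W, C) at a fractional location (y, x)."""
    H, W, _ = feat.shape
    y, x = np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0]
            + wy * wx * feat[y1, x1])

def deformable_attention(bev_feat, ref_point, offsets, attn_logits):
    """Aggregate K sampled BEV features around a reference point.

    bev_feat:    (H, W, C) dense BEV feature map
    ref_point:   (2,) reference location (y, x) for the query
    offsets:     (K, 2) learned fractional offsets from the reference
    attn_logits: (K,) learned attention logits, softmax-normalized
    """
    weights = np.exp(attn_logits - attn_logits.max())
    weights /= weights.sum()
    samples = np.stack([
        bilinear_sample(bev_feat, ref_point[0] + dy, ref_point[1] + dx)
        for dy, dx in offsets
    ])                        # (K, C) features at the K sampling points
    return weights @ samples  # (C,) attention-weighted sum

rng = np.random.default_rng(0)
bev = rng.normal(size=(32, 32, 8))  # toy 32x32 BEV map with 8 channels
out = deformable_attention(bev, np.array([15.5, 10.2]),
                           rng.normal(size=(4, 2)),  # K=4 sampling points
                           rng.normal(size=4))
print(out.shape)  # (8,)
```

Because each query touches only K points, the cost scales with K rather than with the full H×W grid, which is the efficiency gain the article attributes to deformable attention.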
The Technical Leaders of China's Embodied Intelligence
自动驾驶之心· 2025-09-16 03:34
Core Viewpoint
- The article highlights the rapid development and commercialization of embodied intelligence, emphasizing the competitive landscape among global teams and the importance of technological breakthroughs in the field [4][5].

Group 1: Industry Overview
- The last two years have seen significant advances in hardware, data collection, and algorithms, carrying embodied intelligence beyond laboratory settings [4].
- Embodied intelligence has become a recognized core direction for commercialization globally, with teams competing intensely in this space [4].
- The next generation of technological breakthroughs will focus on general embodied intelligence and scene-adaptive learning [4].

Group 2: Key Players in Embodied Intelligence
- **Yushu Technology (Unitree)**: Founded by Wang Xingxing, the company specializes in quadruped robots and has developed multiple models, including Laikago and AlienGo. Wang has over 10 years of robot-development experience and holds more than 100 patents [8].
- **Xinghai Map**: Co-founded by Zhao Xing, the company focuses on embodied intelligence and multimodal learning and contributed to the first mass-produced autonomous driving model based on large models [12][13].
- **Galaxy General**: Founded by Wang He, the company is dedicated to embodied intelligence and humanoid robots, with significant research contributions in 3D vision and robot learning [18].
- **Zhiyuan Robotics**: Led by Luo Jianlan, the company focuses on high-precision assembly tasks using reinforcement learning, reporting a 100% success rate in real-world applications [23].
- **Variable Robotics**: Co-founded by Wang Hao, the company aims to integrate large models with embodied intelligence and has launched the WALL-A model, described as the largest operational model globally [26].
- **Zhuji Power**: Founded by Zhang Wei, the company is developing full-size humanoid robots, has launched the W1 commercial robot, and plans mass production of humanoid robots by 2025 [30].
- **Stardust Intelligence**: Founded by Lai Jie, the company focuses on AI robots for household use and has achieved breakthroughs in embodied-intelligence data acquisition [32].
- **Cloud Deep**: Founded by Zhu Qiuguo, the company specializes in humanoid and quadruped robots, with a strong emphasis on in-house research and development of core components [34].
- **Qianxun Intelligence**: Founded by Han Fengtao, the company has developed the Moz1 humanoid robot, featuring advanced control capabilities, and has raised over 1 billion yuan in funding [38].
- **Physical Intelligence**: Co-founded by Sergey Levine, the company builds advanced AI models for robots and has reached significant funding and technology milestones [40][41].
- **Figure AI**: Founded by Brett Adcock, the company develops humanoid robots for commercial applications, with notable advances in collaborative robot control [44][45].

Group 3: Future Outlook
- The article concludes that the vision and persistence of technology leaders are crucial to advancing the industry, with multiple paths converging toward a flexible, adaptive, and highly interactive future for embodied intelligence [54][55].
Ant Group Is Hiring Large-Model Data Intelligence Algorithm Engineers (Internal Referrals Available)
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article describes the responsibilities and requirements for a position focused on developing advanced algorithms for large-model data production, emphasizing data knowledge systems, automatic classification, authoritative evaluation sets, quality assessment, and innovative solutions in artificial intelligence and deep learning [1][2][3].

Group 1: Responsibilities
- The role involves designing and developing algorithms for key problems in large-model data production, including data knowledge system generation, automatic corpus classification, authoritative evaluation set construction, and quality assessment of training data [1][5].
- Specific tasks include researching LLM-based automatic knowledge graph generation, developing classification algorithms, and creating standardized evaluation sets to assess model performance [1][5].
- The position also requires establishing a data-driven quality assessment system, identifying low-quality data, and synthesizing training data to improve model performance [1][5].

Group 2: Requirements
- Candidates should hold a master's degree or higher in computer science, artificial intelligence, deep learning, or a related field, and be proficient in deep learning frameworks such as PyTorch and TensorFlow [2][6].
- Strong problem-solving skills, self-motivation, and the ability to analyze and resolve issues are essential, along with effective communication and coordination [2][6].
- Preference is given to candidates with hands-on experience in large-model data system design, corpus classification, evaluation set construction, and data annotation algorithms [3][4][6].
VLA's Spatial Understanding Remains Largely Untapped! A New Attempt with OccVLA (Shanghai Qi Zhi Institute, Tsinghua, SJTU, et al.)
自动驾驶之心· 2025-09-15 23:33
Core Insights
- The article discusses the limitations of existing multimodal large language models (MLLMs) in robust 3D spatial understanding, which is crucial for autonomous driving [3][4].
- It introduces OccVLA, a novel framework that integrates 3D occupancy representation into a unified multimodal reasoning process, improving the model's ability to learn fine-grained spatial structure from 2D visual inputs [3][9].

Group 1: Introduction and Challenges
- Recent advances in end-to-end autonomous driving have highlighted the gap between 2D and 3D perception, which limits the broad application of vision-language models (VLMs) in complex driving scenarios [4][5].
- Two main challenges are identified: constructing usable and effective 3D representations without expensive manual annotation, and the lack of large-scale 3D vision-language pre-training, which results in the loss of fine-grained spatial detail [5][8].

Group 2: OccVLA Framework
- OccVLA is designed to perform occupancy prediction, vision-language reasoning, and action generation simultaneously, addressing the sparsity of the occupancy representation and strengthening 3D understanding [9][18].
- The framework employs a cross-attention mechanism that receives visual features from the VLM's intermediate layers, integrating occupancy tokens into the reasoning process without additional computational overhead [9][20].

Group 3: Performance and Contributions
- OccVLA demonstrates superior performance across perception and planning tasks, achieving state-of-the-art results on the nuScenes dataset for trajectory planning and 3D visual question answering [10][11].
- The main contributions are the OccVLA framework itself, a cross-modal attention design that allows the occupancy prediction process to be skipped during inference, and competitive results on trajectory planning tasks [11][36].

Group 4: Experimental Results
- The experiments use the nuScenes dataset, with 700 training scenes and 150 validation scenes, to evaluate 3D localization, target querying, and relational comparison [35][36].
- OccVLA's motion planning was compared against several baselines, achieving the best performance with only camera input and occupancy supervision, outperforming models that rely on more complex inputs [37][38].

Group 5: Visual Question Answering
- On the challenging NuScenes-QA benchmark, the model demonstrates that 3D understanding can be learned from purely visual input, surpassing larger models that depend on LiDAR data or explicit ground-truth occupancy [41][42].
- The results indicate that OccVLA effectively uses occupancy supervision to strengthen its 3D reasoning in autonomous driving scenarios [41][45].
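The cross-attention mechanism described in Group 2 — occupancy query tokens reading visual features from the VLM's intermediate layers — might look like the following minimal single-head sketch. Shapes and names are toy assumptions for illustration, not OccVLA's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(occ_queries, vis_feats, Wq, Wk, Wv):
    """Occupancy query tokens attend to visual features from a VLM layer.

    occ_queries: (N, d) learnable occupancy tokens
    vis_feats:   (M, d) intermediate visual features from the VLM
    Wq, Wk, Wv:  (d, d) projection matrices
    """
    Q, K, V = occ_queries @ Wq, vis_feats @ Wk, vis_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (N, M) attention map
    return attn @ V                                 # (N, d) updated tokens

rng = np.random.default_rng(1)
d = 16
occ = rng.normal(size=(9, d))   # 9 toy occupancy tokens
vis = rng.normal(size=(49, d))  # 49 toy visual patch features
out = cross_attention(occ, vis, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (9, 16)
```

Because the occupancy branch only reads from the VLM via cross-attention and writes nothing back, it can be dropped at inference time, which is consistent with the skip-at-inference property the article highlights.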
Paper Breakdown: HKUST's PLUTO, the First Planner to Beat Rule-Based Methods!
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article discusses the development and features of the PLUTO model in the end-to-end autonomous driving domain, emphasizing its unique two-stage architecture and its direct encoding of structured perception outputs for downstream control tasks [1][2].

Summary by Sections

Overview of PLUTO
- PLUTO is trained with three main losses: a regression loss, a classification loss, and an imitation learning loss, which together drive the model's performance [7].
- Additional auxiliary losses are incorporated to aid convergence [9].

Course Introduction
- The article introduces a new course, "End-to-End and VLA Autonomous Driving," developed in collaboration with top algorithm experts from leading domestic manufacturers, aimed at the challenges learners face in this rapidly evolving field [12][15].

Learning Challenges
- The course addresses the difficulties learners face from the fast pace of technical development and the fragmentation of knowledge across domains, which makes it hard for beginners to grasp the necessary concepts [13].

Course Features
- The course is designed to provide quick entry into the field, build a framework for research capabilities, and combine theory with practical application [15][16][17].

Course Outline
- The course consists of several chapters covering the history and evolution of end-to-end algorithms, background on the relevant technologies, and detailed discussions of both one-stage and two-stage end-to-end methods [20][21][22][29].

Practical Application
- The course includes hands-on assignments, such as RLHF fine-tuning, letting students apply theory in real-world scenarios [31].

Instructor Background
- The instructor, Jason, has a strong academic and practical background in cutting-edge end-to-end and large-model algorithms, lending the course credibility [32].

Target Audience and Expected Outcomes
- The course targets individuals with a foundational understanding of autonomous driving and related technologies, aiming to bring their skills to the level of an end-to-end autonomous driving algorithm engineer within a year [36].
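The three training losses listed in the PLUTO overview above (regression, classification, imitation) plus auxiliary terms combine as a weighted sum. The sketch below is a toy illustration of that pattern, not PLUTO's actual loss code; the winner-takes-all selection, the head supervised by the imitation term, and all weights are assumptions.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber) loss, a common choice for trajectory regression."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()

def cross_entropy(logits, target_idx):
    """Classification loss: score the candidate closest to the expert highest."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target_idx]

def pluto_style_loss(pred_trajs, mode_logits, init_traj, expert_traj,
                     aux_losses=(), w_reg=1.0, w_cls=1.0, w_imit=1.0):
    """Weighted sum of regression, classification, and imitation terms.

    pred_trajs:  (K, T, 2) K candidate trajectories of T (x, y) waypoints
    mode_logits: (K,)      scores over the K candidates
    init_traj:   (T, 2)    coarse plan supervised by imitation (an assumption)
    expert_traj: (T, 2)    expert ground-truth trajectory
    """
    # Winner-takes-all: only the candidate closest to the expert is regressed.
    dists = np.linalg.norm(pred_trajs - expert_traj, axis=-1).mean(axis=-1)
    best = int(np.argmin(dists))
    reg = smooth_l1(pred_trajs[best], expert_traj)  # regression loss
    cls = cross_entropy(mode_logits, best)          # classification loss
    imit = smooth_l1(init_traj, expert_traj)        # imitation learning loss
    return w_reg * reg + w_cls * cls + w_imit * imit + sum(aux_losses)

expert = np.linspace(0, 1, 8)[:, None] * np.array([1.0, 0.5])  # toy expert path
cands = np.stack([expert + off for off in (0.0, 0.3, -0.3)])   # 3 candidates
loss = pluto_style_loss(cands, np.array([2.0, 0.0, 0.0]), expert, expert)
print(loss >= 0.0)
```

The auxiliary losses mentioned in the article would simply be appended to `aux_losses` in this formulation; keeping them as an additive tail is one common way to let extra supervision aid convergence without changing the main objective.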
Everything About Large Models and Autonomous Driving
自动驾驶之心· 2025-09-15 23:33
Group 1
- The article emphasizes the growing interest in large-model technologies, particularly RAG (Retrieval-Augmented Generation), AI agents, multimodal large models (pre-training, fine-tuning, reinforcement learning), and deployment and inference optimization [1].
- A community named "Large Model Heart Tech" is being established to focus on these technologies, with the aim of becoming the largest domestic community for large-model technology [1].
- The community is also building a knowledge platform to provide industry and academic information and to cultivate talent in the large-model field [1].

Group 2
- The article describes the community as a serious, content-driven platform aimed at nurturing future leaders [2].
As a Research Direction, VLA at Least Offers a Way Out of Endless Corner Cases!
自动驾驶之心· 2025-09-15 03:56
Core Viewpoint
- VLA (Vision-Language-Action) is emerging as a mainstream keyword in autonomous driving: new players are rapidly entering the field and industrial deployment is accelerating, while academia continues to innovate and compete [1][2].

Summary by Sections

1. VLA Research and Development
- The VLA model represents a shift from traditional modular architectures to a unified end-to-end model that maps raw sensor inputs directly to driving control commands, addressing long-standing bottlenecks in autonomous driving technology [3][4].
- Traditional modular architectures (L2-L4) offer clear logic and independent debugging but suffer from cumulative errors and information loss at module boundaries, making them less effective in complex traffic scenarios [4][5].

2. VLA Model Advantages
- VLA models leverage the strengths of large language models (LLMs) to improve interpretability and reliability and to generalize to unseen scenarios, overcoming limitations of earlier models [5][6].
- VLA models can explain their decision-making in natural language, improving transparency and trust in autonomous systems [5][6].

3. Course Objectives and Structure
- The course aims to provide a systematic understanding of VLA, helping participants develop practical skills in model design and research paper writing while addressing common challenges faced by newcomers [6][7].
- The curriculum includes 12 weeks of online group research, followed by 2 weeks of paper guidance and 10 weeks of paper maintenance, covering both theory and hands-on coding [7][8].

4. Enrollment and Requirements
- The program is designed for a small group of 6 to 8 participants with a foundational understanding of deep learning and basic programming skills [11][16].
- Participants are expected to engage actively in discussions and complete assignments on time, maintaining academic integrity throughout the course [20][29].

5. Course Highlights
- The course offers a comprehensive learning experience with a multi-faceted teaching approach, including guidance from experienced mentors and a structured evaluation system to track progress [23][24].
- Participants gain access to essential resources, including datasets and baseline code, to support their research and experimentation [24][25].
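The cumulative-error argument in Section 1 — that each hand-off in a modular perception-prediction-planning pipeline loses information, while a single end-to-end mapping has only one interface — can be illustrated with a toy numeric model. Noise here stands in for per-module information loss; this is an illustration of the argument, not a claim about any specific system.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_module(x, sigma):
    """Model each module hand-off as injecting a small amount of noise."""
    return x + rng.normal(scale=sigma, size=x.shape)

def modular_pipeline(sensor, n_stages=3, sigma=0.1):
    """Perception -> prediction -> planning: interface noise compounds."""
    x = sensor
    for _ in range(n_stages):
        x = noisy_module(x, sigma)
    return x

def end_to_end(sensor, sigma=0.1):
    """A single learned mapping: one interface, one noise injection."""
    return noisy_module(sensor, sigma)

truth = np.zeros(1000)                              # ideal control outputs
err_mod = np.abs(modular_pipeline(truth)).mean()    # compounded error
err_e2e = np.abs(end_to_end(truth)).mean()          # single-hop error
print(err_mod > err_e2e)
```

With independent per-stage noise, the modular pipeline's output variance grows linearly with the number of stages, which is the intuition behind the "cumulative error effects" limitation the article ascribes to L2-L4 modular stacks.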
SimpleVLA-RL: Breaking Through the VLA Training Bottleneck with End-to-End Online RL
自动驾驶之心· 2025-09-15 03:56
Core Insights
- The article presents SimpleVLA-RL, a framework that improves the training of Vision-Language-Action (VLA) models for robotics through reinforcement learning (RL), addressing key challenges in data scarcity and generalization [3][4][6].

Group 1: Research Background and Core Issues
- VLA models are central to robotic manipulation, integrating visual perception, language understanding, and action generation, but current training faces two main bottlenecks: data scarcity and weak generalization [4][6].
- Traditional training relies heavily on large-scale human operation data, which is costly and hard to scale, limiting model scalability [4][6].
- The article asks whether RL can improve the long-horizon action planning of VLA models, despite the unique challenges posed by VLA applications [4][6].

Group 2: SimpleVLA-RL Framework Contributions
- SimpleVLA-RL is designed to improve VLA training efficiency, particularly in data-scarce settings, and achieves state-of-the-art (SOTA) performance on benchmarks such as LIBERO and RoboTwin [7][8].
- The framework combines interactive trajectory sampling, parallel training across multiple environments, and a unified design for training, inference, and rendering, addressing the slow interaction and high cost of VLA models [7][8].
- It delivers significant gains in success rates across tasks, for example raising LIBERO's average success rate from 91.0% to 99.1% and RoboTwin 2.0's from 38.3% to 68.8% [7][8][14].

Group 3: Data Efficiency and Generalization
- SimpleVLA-RL sharply reduces the dependence on large-scale demonstration data, achieving a 96.9% average success rate with only a single demonstration trajectory, surpassing full-trajectory supervised fine-tuning [19][20].
- The framework improves robustness across scenes, objects, and tasks, outperforming traditional methods on unseen tasks [21][24].

Group 4: Real-World Deployment and Innovations
- The framework shows effective Sim-to-Real transfer, improving real-world task success from 17.5% to 38.5% using only simulated data for training [24][27].
- A notable discovery is the "Pushcut" phenomenon, in which the RL-trained model autonomously discovers strategies more efficient than the human demonstrations, hinting at innovative behavior in VLA models [25][30].

Group 5: Summary and Conclusions
- SimpleVLA-RL addresses three core issues in VLA training: reliance on large-scale demonstration data, limited generalization, and efficient Sim-to-Real transfer [31][32].
- The findings suggest RL can let VLA models explore superior strategies, paving the way for autonomous, adaptive robotic systems [31][32].
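The outcome-reward recipe described above — sample trajectories, score each by binary task success, update the policy toward the successful ones — can be sketched with a generic REINFORCE-style update on a toy discrete task. This is not SimpleVLA-RL's implementation; the bandit-like environment, baseline, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_step(logits, task_success, lr=0.5, n_rollouts=16):
    """One RL step driven purely by binary outcome rewards.

    logits:       (A,) policy logits over A discrete action choices
    task_success: callable action -> 1.0 on success, 0.0 otherwise
    """
    probs = softmax(logits)
    actions = rng.choice(len(logits), size=n_rollouts, p=probs)
    rewards = np.array([task_success(a) for a in actions])
    baseline = rewards.mean()              # simple variance-reduction baseline
    grad = np.zeros_like(logits)
    for a, r in zip(actions, rewards):
        g = -probs                         # grad of log pi(a): -p ...
        g[a] += 1.0                        # ... plus 1 at the taken action
        grad += (r - baseline) * g
    return logits + lr * grad / n_rollouts

success = lambda a: float(a == 2)          # toy task: only action 2 succeeds
logits = np.zeros(4)
for _ in range(50):
    logits = reinforce_step(logits, success)
print(softmax(logits)[2])  # probability of the successful action after training
```

The point of the sketch is that no demonstration of *how* to succeed is ever given, only whether a rollout succeeded, mirroring how outcome-level rewards can reduce dependence on demonstration data and leave room for strategies beyond the demonstrations (as with the "Pushcut" behavior the article reports).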
Advice From Experience: Start Working on a Short Paper as Soon as You Join Your Lab in the First Year of Grad School!
自动驾驶之心· 2025-09-14 23:33
Core Insights
- The article emphasizes starting research early in the academic journey to secure resources and achieve results, which can yield a competitive advantage in academia [1][2].
- It highlights the benefits of publishing short papers early: better chances at scholarships, preparation for the graduation thesis, and stronger research skills [4].

Research Process
- The suggested timeline:
  - Week 1: Determine the research direction and select three candidate topics.
  - Weeks 2-3: Complete the literature review and build the paper framework.
  - Weeks 4-6: Conduct experimental design and data collection.
  - Weeks 7-8: Complete the first draft.
  - Weeks 9-10: Finalize and polish the draft.
  - Weeks 11-12: Select journals for submission and await acceptance [5].

Service Offerings
- The company provides personalized paper-guidance services, including real-time interaction with mentors, unlimited access to recorded sessions, and 24-hour support [14].
- It claims a high success rate, with a 96% acceptance rate for students over the past three years [6].

Target Audience
- The services target computer science graduate students seeking to increase their research output, gain experience, and strengthen their academic profiles for job applications or further study [13][18].