Vision-Zero: Zero-Data VLM Self-Evolution! Yiran Chen's Team Proposes a New Zero-Supervision Training Paradigm
机器之心· 2025-10-11 03:29
Core Insights
- The article discusses Vision-Zero, a self-play framework for Vision-Language Models (VLMs) that aims to overcome the limitations of traditional training methods, which rely heavily on human-annotated data and reinforcement-learning rewards [6][7][26].

Background
- VLMs perform impressively on multimodal tasks but face data scarcity driven by high annotation costs, as well as a knowledge ceiling that limits model capabilities [6].
- Vision-Zero introduces a self-play strategy that lets VLMs generate complex reasoning data autonomously, eliminating the need for manual annotation [6].

Framework Characteristics
- Vision-Zero employs a self-play framework based on social reasoning games, enabling agents to generate high-complexity reasoning data during self-play [6].
- It accepts any form of image as input, enhancing the model's ability to generalize across domains [6].
- An iterative self-play policy-optimization algorithm addresses the performance bottlenecks common in traditional self-play methods [7].

Game Design
- Inspired by social reasoning games, Vision-Zero defines rules under which agents must deduce hidden roles from subtle differences between images, fostering complex reasoning chains [12][15].
- The game requires only two images with slight differences, making data construction simple and cost-effective [17].

Training Methodology
- A dual-phase alternating training scheme avoids local equilibria and knowledge saturation, encouraging the model to explore new reasoning paths [20].
- This scheme has been shown to significantly outperform single-phase training across a range of tasks [20].

Experimental Results
- Vision-Zero demonstrates strong task generalization, outperforming state-of-the-art methods that require annotated data across multiple benchmark datasets [22].
- Models trained with Vision-Zero effectively mitigate the negative-transfer issues commonly seen in VLMs, maintaining performance across different tasks [24].

Implications
- Vision-Zero illustrates the feasibility and potential of self-play for moving from single-task to general-task applications, breaking free of manual annotation and knowledge limitations [26].
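The dual-phase alternating scheme described above could be sketched roughly as follows: phase A uses self-play on image pairs to generate reasoning data without human labels, and phase B optimizes the policy on that freshly generated data. All function names and the control flow here are hypothetical illustrations, not the authors' actual code.

```python
# Hypothetical sketch of a dual-phase alternating self-play training loop.
# Phase A: self-play on "two images with slight differences" generates
# reasoning data. Phase B: policy optimization on the generated rollouts.

def self_play_round(policy, image_pair):
    """One hidden-role game round: the policy produces a reasoning
    transcript, then a voting step yields the game outcome (reward)."""
    transcript = policy("describe", image_pair)
    reward = policy("vote", transcript)
    return transcript, reward

def alternating_train(policy, update_policy, image_pairs, iterations=3):
    """Alternate data generation (phase A) and optimization (phase B),
    rather than interleaving them, to escape local equilibria."""
    for _ in range(iterations):
        # Phase A: self-play generates high-complexity reasoning data.
        rollouts = [self_play_round(policy, pair) for pair in image_pairs]
        # Phase B: iterative self-play policy optimization on the rollouts
        # (in the paper's setting, an RL objective on game rewards).
        policy = update_policy(policy, rollouts)
    return policy
```

In a real setup, `policy` would be the VLM and `update_policy` the RL step; here they are plain callables so the schedule itself is visible.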
What Stage Has Autonomous Driving VLA Reached? Is It Still a Good Time to Do Research?
自动驾驶之心· 2025-09-22 08:04
Core Insights
- The article discusses the transition in intelligent-driving technology from rule-driven to data-driven approaches, highlighting the emergence of VLA (Vision-Language-Action) as a more straightforward and effective method than traditional end-to-end systems [1][2].
- It emphasizes the challenges in the current VLA technology stack, whose complexity and fragmented knowledge make it difficult for newcomers to enter the field [2][3].
- A new practical VLA course has been developed to address these challenges, providing a structured learning path for students seeking advanced knowledge in autonomous driving [3][4][5].

Summary by Sections

Introduction to VLA
- VLA is introduced as a significant advancement in autonomous driving, offering a cleaner approach than traditional end-to-end systems while addressing corner cases more effectively [1].

Challenges in Learning VLA
- Learners face a complex, fragmented knowledge landscape, with a plethora of algorithms and a lack of high-quality documentation [2].

Course Development
- A new course, "Autonomous Driving VLA Practical Course," provides a comprehensive overview of the VLA technology stack, aiming to ease entry into the field [3][4].

Course Features
- The course addresses key pain points, offering quick entry into the subject through accessible language and examples [3].
- It builds a framework for understanding VLA research and strengthens research skills by teaching students to categorize papers and extract their innovations [4].
- Practical components ensure theoretical knowledge is applied in real-world scenarios [5].

Course Outline
- Topics include the origins of VLA, foundational algorithms, and the differences between modular and integrated VLA systems [6][15][19][20].
- Practical coding exercises and projects reinforce learning and application of the concepts [22][24][26].

Instructor Background
- The course is led by experienced instructors with strong backgrounds in multimodal perception, autonomous driving, and large-model frameworks [27].

Learning Outcomes
- Upon completion, students are expected to have a thorough understanding of current VLA advancements and core algorithms, and the ability to apply their knowledge in practice [28][29].
What Research Directions Are There in the VLA Route That XPeng and Li Auto Are Pushing Hard On?
自动驾驶之心· 2025-09-17 23:33
Core Viewpoint
- The article discusses the transition in intelligent-driving technology from rule-driven to data-driven approaches, highlighting the limitations of end-to-end models in complex scenarios and the potential of VLA (Vision-Language-Action) as a more streamlined solution [1][2].

Group 1: Challenges in Learning and Research
- The autonomous-driving VLA technical stack has not yet converged, leading to a proliferation of algorithms that makes the field difficult for newcomers to enter [2].
- A lack of high-quality documentation and fragmented cross-domain knowledge raise the entry barrier for beginners in autonomous-driving VLA research [2].

Group 2: Course Development
- A new course, "Autonomous Driving VLA Practical Course," has been developed to address the challenges learners face, focusing on a comprehensive understanding of the VLA technical stack [3][4].
- The course offers a one-stop opportunity to build knowledge across visual perception, language modules, and action modules while integrating cutting-edge technologies [2][3].

Group 3: Course Features
- The course emphasizes quick entry via a just-in-time learning approach, using plain language and case studies so students can grasp core technologies rapidly [3].
- It builds research capability, enabling students to categorize papers and extract innovative points to form their own research systems [4].
- Hands-on sessions close the loop from theory to practice [5].

Group 4: Course Outline
- The course covers the origins of autonomous-driving VLA, foundational algorithms, and the differences between modular and integrated VLA [6][10][12].
- Practical sessions cover dataset creation, model training, and performance enhancement, providing a comprehensive learning experience [12][14][16].

Group 5: Instructor Background
- The instructors have extensive experience in multimodal perception, autonomous-driving VLA, and large-model frameworks, with numerous publications in top-tier conferences [22].

Group 6: Learning Outcomes
- Upon completion, students are expected to thoroughly understand current advancements in autonomous-driving VLA and to master the core algorithms [23][24].
- The course is designed to benefit students in internships, job recruitment, and further academic pursuits [26].

Group 7: Course Schedule
- The course begins October 20, with a structured timeline for unlocking chapters and support through online Q&A sessions [27].
Is There "Pure-Blood VLA" in Autonomous Driving? A Rundown of What VLM Can Actually Do in Autonomous Driving
自动驾驶之心· 2025-09-06 16:05
Core Viewpoint
- The article discusses the challenges and methodologies involved in building datasets for autonomous driving, focusing on the VLA (Vision-Language-Action) model and its applications in trajectory prediction and scene understanding [1].

Dataset Handling
- Datasets vary in camera count; the model handles this by processing variable numbers of image tokens without needing the camera count as an explicit input [2].
- Output trajectories are expressed in the vehicle's current coordinate frame as relative (x, y) values rather than image coordinates; mapping them onto images requires additional camera parameters [6].
- The model generally adheres to the specified output format; occasional deviations are corrected with Python post-processing that normalizes the format [8][9].

Trajectory Prediction
- VLA trajectory prediction differs from traditional methods by incorporating scene-understanding capability through QA training, improving prediction for dynamic objects such as vehicles and pedestrians [11].
- Dataset construction faced data-quality issues and inconsistent coordinate formats, which were addressed through rigorous data cleaning and standardization [14][15].

Data Alignment and Structure
- Alignment converts the various dataset formats into a unified relative displacement in the vehicle's coordinate frame, organized in a QA format covering trajectory prediction and dynamic-object forecasting [18].
- The input consists of images plus trajectory points from the previous 1.5 seconds, used to predict trajectory points over the next 5 seconds, adhering to the SANA standard [20].

Community and Resources
- The "Autonomous Driving Heart Knowledge Planet" community focuses on cutting-edge autonomous-driving technologies, covering nearly 40 technical directions and fostering collaboration between industry and academia [22][24].
- The community offers a comprehensive learning platform, including video tutorials, Q&A sessions, and job opportunities in the autonomous-driving sector [28][29].
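The data-alignment scheme above (past ego-relative trajectory points in, future points out, all in the vehicle's current coordinate frame) can be illustrated with a minimal sketch. The sampling details, field names, and QA template below are assumptions for illustration, not the dataset's actual schema.

```python
import math

def to_ego_frame(points, ego_xy, ego_yaw):
    """Convert world-frame (x, y) waypoints into the vehicle's current
    frame: translate by the ego position, then rotate by -yaw so the
    ego heading becomes the +x axis."""
    ex, ey = ego_xy
    c, s = math.cos(-ego_yaw), math.sin(-ego_yaw)
    out = []
    for x, y in points:
        dx, dy = x - ex, y - ey
        out.append((dx * c - dy * s, dx * s + dy * c))
    return out

def build_qa_sample(images, past_world, future_world, ego_xy, ego_yaw):
    """Pack one training sample: images plus the past 1.5 s of ego-relative
    trajectory as the question context, the next 5 s as the answer.
    The dict keys and question wording are hypothetical."""
    return {
        "images": images,
        "question": "Given the past 1.5 s ego trajectory, predict the next 5 s.",
        "past": to_ego_frame(past_world, ego_xy, ego_yaw),
        "answer": to_ego_frame(future_world, ego_xy, ego_yaw),
    }
```

Because every dataset is reduced to the same relative-displacement representation, samples from sources with different camera counts and coordinate conventions can be mixed in one training corpus.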
The Starting Point of End-to-End VLA: A Look at Large Language Models and CLIP
自动驾驶之心· 2025-08-19 07:20
Core Viewpoint
- The article discusses the development and significance of end-to-end (E2E) algorithms in autonomous driving, emphasizing how advanced technologies such as large language models (LLMs), diffusion models, and reinforcement learning (RL) enhance the capabilities of autonomous systems [21][31].

Summary by Sections

Section 1: Overview of End-to-End Autonomous Driving
- The first chapter surveys the evolution of end-to-end algorithms, explaining the transition from modular approaches to end-to-end solutions and the advantages and challenges of each paradigm [40].

Section 2: Background Knowledge
- The second chapter covers the technical stack around end-to-end systems, detailing LLMs, diffusion models, and reinforcement learning, which are crucial for understanding the future job market in this field [41][42].

Section 3: Two-Stage End-to-End Systems
- The third chapter examines two-stage end-to-end systems: their emergence, advantages and disadvantages, and notable works such as PLUTO and CarPlanner [42][43].

Section 4: One-Stage End-to-End and VLA
- The fourth chapter highlights one-stage end-to-end systems, including perception-based methods and the latest advances in VLA (Vision-Language-Action), which are pivotal to the ultimate goals of autonomous driving [44][50].

Section 5: Practical Application and RLHF Fine-Tuning
- The fifth chapter centers on a major project in RLHF (Reinforcement Learning from Human Feedback) fine-tuning, with practical guidance on building pre-training and reinforcement-learning modules applicable to VLA-related algorithms [52].

Course Structure and Learning Outcomes
- The course aims to equip participants with a solid understanding of end-to-end autonomous-driving technologies, covering essential frameworks and methodologies and preparing them for industry roles [56][57].
An Autonomous Driving Fall-Recruitment Discussion Group Has Been Created!
自动驾驶之心· 2025-08-18 23:32
Core Viewpoint
- The article emphasizes the convergence of autonomous-driving technology, indicating a shift from numerous diverse approaches to more unified models, which raises the industry's technical barriers [1].

Group 1
- Many directions that previously each required algorithm engineers are consolidating into unified models such as "one model," VLM, and VLA [1].
- The article encourages building a large community to support people in the industry, highlighting the limitations of individual effort [1].
- A new job- and industry-focused community is being launched to facilitate discussion of industry trends, company developments, product research, and job opportunities [1].
VLA R&D Progress at Automakers and Tech Companies
Group 1: Li Auto
- Li Auto's i8 features the VLA "driver model," a significant advance in intelligent driving following the earlier VLM introduction [1].
- The VLA model includes a newly designed spatial encoder and uses language-model reasoning to produce driving decisions, predicting the trajectories of other vehicles and pedestrians with a diffusion model [1].
- VLA inference runs at approximately 10 Hz, more than triple the previous VLM's rate of 3 Hz [1].

Group 2: XPeng Motors
- XPeng G7 deliveries officially began on July 7, with a clear timeline for the Ultra version's VLA and VLM software updates [2].
- The VLA software OTA update is scheduled for September 2025, VLM upgrades for November 2025, and personalized recommendations for December 2025 [2].
- The G7 Ultra is equipped with three self-developed Turing AI chips totaling 2250 TOPS of computing power, positioning it as a leader among mass-produced models [2].

Group 3: Chery Automobile
- Chery plans to bring VLA and world-model technology to fuel vehicles by 2025 through its Falcon 900 intelligent-driving system, aiming to set a new benchmark for "oil-electric intelligence" [3].
- The Falcon 900 system uses a self-developed VLA model that integrates visual perception, language understanding, and action execution [3].
- The model was trained on 20 million kilometers of real-world data, understands over 5,000 traffic scenarios, and recognizes non-standard traffic signals in complex urban conditions with 92% accuracy, a 37% improvement over traditional systems [3].

Group 4: Geely Automobile
- Geely is actively developing VLA technology, integrating it with world models to create a comprehensive world-model system [4].
- The Qianli Haohan system features a "dual end-to-end model" design in which a multimodal VLA general-scene model and an end-to-end model back each other up [4].
- The system runs on dual NVIDIA Thor chips with a combined 1400 TOPS and over 40 perception units, able to detect 0.75-meter objects from 300 meters away [4].

Group 5: Yuanrong Qihang
- Yuanrong Qihang is also investing in the VLA model, with five models expected to feature it by the third quarter of this year [5].
- The company was among the earliest to publicly announce VLA development, in June of last year [5].
- Its VLA model focuses on defensive driving with four core functions: spatial-semantic understanding, recognition of irregular obstacles, comprehension of text-based guide signs, and voice control of the vehicle, to be released gradually with mass production [5].
Autonomous Driving Fall-Recruitment & Experienced-Hire Job-Hunting Groups Have Been Created!
自动驾驶之心· 2025-08-04 23:33
Core Viewpoint
- The article emphasizes the convergence of autonomous-driving technology, highlighting the shift from numerous diverse approaches to more unified models, which indicates higher technical barriers in the industry [1].

Group 1
- The industry is moving toward unified solutions with models like "one model," VLM, and VLA, suggesting a reduced need for large numbers of algorithm engineers [1].
- The article encourages building a large community to support industry professionals, facilitating growth and collaboration among peers [1].
- A new job-focused community is being launched to discuss industry trends, company developments, product research, and job opportunities [1].
Countdown to Launch! China's First Project-Level End-to-End Autonomous Driving Course Is Here
自动驾驶之心· 2025-08-02 06:00
Core Viewpoint
- End-to-end (E2E) autonomous driving is currently the core algorithm for mass production in intelligent driving; significant advances in VLM/VLA systems have driven high demand for related positions, with salaries reaching up to 1 million annually [2][11].

Group 1: Industry Trends
- The E2E concept has evolved significantly and various technical schools have emerged, yet many still struggle to understand how it works and how single-stage and two-stage approaches differ [2][4].
- The introduction of VLA (Vision-Language-Action) is seen as a new frontier in autonomous driving, with companies actively researching and developing next-generation mass-production solutions [21][22].

Group 2: Educational Initiatives
- A new course, "End-to-End and VLA Autonomous Driving," has been launched to address the challenges newcomers face, focusing on practical applications and theoretical foundations [14][27].
- The course aims to provide a comprehensive understanding of E2E autonomous driving, covering various models and methodologies, including diffusion models and reinforcement learning [6][19][21].

Group 3: Job Market Insights
- The job market for VLA/VLM algorithm experts is robust: positions requiring 3-5 years of experience pay 40K-70K monthly, indicating strong demand for skilled professionals [11][12].
- Roles such as VLA model quantization-deployment engineer and multimodal VLA model expert are particularly sought after, reflecting the industry's shift toward advanced algorithmic solutions [11][12].
Fall Recruitment Is in Full Swing! The 自动驾驶之心 Job-Hunting Group Is Here
自动驾驶之心· 2025-07-28 03:15
Group 1
- The article highlights growing anxiety among job seekers, particularly students and professionals looking to transition into new fields in pursuit of better opportunities [1].
- The autonomous-driving landscape is becoming more standardized, shifting from many directions that each required algorithm engineers toward unified models like "one model," VLM, and VLA, indicating higher technical barriers [1].
- It emphasizes community building to support career growth and industry knowledge, announcing a job-focused community for discussing industry trends, company developments, and job opportunities [1].