自动驾驶之心
Latest VLA Survey | CAS Explains: Model Architectures and Evolution for Embodied Manipulation
自动驾驶之心· 2025-08-30 16:03
**Core Insights**
- The article discusses the emergence and development of Vision-Language-Action (VLA) models, which integrate visual perception, natural language understanding, and action control, marking a significant milestone in the pursuit of general robotic intelligence [3][5].

**Development Stages**
- The development of VLA models is categorized into three stages:
  1. **Emergence Stage**: Initial attempts to connect vision, language, and actions without a formal VLA concept, focusing on visual imitation learning and language annotation [7].
  2. **Exploration Stage**: By mid-2023, the VLA concept was formally introduced, with the Transformer architecture becoming mainstream and enhancing model generalization in open scenarios [8].
  3. **Rapid Development Stage**: Since late 2024, VLA models have undergone rapid iteration, addressing generalization and inference-efficiency issues and evolving from single-layer to multi-layer architectures [9].

**Core Dimensions of VLA Models**
- VLA models consist of three main components:
  1. **Observation Encoding**: Transitioning from CNN and RNN structures to unified architectures such as ViT and cross-modal Transformers, incorporating multi-modal information for enhanced environmental perception [12].
  2. **Feature Inference**: The Transformer architecture has become the backbone, with newer designs such as the Diffusion Transformer and Mixture of Experts enhancing inference capability [14].
  3. **Action Decoding**: Evolving from discrete token representations to continuous control prediction, improving operational precision in real environments [15].

**Training Data for VLA Models**
- VLA training data is categorized into four types:
  1. **Internet Image-Text Data**: Provides rich visual and linguistic priors but lacks dynamic environment understanding [17].
  2. **Video Data**: Contains temporal features of human activities, aiding the learning of complex manipulation skills, though it often lacks precise action annotations [17].
  3. **Simulation Data**: Offers low-cost, scalable, well-annotated data for pre-training and strategy exploration, but requires adaptation for real-world applications [19].
  4. **Real Robot Collected Data**: Directly reflects sensor noise and environmental complexity, crucial for enhancing VLA generalization and reliability, albeit with high collection costs [19].

**Pre-training and Post-training Methods**
- Common pre-training strategies include:
  1. **Single-Domain Data Training**: Early methods focused on single-modal data, providing initial perception and action representation capabilities [21].
  2. **Cross-Domain Staged Training**: Models are pre-trained on large datasets before fine-tuning on robot operation data, effectively exploiting large-scale data priors [21].
  3. **Cross-Domain Joint Training**: Simultaneously utilizes multiple data types to learn the relationships between perception, language, and actions [21].
  4. **Chain-of-Thought Enhancement**: Introduces reasoning chains to enable task decomposition and logical reasoning [21].
- Post-training methods optimize pre-trained VLA models for specific tasks:
  1. **Supervised Fine-tuning**: Uses labeled trajectory data for end-to-end training, strengthening the mapping to action control [22].
  2. **Reinforcement Fine-tuning**: Optimizes model strategies through interaction data, improving adaptability and performance [22].
  3. **Inference Expansion**: Improves performance through better inference procedures without modifying model parameters [22].

**Evaluation of VLA Models**
- The evaluation framework for VLA models includes:
  1. **Real-World Evaluation**: Tests model performance in real environments, providing reliable results but with high cost and low repeatability [24].
  2. **Simulator Evaluation**: Uses high-fidelity simulation platforms for testing, allowing large-scale experiments but with potential discrepancies from real-world performance [24].
  3. **World-Model Evaluation**: Employs learned environment models for virtual assessment, reducing cost but relying on the accuracy of the world model [24].

**Future Directions for VLA Models**
- Future research on VLA models will focus on:
  1. **Generalization Reasoning**: Enhancing the model's ability to adapt to unknown tasks and environments, integrating logical reasoning with robotic operation [26].
  2. **Fine-Grained Operations**: Improving the model's capability on complex tasks by integrating multi-modal sensory information for precise interaction modeling [26].
  3. **Real-Time Inference**: Developing efficient architectures and model compression to meet high-frequency control demands [27].
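The three core dimensions above (observation encoding, feature inference, action decoding) can be sketched as a toy pipeline. This is a minimal illustration with random weights, not any specific model from the survey; the patch size, embedding width, and 7-DoF action dimension are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_observation(image, patch=8):
    """Observation encoding: split the image into ViT-style patches
    and project each patch to a token embedding (toy random projection)."""
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    W_embed = rng.standard_normal((patch * patch * c, 64)) * 0.02
    return patches @ W_embed                 # (num_patches, 64) token sequence

def infer_features(tokens):
    """Feature inference: one self-attention step -- the Transformer
    backbone the survey describes, reduced to a single head."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

def decode_action(features, action_dim=7):
    """Action decoding: pool the features and regress a continuous
    control vector (e.g. 7-DoF arm deltas) instead of discrete tokens."""
    pooled = features.mean(axis=0)
    W_act = rng.standard_normal((features.shape[1], action_dim)) * 0.02
    return pooled @ W_act

image = rng.random((64, 64, 3))              # stand-in camera frame
action = decode_action(infer_features(encode_observation(image)))
print(action.shape)   # (7,)
```

The shift from discrete action tokens to the continuous regression head shown here mirrors the Action Decoding evolution the survey describes.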
Land an Autonomous Driving Perception Role! Only One Seat Left in the 1-on-6 Trajectory Prediction Small-Group Course
自动驾驶之心· 2025-08-30 16:03
**Group 1**
- The core viewpoint of the article emphasizes the importance of trajectory prediction in autonomous driving and related fields, noting that end-to-end methods are not yet widely adopted and that trajectory prediction remains a key area of research [1][3].
- The article discusses the integration of diffusion models into trajectory prediction, which significantly enhances multi-modal modeling capability; models such as the Leapfrog Diffusion Model (LED) achieve real-time prediction by accelerating inference roughly 19-30x on various datasets [2][3].
- The course aims to provide a systematic understanding of trajectory prediction, combining theoretical knowledge with practical coding skills and assisting students in developing their own models and writing research papers [6][8].

**Group 2**
- The target audience for the course includes graduate students and professionals in trajectory prediction and autonomous driving who seek to strengthen their research capabilities and follow cutting-edge developments in the field [8][10].
- The course offers a comprehensive curriculum that includes classic and cutting-edge papers, baseline code, and methodologies for selecting research topics, conducting experiments, and writing papers [20][30].
- The course structure comprises 12 weeks of online group research followed by 2 weeks of paper guidance, ensuring participants gain practical experience and produce a research-paper draft by the end of the program [31][35].
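The denoising loop behind diffusion-based trajectory prediction can be caricatured in a few lines. This is a toy sketch, not the LED model: the "denoiser" below is an oracle that already knows the target trajectory, standing in for the learned network, and the linear schedule is an assumption. LED's leapfrog trick additionally skips most of these steps with a learned initializer, which is where its speedup comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

T, steps = 12, 50                       # 12 future (x, y) waypoints
target = np.stack([np.linspace(0, 11, T), np.zeros(T)], axis=1)

traj = rng.standard_normal((T, 2))      # start from pure Gaussian noise
for t in range(steps):
    alpha = (t + 1) / steps             # simple linear schedule (assumed)
    predicted_clean = target            # oracle stand-in for the denoiser net
    # blend the current noisy sample toward the predicted clean trajectory,
    # re-injecting a shrinking amount of noise as in DDPM-style sampling
    traj = (1 - alpha) * traj + alpha * predicted_clean \
         + (1 - alpha) * 0.1 * rng.standard_normal((T, 2))

error = np.abs(traj - target).max()
print(error < 1e-6)   # True
```

Because each reverse step is a full network call in a real model, cutting the step count (as LED does) directly cuts inference latency.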
Tier-1 Leader Bosch's End-to-End Finally Reaches Mass Production, and It's One-Stage!
自动驾驶之心· 2025-08-30 16:03
**Core Viewpoint**
- The article discusses advancements in autonomous driving technology, focusing on WePilot AiDrive, a new end-to-end ADAS solution developed by WeRide, which aims to enhance the driving experience and safety through advanced AI capabilities [5][9][10].

**Group 1: WeRide's New Technology**
- WeRide has launched a new end-to-end ADAS solution named WePilot AiDrive, which is set to be mass-produced within the year [5].
- The system integrates sensor-data input and vehicle-trajectory output into a single model, enhancing the efficiency and responsiveness of autonomous driving [10][24].
- The new system demonstrates improved performance in complex driving scenarios, such as navigating urban villages and recognizing pedestrians in challenging lighting conditions [12][14][24].

**Group 2: Comparison with Previous Systems**
- The previous two-stage design used separate perception and control models, which often led to information loss and a limited understanding of driving environments [25][30].
- The new one-stage model learns the relationship between input data and output trajectories directly, significantly improving system performance [33].
- The transition from a rule-based approach to an integrated model aims to overcome the limitations of earlier systems, which struggled with generalization and adaptability [32][35].

**Group 3: Market Implications**
- The collaboration between WeRide and Bosch aims to make advanced driving capabilities accessible across vehicle price segments, not just high-end models [41][44].
- Currently, fewer than 20% of vehicles in the Chinese market are equipped with advanced intelligent-driving features, indicating significant growth potential for WeRide's technology [42].
- The goal is to push L2+ capability beyond the "value inflection point," making advanced driving technology mainstream [44].
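The one-stage vs. two-stage distinction the article draws can be sketched abstractly. This is a hypothetical illustration, not WePilot AiDrive's architecture: the feature sizes and random linear "models" are placeholders, chosen only to show where the two-stage interface discards information.

```python
import numpy as np

rng = np.random.default_rng(0)
sensor = rng.random(128)                # stand-in for fused sensor features

def two_stage(sensor):
    """Two-stage: perception compresses the scene into a small hand-crafted
    interface (e.g. a few object boxes), and planning sees only that.
    Everything outside the interface is lost at the module boundary."""
    interface = sensor[:8]              # lossy bottleneck between the models
    W_plan = rng.standard_normal((8, 20)) * 0.1
    return (interface @ W_plan).reshape(10, 2)    # 10 (x, y) waypoints

def one_stage(sensor):
    """One-stage: a single model maps the raw features straight to the
    trajectory, so no information is discarded mid-pipeline."""
    W = rng.standard_normal((128, 20)) * 0.1
    return (sensor @ W).reshape(10, 2)

print(two_stage(sensor).shape, one_stage(sensor).shape)  # (10, 2) (10, 2)
```

The sketch makes the article's point concrete: the two-stage planner can never react to anything the perception interface fails to encode, whereas the one-stage model is free to learn which raw features matter.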
Closed-Loop End-to-End Jumps 20%! HUST & Xiaomi Build the Open-Source Framework ORION
自动驾驶之心· 2025-08-30 16:03
**Core Viewpoint**
- The article discusses advancements in end-to-end (E2E) autonomous driving technology, particularly the introduction of the ORION framework, which integrates vision-language models (VLMs) for improved decision-making in complex environments [3][30].

**Summary by Sections**

**Introduction**
- Recent progress in E2E autonomous driving faces challenges in complex closed-loop interactions due to limited causal reasoning capability [3][12].
- VLMs offer new hope for E2E autonomous driving, but a significant gap remains between a VLM's semantic reasoning space and the numerical action space required for driving [3][17].

**ORION Framework**
- ORION is proposed as an end-to-end autonomous driving framework that utilizes visual-language instructions for trajectory generation [3][18].
- The framework incorporates QT-Former for aggregating long-term historical context, a VLM for scene understanding and reasoning, and a generative model to align the reasoning and action spaces [3][16][18].

**Performance Evaluation**
- ORION achieved a driving score of 77.74 and a success rate of 54.62% on the challenging Bench2Drive benchmark, outperforming previous state-of-the-art (SOTA) methods by 14.28 points and a 19.61% higher success rate [5][24].
- The framework demonstrated superior performance in specific driving scenarios such as overtaking (71.11%), emergency braking (78.33%), and traffic-sign recognition (69.15%) [26].

**Key Contributions**
1. QT-Former enhances the model's understanding of historical scenes by effectively aggregating long-term visual context [20].
2. The VLM enables multi-dimensional analysis of driving scenes, integrating user instructions and historical information for action reasoning [21].
3. The generative model aligns the VLM's reasoning space with the action space for trajectory prediction, ensuring reasonable driving decisions in complex scenarios [22].
**Conclusion**
- ORION provides a novel solution for E2E autonomous driving by achieving semantic and action-space alignment, integrating long-term context aggregation, and jointly optimizing visual understanding and path-planning tasks [30].
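The idea behind QT-Former, compressing a long visual history into a fixed-size memory the VLM can consume, can be reduced to a single cross-attention step. A minimal numpy sketch with random weights; the query count and dimensions are assumptions for illustration, not ORION's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate_history(history_tokens, num_queries=4, dim=32):
    """QT-Former-style aggregation reduced to one cross-attention step:
    a small fixed set of learnable queries attends over an arbitrarily
    long token history, compressing it into a constant-size memory."""
    queries = rng.standard_normal((num_queries, dim)) * 0.02
    scores = queries @ history_tokens.T / np.sqrt(dim)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over history
    return weights @ history_tokens                 # (num_queries, dim)

# 200 frames' worth of history compressed into 4 memory tokens
history = rng.standard_normal((200, 32))
memory = aggregate_history(history)
print(memory.shape)   # (4, 32)
```

Because the output size is fixed regardless of history length, downstream reasoning cost stays constant even as context grows, which is the point of query-based aggregation.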
It's Decided! Going for Autonomous Driving Algorithms After All
自动驾驶之心· 2025-08-30 04:03
**Core Viewpoint**
- The article emphasizes the growing interest and opportunities in the autonomous driving sector, particularly in roles related to end-to-end systems, VLA (Vision-Language-Action), and reinforcement learning, which are among the highest-paying positions in the AI industry [1][2].

**Summary by Sections**

**Community and Learning Resources**
- The "Autonomous Driving Heart Knowledge Planet" community has over 4,000 members and aims to grow to nearly 10,000 in the next two years, providing a platform for technical sharing and job-related discussion [1].
- The community offers a comprehensive collection of over 40 technical routes, including learning paths for end-to-end autonomous driving, VLA benchmarks, and practical engineering practice [2][5].
- Members can access a variety of resources, including video content, Q&A sessions, and practical problem-solving related to autonomous driving technologies [1][2].

**Technical Learning and Career Development**
- The community provides structured learning paths for beginners, including full-stack courses suitable for those with no prior experience [7][9].
- Job-referral mechanisms within the community connect members with openings at various autonomous driving companies [9][11].
- The community regularly engages industry experts to discuss trends, technological advancements, and mass-production challenges [4][62].

**Industry Insights and Trends**
- The article highlights the need for talent in the autonomous driving industry, particularly for tackling the challenges of L3/L4-level mass production [1].
- There is a focus on the importance of dataset iteration speed relative to technological advancement, especially as AI enters the era of large models [63].
- The community aims to foster a complete ecosystem for autonomous driving, bringing together academic and industrial insights [12][64].
Business Partner Recruitment Is Open! Model Deployment / VLA / End-to-End Directions
自动驾驶之心· 2025-08-29 16:03
**Group 1**
- The article announces the recruitment of 10 partners for the autonomous driving sector, focusing on course development, paper guidance, and hardware research [2][5].
- The recruitment targets individuals with expertise in advanced fields such as large models, multi-modal models, and 3D object detection [3][4].
- The article highlights the benefits of joining, including resource sharing for job seeking, PhD recommendations, and substantial cash incentives [5][6].
A Q&A Walkthrough of End-to-End Deployment: [UniAD/PARA-Drive/SparseDrive/VADv2]
自动驾驶之心· 2025-08-29 16:03
**Core Viewpoint**
- The article discusses various end-to-end models in autonomous driving, focusing on their architectures and functionality, particularly the UniAD framework and its modular components for perception, prediction, and planning [4][13].

**Group 1: End-to-End Models**
- End-to-end models are categorized into two types: completely black-box models like OneNet, which optimize the planner directly, and modular end-to-end models that reduce error accumulation through interactions between perception, prediction, and planning modules [3].
- The UniAD framework consists of four main parts: multi-view camera input, a backbone for BEV feature extraction, perception for scene-level understanding, and prediction for multi-mode trajectory forecasting [4].

**Group 2: Specific Model Architectures**
- TrackFormer utilizes three types of queries: detection, tracking, and ego queries, with a dynamic length for the tracking query set based on object disappearance [6].
- MotionFormer operates similarly to RNN structures, processing sequential blocks to predict future states based on previous outputs, focusing on agent-level knowledge [9].
- MapFormer employs Panoptic SegFormer for environment segmentation, distinguishing countable instances from uncountable elements [10].

**Group 3: Advanced Techniques**
- PARA-Drive modifies the UniAD framework by adjusting the connections between perception, prediction, and planning modules, allowing parallel training and improved inference speed [13].
- Symmetric sparse perception is divided into two parallel parts for agent detection and map perception, utilizing a DETR paradigm for both tasks [20].
- The planning Transformer integrates various tokens to output action probabilities, selecting the most probable action based on human trajectory data [23].
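The dynamic-length tracking query set mentioned for TrackFormer can be illustrated with simple bookkeeping: queries persist while their object keeps matching, are dropped after enough consecutive misses, and new detections spawn fresh queries. A hypothetical sketch; the dict-based query records and `max_misses` threshold are assumptions for illustration, not UniAD's implementation.

```python
def update_track_queries(track_queries, matched_ids, new_detections,
                         max_misses=3):
    """Keep queries whose object was matched this frame, retire queries
    missed too many frames in a row, and spawn queries for new detections
    -- hence the tracking query set's dynamic length."""
    survivors = []
    for q in track_queries:
        if q["id"] in matched_ids:
            q["misses"] = 0                 # object seen again: reset counter
            survivors.append(q)
        elif q["misses"] + 1 < max_misses:
            q["misses"] += 1                # tolerate a brief disappearance
            survivors.append(q)
        # else: object gone too long, query is dropped
    next_id = max((q["id"] for q in survivors), default=-1) + 1
    for _ in new_detections:
        survivors.append({"id": next_id, "misses": 0})
        next_id += 1
    return survivors

queries = [{"id": 0, "misses": 0}, {"id": 1, "misses": 2}]
queries = update_track_queries(queries, matched_ids={0},
                               new_detections=["new_car"])
print([q["id"] for q in queries])   # [0, 1]
```

Here query 1, already missed twice, is retired on its third miss, while the new detection receives a fresh query; the set length changes frame to frame exactly as the article describes.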
**Group 4: Community and Learning Resources**
- The article highlights the establishment of numerous technical discussion groups related to autonomous driving, covering over 30 learning paths and involving nearly 300 companies and research institutions [27][28].
Huawei Firmly Rejects the VLA Route: Is WA the Ultimate Solution for Autonomous Driving?
自动驾驶之心· 2025-08-29 16:03
**Core Viewpoint**
- Huawei's automotive business has achieved significant milestones, including 1 million vehicles equipped with its driving technology and over 1 million lidar units shipped, showcasing its long-term strategic vision in the automotive sector [3][4].

**Group 1: Achievements and Strategy**
- As of July, 1 million vehicles have been equipped with Huawei's QianKun intelligent driving system, and cumulative assisted-driving mileage has reached 4 billion kilometers [3].
- Huawei's automotive business has been investing since 2014, focusing on R&D rather than immediate commercialization, which has led to current profitability [4][5].
- The company has launched 28 models in collaboration with various brands, indicating a strong market presence [3].

**Group 2: Technology Approach**
- Huawei prefers the World Action (WA) model over the Vision-Language-Action (VLA) model for achieving true autonomous driving, believing WA is the more direct and effective approach [5][13].
- The WA model processes information directly from inputs such as vision, sound, and touch, bypassing the need to convert data into language [5][14].
- Huawei has developed the WEWA model based on the WA architecture, which will be deployed in ADS 4.0 [6].

**Group 3: Business Model and Pricing**
- Huawei's CEO emphasizes that there is no such thing as a free service in the automotive industry; costs are merely hidden or transferred [7][17].
- The company believes charging for assisted-driving systems is justified by the ongoing cost of updates and maintenance throughout the vehicle's lifecycle [8][18].
- Huawei's lifecycle-management approach ensures that users receive continuous upgrades, enhancing their experience over time [18].

**Group 4: Future Plans**
- Huawei aims to achieve L3 capability for highway driving and L4 pilot capability in urban areas by 2026, with plans for large-scale commercial use by 2028 [11].
- The company is also working on transforming the intelligent cockpit into a "digital nanny," integrating AI to enhance the user experience [11].

**Group 5: Safety and Technology Enhancements**
- Huawei's increased sensor configurations, such as additional lidars, are driven by a commitment to safety rather than merely raising product prices [19][20].
- The company focuses on enhancing system precision to prevent accidents and improve user safety across driving scenarios [20][22].
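The contrast the article draws between the VLA and WA routes can be caricatured as two toy pipelines: one verbalizes perception into language before acting, the other maps sensor signals to action directly. Purely illustrative, not Huawei's or anyone's actual system; the `pedestrian_m` field and the braking threshold are invented for the sketch.

```python
def vla_pipeline(sensor_frame):
    """VLA route (caricature): perception is first rendered into language
    tokens, and the action is reasoned about through that linguistic
    bottleneck -- the indirection the article says Huawei wants to avoid."""
    caption = f"pedestrian at {sensor_frame['pedestrian_m']} m"  # perception -> language
    return "brake" if "pedestrian" in caption else "cruise"      # language -> action

def wa_pipeline(sensor_frame):
    """WA route (caricature): raw multi-sensor signals map to action
    directly, with no intermediate translation into language."""
    return "brake" if sensor_frame["pedestrian_m"] < 20 else "cruise"

frame = {"pedestrian_m": 12}
print(vla_pipeline(frame), wa_pipeline(frame))   # brake brake
```

Even this toy version shows the claimed weakness of the language detour: whatever the caption fails to express (here, the exact distance) is unavailable to the action stage.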
Switching to the Autonomous Driving Track? Don't Stumble into Pitfalls! Bookmark These Public Accounts to Save a Year of Detours
自动驾驶之心· 2025-08-29 16:03
With autonomous driving technology iterating at breakneck speed, do you often find yourself stuck like this: wanting to keep up with industry frontiers, yet drowning in fragmented information and unable to find precisely focused, high-quality content; wanting to dig deep into a subdomain, such as perception algorithms, vehicle-road cooperation, or regulations and standards, yet unable to find a vertically specialized platform; wanting to exchange ideas with fellow practitioners and expand your network, yet trapped in the small circle of a single community with no wider space for discussion?

Today, understanding your needs and frustrations, we have joined forces with several high-quality WeChat public accounts in the autonomous driving vertical to launch this cross-promotion event for practitioners and enthusiasts. Each of these accounts has spent years cultivating a different track of autonomous driving: some focus on technical analysis, breaking down complex algorithms in plain language; some track industry dynamics, capturing policy and market changes as they happen; some emphasize application scenarios, showing how autonomous driving lands in logistics, mobility, and beyond. Here you no longer need to spend hours filtering information: a single follow unlocks multiple professional perspectives and more comprehensive, deeper, and more valuable autonomous driving content, while connecting you with like-minded peers and broadening your industry view. Now, let's meet these gem accounts!

700+ interpretations of global automotive regulations and standards plus introductory intelligent-driving explainers, followed by 20,000+ practitioners. Focused on intelligent driving and global automotive policies, regulations, and standards, it is known for professional, cutting-edge content and is well loved by automotive enthusiasts and industry ...
A Robotics Offer Harvester: This "Whampoa Academy" of the Embodied Field Is No Small Thing...
自动驾驶之心· 2025-08-29 10:26
**Core Viewpoint**
- The article highlights the growth and development of the "Embodied Intelligence Knowledge Planet," a community focused on embodied intelligence that has seen rising membership and successful job placements in the field [1][2].

**Community Development**
- The community has nearly 2,000 members and aims to reach 10,000 in the next two years, providing a platform for knowledge sharing and technical discussion [1][2].
- It offers various resources including video tutorials, Q&A sessions, and job-exchange opportunities, addressing practical issues faced by members [1][2][4].

**Educational Resources**
- The community has compiled over 30 technical routes for members, covering topics such as robot simulation, data collection, and various learning methodologies [2][13].
- It provides a comprehensive list of open-source projects, datasets, and industry reports related to embodied intelligence, facilitating easier access to information for both beginners and advanced researchers [13][20][27].

**Networking and Job Opportunities**
- The community has established a job-referral mechanism with several leading companies in the field, allowing members to connect with potential employers [6][14].
- Members can engage with industry experts through forums and live sessions, enhancing their understanding of current trends and job-market dynamics [4][14].

**Technical Focus Areas**
- The community covers a wide range of technical topics, including reinforcement learning, multi-modal models, and robotic navigation, providing structured learning paths for various interests [13][40][65].
- It emphasizes practical applications in industry, offering insight into the latest advancements and challenges in embodied intelligence [2][20][46].