自动驾驶之心
In-Depth Survey | 300+ Papers to Help You Understand: How Pure Vision Is Pushing VLA to the Peak of Autonomous Driving and Embodied Intelligence!
自动驾驶之心· 2025-09-24 23:33
Core Insights
- The emergence of Vision Language Action (VLA) models signifies a paradigm shift in robotics from traditional strategy-based control to general-purpose robotic technology, transforming Vision Language Models (VLMs) from passive sequence generators into active agents capable of executing operations and making decisions in complex, dynamic environments [1][5][11]

Summary by Sections

Introduction
- Robotics has historically relied on pre-programmed instructions and control strategies for task execution, primarily in simple, repetitive tasks [5]
- Recent advances in AI and deep learning have enabled the integration of perception, detection, tracking, and localization technologies, leading to the development of embodied intelligence and autonomous driving [5]
- Current robots often operate as "isolated agents," lacking effective interaction with humans and external environments, prompting researchers to explore the integration of Large Language Models (LLMs) and VLMs for more precise and flexible robotic operations [5][6]

Background
- The development of VLA models marks a significant step towards general embodied intelligence, unifying visual perception, language understanding, and executable control within a single modeling framework [11][16]
- The evolution of VLA models is supported by breakthroughs in single-modal foundational models across computer vision, natural language processing, and reinforcement learning [13][16]

VLA Models Overview
- VLA models have developed rapidly thanks to advances in multi-modal representation learning, generative modeling, and reinforcement learning [24]
- The core design of VLA models integrates visual encoding, LLM reasoning, and decision-making frameworks, aiming to bridge the gap between perception, understanding, and action [23][24]

VLA Methodologies
- VLA methods are categorized into five paradigms: autoregressive, diffusion models, reinforcement learning, hybrid methods, and specialized approaches, each with distinct design motivations and core strategies [6][24]
- Autoregressive models focus on sequential generation of actions based on historical context and task instructions, demonstrating scalability and robustness [26][28]

Applications and Resources
- VLA models apply across robotic domains, including robotic arms, quadrupedal robots, humanoid robots, and wheeled robots (autonomous vehicles) [7]
- The development of VLA models relies heavily on high-quality datasets and simulation platforms to address data scarcity and the high risk of real-world testing [17][21]

Challenges and Future Directions
- Key challenges in the VLA field include data limitations, reasoning speed, and safety concerns, which must be addressed to accelerate the development of VLA models and general robotic technologies [7][18]
- Future research directions are outlined to enhance VLA capabilities, focusing on improving data diversity, strengthening reasoning mechanisms, and ensuring safety in real-world applications [7][18]

Conclusion
- The review argues for a clear taxonomy of pure VLA methods, highlighting the distinctive features and innovations of each category and surveying the resources needed to train and evaluate VLA models [9][24]
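The autoregressive paradigm the survey describes — actions discretized into tokens and generated step by step, conditioned on the instruction and the action history — can be sketched in a few lines. Everything below (the 256-bin quantization range, the toy policy) is a hypothetical stand-in for an LLM decoder head, not code from any surveyed model.

```python
# Sketch of autoregressive action generation: continuous action values are
# quantized into discrete tokens, and a policy emits one token per step
# conditioned on the instruction plus all previously emitted tokens.

def quantize(value, low=-1.0, high=1.0, bins=256):
    """Map a continuous action dimension to a discrete token id."""
    value = max(low, min(high, value))
    return int((value - low) / (high - low) * (bins - 1))

def dequantize(token, low=-1.0, high=1.0, bins=256):
    """Map a token id back to its bin-center continuous value."""
    return low + (token + 0.5) * (high - low) / bins

def generate_actions(policy, instruction, horizon):
    """Autoregressive rollout: each step conditions on the full history."""
    tokens = []
    for _ in range(horizon):
        tokens.append(policy(instruction, tokens))
    return [dequantize(t) for t in tokens]

# Toy deterministic "policy" standing in for an LLM decoder head.
toy_policy = lambda instr, hist: (len(instr) + len(hist) * 7) % 256
actions = generate_actions(toy_policy, "pick up the cup", horizon=4)
```

The quantize/dequantize pair is what lets an LLM-style next-token decoder drive a continuous actuator: the vocabulary simply grows by `bins` action tokens per control dimension.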
Latest from XJTLU & HKUST! A Survey of Foundation Models for Trajectory Prediction
自动驾驶之心· 2025-09-24 23:33
Core Insights
- The article discusses the application of large language models (LLMs) and multimodal large language models (MLLMs) in the paradigm shift for autonomous driving trajectory prediction, deepening the understanding of complex traffic scenarios to improve safety and efficiency [1][20]

Summary by Sections

Introduction and Overview
- Integrating LLMs into autonomous driving systems enables a deeper understanding of traffic scenarios, marking a transition from traditional methods to approaches built on large foundation models (LFMs) [1]
- Trajectory prediction is identified as a core technology in autonomous driving, using historical data and contextual information to infer the future movements of traffic participants [5]

Traditional Methods and Challenges
- Traditional vehicle trajectory prediction methods include physics-based approaches (e.g., Kalman filters) and machine learning methods (e.g., Gaussian processes), both of which struggle with complex interactions [8]
- Deep learning methods improve long-term prediction accuracy but face high computational demands and poor interpretability [9]
- Reinforcement learning methods excel at interactive scene modeling but are complex and unstable to train [9]

LLM-Based Vehicle Trajectory Prediction
- LFMs introduce a paradigm shift by discretizing continuous motion states into symbolic sequences, leveraging the semantic modeling capabilities of LLMs [11]
- Key applications of LLMs include trajectory-language mapping, multimodal fusion, and constraint-based reasoning, enhancing interpretability and robustness in long-tail scenarios [11][13]

Evaluation Metrics and Datasets
- The article categorizes datasets for pedestrian and vehicle trajectory prediction, highlighting the importance of datasets such as Waymo and ETH/UCY for evaluating model performance [16]
- Evaluation metrics for vehicles include L2 distance and collision rates, while pedestrian metrics focus on minADE and minFDE [17]

Performance Comparison
- A performance comparison of models on the nuScenes dataset shows that LLM-based methods significantly reduce collision rates and improve long-term prediction accuracy [18]

Discussion and Future Directions
- The widespread adoption of LFMs signals a shift from local pattern matching to global semantic understanding, improving the safety and compliance of generated trajectories [20]
- Future research should focus on low-latency inference techniques, motion-oriented foundation models, and world-perception and causal-reasoning models [21]
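The pedestrian metrics named above, minADE and minFDE, have compact standard definitions: over K candidate trajectories, take the minimum of the mean per-step displacement (minADE) or of the final-step displacement (minFDE) from the ground truth. A plain-Python illustration of those formulas follows (function names are mine, not the survey's):

```python
# minADE / minFDE over K predicted trajectories, each a list of (x, y) points.
import math

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def min_ade(candidates, gt):
    """Min over candidates of the mean per-step displacement error."""
    return min(
        sum(_dist(p, q) for p, q in zip(traj, gt)) / len(gt)
        for traj in candidates
    )

def min_fde(candidates, gt):
    """Min over candidates of the final-step displacement error."""
    return min(_dist(traj[-1], gt[-1]) for traj in candidates)

gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # offset by 1 m at every step
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)],  # only the endpoint drifts
]
# min_ade(preds, gt) -> 0.5/3 ≈ 0.167 ; min_fde(preds, gt) -> 0.5
```

Note that the two metrics can pick different candidates: a trajectory with the best average error need not have the best endpoint, which is why benchmarks report both.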
What Exactly Is the World-Model Route Huawei Is Determined to Take?
自动驾驶之心· 2025-09-24 23:33
1. Introduction

World modeling has become a fundamental task in artificial intelligence (AI) and robotics, with the core goal of equipping agents with the ability to understand, represent, and predict the dynamic environments they inhabit. In recent years, generative modeling techniques, including variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion models, and autoregressive models, have made remarkable progress, greatly enriching the field by enabling sophisticated generation and prediction capabilities.

However, these advances have largely concentrated on 2D data, mainly images and video. By contrast, real-world scenes are inherently three-dimensional and dynamic, and often call for models built on native 3D and 4D representations. Such representations include RGB-D images, occupancy grids, and LiDAR point clouds, as well as temporal formats that capture time dynamics. These modalities provide explicit geometric information and physical grounding, which is essential for embodied systems and safety-critical systems such as autonomous driving and robotics.

Beyond these native formats, world-modeling research has also extended into adjacent areas. Some work targets video, panoramic, or mesh-based data, with systems capable of large-scale, general-purpose video-mesh generation; meanwhile, another line of research focuses on 3D object ...
Why Can VLA Fold Towels yet Fail to Estimate Object Poses Accurately?
自动驾驶之心· 2025-09-24 23:33
Core Viewpoint
- The article presents OnePoseViaGen, a breakthrough solution to 6D object pose estimation in robotics that lets robots interact effectively with unknown objects without relying on pre-existing 3D models [3][4]

Group 1: Challenges in Current Robotics
- Existing models like VLA can perform tasks that do not require precise spatial positioning but struggle with tasks that need 6D pose support, such as grasping unfamiliar objects [3]
- The inability to close the loop between generated models, real objects, and spatial poses is a fundamental limitation of current robotic interaction with the physical world [3][4]

Group 2: OnePoseViaGen Framework
- OnePoseViaGen estimates the 6D pose of unknown objects from only a single reference image, with no pre-built 3D model required [3][4]
- The framework proceeds in three stages: first generating the missing 3D model, then calibrating real-world scale and pose, and finally reducing domain gaps to improve robustness [6][8]

Group 3: Key Research Outcomes
- The first step generates a textured 3D model from a single RGB-D anchor image, ensuring geometric consistency through normal-vector estimation [9][10]
- A two-step alignment strategy refines scale and pose, starting with a coarse alignment followed by a precise optimization pass [11][13][14]
- The final step applies text-guided generative domain randomization to harden the model against variations in texture, lighting, and occlusion [15][16]

Group 4: Performance Validation
- OnePoseViaGen outperforms existing methods across benchmarks, achieving an average ADD of 81.27% and ADD-S of 93.10%, significantly higher than competitors such as Oryon and Any6D [17][18]
- In high-challenge scenarios such as grasping tasks, OnePoseViaGen maintains high accuracy where other methods fail, demonstrating its effectiveness in real-world applications [17][18]

Group 5: Real-World Application
- Tested on real robotic tasks, the framework achieved a 73.3% success rate on grasping and placement, far exceeding baseline methods [22][24]
- Qualitative results show that the generated models closely match real object textures and structures, enabling precise pose estimation even under occlusion [26]

Group 6: Ablation Studies
- Ablations confirm the necessity of the coarse-to-fine alignment and generative domain randomization modules, highlighting their critical roles in the method's robustness [27][29]

Group 7: Conclusion
- OnePoseViaGen is presented as the first pipeline to couple single-image 3D generation with pose estimation, showing that generative modeling can directly improve pose-estimation performance without 3D models or multi-view inputs [30]
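The ADD and ADD-S figures quoted above follow the standard 6D-pose error definitions from the pose-estimation literature: transform the object's model points by both the predicted and ground-truth poses, then average corresponding-point distances (ADD) or nearest-point distances (ADD-S, for symmetric objects). A plain-Python sketch of those metrics, not the paper's code:

```python
# ADD / ADD-S pose errors. A pose is (R, t): a 3x3 rotation as nested
# lists plus a translation 3-tuple; points are (x, y, z) tuples.
import math

def transform(points, R, t):
    """Apply rotation R and translation t to every point."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def add_metric(points, pose_gt, pose_pred):
    """ADD: mean distance between corresponding transformed points."""
    gt = transform(points, *pose_gt)
    pred = transform(points, *pose_pred)
    return sum(math.dist(a, b) for a, b in zip(gt, pred)) / len(points)

def add_s_metric(points, pose_gt, pose_pred):
    """ADD-S: each GT point pairs with its nearest predicted point."""
    gt = transform(points, *pose_gt)
    pred = transform(points, *pose_pred)
    return sum(min(math.dist(a, b) for b in pred) for a in gt) / len(points)

I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
pts = [(0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]
err = add_metric(pts, (I, (0, 0, 0)), (I, (0, 0, 0.02)))  # pure 2 cm shift
```

Benchmark scores are usually reported as the fraction of test poses whose ADD falls under a threshold (commonly 10% of the object diameter), which is how a percentage like 81.27% arises from a distance metric.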
End-to-End Based on Imitation Learning Has an Upper Bound That Cannot Surpass Humans
自动驾驶之心· 2025-09-24 06:35
Core Viewpoint
- The article traces the evolution of end-to-end (E2E) autonomous driving from rule-based to data-driven approaches and highlights the limits of current models in complex scenarios, introducing Vision Language Models (VLMs) and Vision Language Action (VLA) models as potential ways to extend the capabilities of autonomous driving systems [2][3]

Summary by Sections

Introduction to VLA
- VLA represents a shift from merely imitating human behavior to understanding and interacting with the physical world, addressing the limits of traditional E2E models in complex driving scenarios [2]

Challenges in Autonomous Driving
- The VLA technology stack is still evolving, with new algorithms continually emerging and no convergence yet in the field [3]

Course Overview
- A course titled "Autonomous Driving VLA and Large Model Practical Course" is in preparation, covering VLA's origins, algorithms, and practical applications [5]

Learning Objectives
- The course aims to give a comprehensive understanding of VLA, covering dataset creation, model training, and performance enhancement [5][17]

Course Structure
- The course is organized into chapters on algorithm introduction, foundational knowledge, VLM as an interpreter, modular and integrated VLA, reasoning enhancement, and practical assignments [20][26][31][34][36]

Instructor Background
- The instructors have extensive experience in multimodal perception, autonomous driving, and large-model frameworks [38]

Expected Outcomes
- Participants are expected to understand current advancements in VLA, master the core algorithms, and apply this knowledge in practice [39][40]

Course Schedule
- The course begins on October 20, with a structured release timeline for each chapter [43]
After More Than Half a Year of Waiting, Qwen3-VL Is Finally Open-Source Too!
自动驾驶之心· 2025-09-24 06:35
Core Viewpoint
- The article covers the recent open-source release of several AI models, focusing on Qwen3-VL, its improvements over previous versions, and its performance across tasks

Model Improvements
- Qwen3-VL makes significant changes relative to Qwen2.5-VL across the vision encoder, projector, and LLM decoder: the patch size increased from 14 to 16, and the activation function changed from silu to gelu_pytorch_tanh [6][7]
- The projector now incorporates DeepStack, injecting features from multiple layers of the vision encoder into the LLM [6]

Performance Metrics
- Qwen3-VL's text capabilities are comparable to Qwen3-235B-A22B, with performance metrics tabulated against other leading models [10]
- On specific tasks, Qwen3-VL outperformed mainstream open-source models in OCR, table recognition, and complex visual understanding [11][13][17]

Task-Specific Results
- The model showed strong handwriting recognition and information extraction from complex images, beating previous versions and other models on accuracy [11][13]
- In table recognition, Qwen3-VL extracted and formatted data into HTML, following instructions accurately [17][18]

Overall Assessment
- Qwen3-VL is positioned as a top-tier vision-language model, with substantial gains in data extraction, reasoning, and visual understanding [14][30]
- The article closes on a positive note, calling the model a significant leap forward for vision-language models [106]
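For reference, the two activations involved in the silu-to-gelu_pytorch_tanh change have standard closed forms: SiLU is x·sigmoid(x), and gelu_pytorch_tanh is the tanh approximation of GELU that PyTorch exposes via `approximate='tanh'`. A plain-Python sketch of both definitions (not Qwen's code):

```python
# Standard definitions of the two activation functions named above.
import math

def silu(x):
    """SiLU / swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gelu_tanh(x):
    """Tanh-approximated GELU, matching PyTorch's approximate='tanh'."""
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    ))

# Both are smooth and near-identity for large positive x; they differ
# only slightly around the origin, e.g. silu(1) ≈ 0.731, gelu_tanh(1) ≈ 0.841.
```

The swap is a drop-in change at the architecture level, since both functions are elementwise and zero at the origin; any accuracy difference comes from training, not from the activation's interface.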
Autonomous Driving Heart National Day & Mid-Autumn Festival Promotions Are Live (20% Off Courses / 30% Off Knowledge Planet / Tutoring / Hardware Discounts)
自动驾驶之心· 2025-09-24 04:00
Group 1
- The article promotes discounts and offers on courses, including 20-30% off and reductions of 80 or 99 yuan on specific courses [1][3][4]
- A yearly subscription to the "Big Model Planet" community is available for 99 yuan, covering technology, industry insights, and job-seeking assistance [1]
- The platform offers 1v1 tutoring at up to 1000 yuan off the 5000 yuan fee, and 1v6 paper tutoring with a 1000 yuan reduction [1]

Group 2
- The "Autonomous Driving Heart" section highlights cutting-edge self-driving technology, with nearly 40 learning routes spanning VLA, world models, and closed-loop simulation [6]
- The community enables face-to-face exchange with industry leaders and academic experts, offering insight into where autonomous driving is headed [6]
- Seven premium courses aimed at beginners cover essential skills and knowledge in the field [6]
Planning to Hire Several Experts to Co-Build the Platform (4D Annotation / World Models / VLA and More)
自动驾驶之心· 2025-09-23 23:32
Core Viewpoint
- The article announces the recruitment of business partners for the autonomous driving sector, emphasizing expertise in advanced technologies and offering attractive incentives [2][3][5]

Group 1: Recruitment Details
- The company plans to recruit 10 outstanding partners for autonomous driving course development, paper guidance, and hardware research [2]
- Candidates with expertise in large models, multimodal models, diffusion models, and 3D object detection are particularly welcome [3]
- Preferred qualifications include a master's degree or higher from a QS-200 university, with priority for candidates who have published at top conferences [4]

Group 2: Incentives and Opportunities
- The company offers autonomous-driving-related resource sharing, including job recommendations, PhD openings, and study-abroad guidance [5]
- Attractive cash incentives and opportunities to collaborate on entrepreneurial projects round out the package [5]
What Kind of Technology Makes a Top-Tier Autonomous Driving Company?
自动驾驶之心· 2025-09-23 23:32
Core Viewpoint
- The article surveys the evolution of autonomous driving technology and the competitive landscape among major tech companies, automakers, and startups, showing how these advances are reshaping transportation [2][3]

Group 1: Tesla's Development
- Tesla is recognized as a pioneer in autonomous driving, with an aggressive data-driven approach that discards LiDAR and high-definition maps in favor of pure visual perception [6]
- Its development path runs from modular designs to end-to-end neural networks, aiming to make AI think and drive like a human [6]
- Key technologies include BEV (Bird's Eye View) representation and the Occupancy Network, improving spatial awareness and reducing reliance on high-definition maps [8][12]

Group 2: Huawei's Progress
- Huawei's ADS has evolved from multi-sensor fusion with high-definition maps to a "mapless" approach, strengthening perception algorithms and culminating in end-to-end models [23]
- ADS 1.0 relied on multiple sensors and high-definition maps, while ADS 2.0 marked the breakthrough to "mapless" driving [25][26]
- The latest ADS 3.0 targets full-scene intelligent driving, integrating advanced perception networks and optimizing hardware for better performance [28]

Group 3: Momenta's Strategy
- Momenta pairs data-driven algorithms with mass-produced autonomous driving products, creating a feedback loop for continuous improvement [33]
- The company focuses on low-cost automated mapping and crowd-sourced map updates, strengthening its capabilities in complex environments [35]

Group 4: Horizon's Path
- Horizon has taken a distinctive path from automotive-grade AI chips to full-stack solutions, emphasizing software-hardware co-design for efficiency [47]
- The company has advanced step by step from early ADAS prototypes to L2+ and L3 capabilities, with broader applications planned for 2025 [49][50]

Group 5: Xiaopeng's Evolution
- Xiaopeng's journey reflects a shift from multi-sensor fusion with high-definition maps to a "mapless" approach driven by large AI models [79]
- The XPILOT series has evolved from basic parking assistance to advanced highway and urban navigation, with significant gains in system generalization [81][90]

Group 6: NIO's Development
- NIO has moved cautiously from collaborative development to full-stack in-house R&D, prioritizing safety and reliability [98]
- The introduction of the NIO World Model (NWM) in 2025 marks a new phase in NIO's autonomous driving capabilities, enhancing its cognitive and reasoning abilities [110]
3DGS Reconstruction! A Walkthrough of the gsplat Library Source Code
自动驾驶之心· 2025-09-23 23:32
Core Insights
- The article discusses the implications of OpenAI's video generation model Sora for computer graphics, particularly in relation to 3D Gaussian Splatting (3DGS) and its potential to displace traditional rendering techniques [7][8]

Group 1: 3D Gaussian Splatting (3DGS)
- 3DGS remains a significant research area, with ongoing work applying it to self-driving perception and scene reconstruction [4][9]
- The gsplat library is recommended over the original Gaussian Splatting implementation for its better documentation and maintenance [5]
- 3DGS may also integrate with other technologies, such as NeRF (Neural Radiance Fields), to enhance video generation and scene understanding [4][9]

Group 2: Technical Aspects of Sora and 3DGS
- Sora is positioned as a potential game-changer in computer graphics, possibly a foundational technology for the field [6][7]
- The article walks through technical components of 3DGS, including Gaussian parameters, covariance matrices, and camera coordinate transformations [21][22][30]
- gsplat's compression can reduce the number of Gaussian parameters significantly while preserving quality, which matters for efficient rendering [13][14]

Group 3: Future Prospects and Community Engagement
- The article is optimistic about broader application of "world models" in video generation and scene reconstruction, suggesting that even smaller industry players can benefit [9]
- The community around autonomous driving and related technologies is emphasized, with numerous technical groups and resources available for learning and collaboration [78]
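One of the covariance details mentioned above is worth spelling out: in 3DGS each Gaussian's covariance is typically parameterized as Sigma = R S S^T R^T, where R comes from a unit quaternion and S is a diagonal scale matrix, which guarantees Sigma stays positive semi-definite during optimization. The following is a plain-Python sketch of that construction, not gsplat's actual implementation:

```python
# Build a 3DGS-style covariance matrix from a quaternion and 3 scales.
import math

def quat_to_rot(w, x, y, z):
    """Rotation matrix (nested lists) from a quaternion, normalized first."""
    n = math.sqrt(w*w + x*x + y*y + z*z)
    w, x, y, z = w/n, x/n, y/n, z/n
    return [
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ]

def covariance(quat, scales):
    """Sigma = (R S)(R S)^T with diagonal S = diag(scales)."""
    R = quat_to_rot(*quat)
    M = [[R[i][j] * scales[j] for j in range(3)] for i in range(3)]  # R S
    return [
        [sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]

# With the identity rotation, the covariance is simply diag(scales^2).
sigma = covariance((1.0, 0.0, 0.0, 0.0), (0.1, 0.2, 0.3))
```

Optimizing `quat` and `scales` instead of the six free entries of Sigma is the design choice that keeps every Gaussian a valid ellipsoid no matter what gradient step the optimizer takes.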