XPeng VLA (小鹏VLA)
He Xiaopeng "bets" with Musk: can XPeng become the "Chinese Tesla"?
The following article is from Auto有范儿, by Wang Xiaojuan (王小娟). Auto有范儿 tells distinctive automotive stories.

Lead: Will the 2026 XPeng Ultra exceed the rest of the industry's intelligent driving by dozens of times or more?

Before this year is even over, the auto industry has already booked next year's "battle" over leadership in intelligent assisted driving. On December 11, XPeng chairman and CEO He Xiaopeng posted a "bet" on social media: if, by August 30 next year, XPeng's VLA technology in China reaches the overall level of Tesla FSD V14.2 in Silicon Valley, he will set up a Chinese-style canteen in Silicon Valley; if it fails, head of autonomous driving Liu Xianming (刘先明) will "streak" across the Golden Gate Bridge.

He Xiaopeng made this bet, openly challenging Tesla, partly on the strength of his long experience with and observation of Tesla's FSD system. He believes the future will be an era of a single autonomous driving system and hardware stack, in which people can directly own cars with L4 capability; this judgment is closely tied to XPeng's own choice of technical route.

While publicly praising Tesla FSD, He Xiaopeng does not hide the current gap between XPeng VLA and FSD. He admitted: "The first version of XPeng VLA cannot yet fully deliver all the capabilities of FSD V14.2," while also revealing that XPeng VLA 2.0 will be released next quarter.

As for the catch-up timetable, He Xiaopeng said in August this year that the current first-tier players in intelligent assisted driving are "evenly matched," and ...
Comparing XPeng and Li Auto VLA based on accurate primary material
理想TOP2 · 2025-11-20 10:42
Core Viewpoint
- The article discusses advances in autonomous driving technology, focusing on the VLA (Vision-Language-Action) architecture developed by Li Auto and on insights shared by XPeng's autonomous driving head, Liu Xianming, during a podcast. Liu emphasizes removing the intermediate language component (L) to improve scalability and data efficiency [1][4][5].

Summary by Sections

VLA Architecture and Training Process
- The VLA architecture starts with a pre-training phase using a 32-billion-parameter (32B) vision-language model that incorporates 3D vision and high-definition 2D vision, improving clarity 3-5x over open-source models; it also ingests driving-related language data and key VL joint data [10][11].
- The model is distilled into a 3.2-billion-parameter (3.2B) MoE model so inference stays fast on vehicle hardware; a post-training phase then adds action to form the VLA, bringing the parameter count to nearly 4 billion (a toy sketch of this staged pipeline follows this summary) [13][12].
- The reinforcement-learning phase has two parts: reinforcement learning from human feedback (RLHF) and pure reinforcement learning on world-model-generated data, targeting comfort, collision avoidance, and adherence to traffic regulations [15][16].

Data Utilization and Efficiency
- Liu argues that using language as a supervisory signal can introduce human biases, reducing data efficiency and scalability. The hardest data to collect are corner cases, which are crucial for training [4][6].
- The architecture aims for a high level of generalization, with plans to launch L4 robotaxi services in Guangzhou on the current framework [4][5].

Future Directions and Challenges
- Liu acknowledges uncertainties in scaling the technology and ensuring safety, asking how to maintain safety standards and align the model with human behavior [5][18].
- The conversation notes that VLA, VLM, and world models are all fundamentally end-to-end architectures, with many companies working on similar concepts under the banner of Physical AI [5][18].

Human-Agent Interaction
- The driver agent processes short commands directly, while complex instructions are sent to the cloud for processing before execution; this lets the system understand and interact with the physical world like a human driver [17][18].
- The article concludes that the traffic domain is well suited to VLA deployment because its rules are well defined and human driving behavior can be modeled effectively [19][20].
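The staged pipeline summarized above (pre-train a large vision-language teacher, distill it into a small on-vehicle model, then attach an action head trained on recorded driving) can be illustrated with a toy sketch. This is a minimal sketch under stated assumptions, not XPeng's or Li Auto's actual implementation: the class names, layer sizes, and losses are hypothetical placeholders, the student is a plain dense network rather than the MoE the article describes, and the RLHF / world-model RL stages are omitted.

```python
# Minimal, illustrative sketch of the staged VLA training pipeline described above.
# All module names, sizes, and losses are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TeacherVL(nn.Module):
    """Stand-in for the large vision-language teacher (the article cites ~32B params)."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.vision = nn.Linear(3 * 32 * 32, dim)   # toy image encoder
        self.text = nn.Embedding(vocab, dim)        # toy language embedding
        self.head = nn.Linear(dim, vocab)

    def forward(self, img, tok):
        h = self.vision(img.flatten(1)) + self.text(tok).mean(1)
        return self.head(h)                          # next-token-style logits


class StudentVLA(nn.Module):
    """Stand-in for the distilled on-vehicle model plus an action head (dense, not MoE)."""
    def __init__(self, dim=128, vocab=1000, action_dim=2):
        super().__init__()
        self.vision = nn.Linear(3 * 32 * 32, dim)
        self.text = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)
        self.action_head = nn.Linear(dim, action_dim)  # e.g. steering + acceleration

    def forward(self, img, tok):
        h = self.vision(img.flatten(1)) + self.text(tok).mean(1)
        return self.head(h), self.action_head(h)


def distill_step(teacher, student, img, tok, temperature=2.0):
    """One knowledge-distillation step: match student logits to frozen teacher logits."""
    with torch.no_grad():
        t_logits = teacher(img, tok)
    s_logits, _ = student(img, tok)
    return F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    )


def action_step(student, img, tok, expert_action):
    """Post-training step: imitate recorded driving actions (behaviour cloning)."""
    _, pred_action = student(img, tok)
    return F.mse_loss(pred_action, expert_action)


if __name__ == "__main__":
    teacher, student = TeacherVL(), StudentVLA()
    img = torch.randn(4, 3, 32, 32)
    tok = torch.randint(0, 1000, (4, 8))
    act = torch.randn(4, 2)
    loss = distill_step(teacher, student, img, tok) + action_step(student, img, tok, act)
    loss.backward()
    print(float(loss))
```

The point of the sketch is the ordering of stages, not the model internals: distillation keeps the on-vehicle model small enough for fast inference, and the action head is bolted on afterwards, which mirrors the "VL first, then +A" sequence the article attributes to the training process.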
A year of working on autonomous-driving VLA
自动驾驶之心 · 2025-11-19 00:03
Core Viewpoint
- The article discusses the emergence and significance of Vision-Language-Action (VLA) models in the autonomous driving industry, highlighting their potential to unify perception, reasoning, and action in a single framework and thereby address the limitations of previous models [3][10][11].

Summary by Sections

What is VLA?
- VLA models are multimodal systems that integrate vision, language, and action, allowing a more comprehensive understanding of and interaction with the environment [4][7].
- The concept originated in robotics and was popularized in autonomous driving for its potential to improve interpretability and decision-making [3][9].

Why VLA Emerged
- The evolution of autonomous driving falls into several phases: modular systems, end-to-end models, and Vision-Language Models (VLM), each with its own limitations [9][10].
- VLA models emerged as a response to these shortcomings, providing a unified framework that strengthens both understanding and action execution [10][11].

VLA Architecture Breakdown
- The VLA model architecture consists of three layers: input (multimodal data), processing (integration of inputs), and output (action generation); a minimal interface sketch follows this summary [12][16].
- Inputs include visual data from cameras, sensor data from LiDAR and RADAR, and language inputs for navigation and interaction [13][14].
- The processing layer fuses these inputs to generate driving decisions, while the output layer produces control commands and trajectory planning [18][20].

Development History of VLA
- The article outlines the historical context of VLA development, emphasizing its role in advancing autonomous driving by addressing the need for better interpretability and action alignment [21][22].

Key Innovations in VLA Models
- Recent models such as LINGO-1 and LINGO-2 focus on integrating natural-language understanding with driving actions, enabling more interactive and responsive driving systems [22][35].
- Innovations include explaining driving decisions in natural language and following complex verbal instructions, which build user trust and system transparency [23][36].

Future Directions
- The article asks whether language will remain necessary in future VLA models, suggesting that as the technology matures the role of language may evolve or diminish [70].
- It stresses continuous learning and innovation to keep pace with technological advances and user expectations [70].
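The three-layer decomposition above (multimodal input, fused processing, action output) can be made concrete with a minimal interface sketch. This is an assumption-laden illustration of the data flow only: the class names, field names, and the constant-velocity "planner" are hypothetical and stand in for the fused transformer a real VLA would run.

```python
# Minimal interface sketch of the input -> processing -> output layers described above.
# Class and field names are illustrative, not taken from any production system.
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VLAInput:
    camera_frames: Sequence[bytes]   # multi-view camera images
    lidar_points: Sequence[tuple]    # (x, y, z, intensity) points
    radar_tracks: Sequence[tuple]    # (range, azimuth, velocity) tracks
    nav_instruction: str             # e.g. "take the second exit at the roundabout"


@dataclass
class VLAOutput:
    trajectory: List[tuple]          # planned (x, y, t) waypoints
    control: tuple                   # (steering, throttle, brake) command
    explanation: str                 # natural-language rationale (LINGO-style)


class VLAPlanner:
    """Toy processing layer: fuse the inputs and emit a placeholder plan."""

    def plan(self, obs: VLAInput) -> VLAOutput:
        # A real model would run a fused multimodal transformer here; this
        # placeholder just emits a constant-velocity trajectory to show the flow.
        waypoints = [(0.0, float(i) * 2.0, float(i) * 0.2) for i in range(10)]
        return VLAOutput(
            trajectory=waypoints,
            control=(0.0, 0.3, 0.0),
            explanation=f"Following instruction: {obs.nav_instruction!r}",
        )


if __name__ == "__main__":
    obs = VLAInput(
        camera_frames=[b""],
        lidar_points=[(1.0, 2.0, 0.0, 0.5)],
        radar_tracks=[(30.0, 0.1, -2.0)],
        nav_instruction="keep in the current lane",
    )
    out = VLAPlanner().plan(obs)
    print(out.control, out.trajectory[:2], out.explanation)
```

Keeping the three layers as explicit typed boundaries is what makes the "unified framework" claim testable: everything upstream of VLAInput is perception plumbing, everything downstream of VLAOutput is vehicle control, and the model itself owns only the mapping between them.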