FastDriveVLA
AAAI 2026 | XPENG and Peking University Tailor a Visual Token Pruning Method for VLA Models
具身智能之心· 2026-01-05 01:03
Core Viewpoint
- The article discusses the development of FastDriveVLA, a new framework for efficient visual token pruning in end-to-end autonomous driving systems, which significantly reduces computational costs and improves inference efficiency [1][8].

Group 1: Research Background and Problem
- End-to-end autonomous driving shows great potential to transform future transportation systems, learning the entire driving process within a unified framework and thus reducing errors in information transfer between modules [7].
- Existing VLA models convert visual inputs into a large number of visual tokens, leading to significant computational overhead and increased inference latency, posing challenges for real-world deployment [7][8].
- Previous research on reducing visual tokens has limitations in autonomous driving scenarios: new designs often require retraining the entire model, and pruning strategies based on attention or similarity may retain irrelevant information [7][8].

Group 2: Methodology and Innovations
- FastDriveVLA introduces a novel, reconstruction-based visual token pruning framework specifically tailored for end-to-end autonomous driving [8].
- The research team hypothesized that visual tokens related to foreground information are more valuable than those related to background content, leading to the creation of the nuScenes-FG dataset, which includes 241,000 images with foreground annotations [2][13].
- The lightweight, plug-and-play pruner, ReconPruner, is designed to identify and select meaningful foreground visual tokens, using a masked-image-modeling approach for pixel reconstruction (a minimal sketch of this scoring-and-pruning idea follows below) [16][19].

Group 3: Experimental Results
- FastDriveVLA achieved state-of-the-art (SOTA) performance on the nuScenes open-loop planning benchmark, demonstrating significant efficiency improvements [2][20].
- When the number of visual tokens was reduced from 3,249 to 812, FastDriveVLA's FLOPs decreased by approximately 7.5x, prefill time by 3.7x, and decode time by 1.3x, enhancing inference efficiency [26][27].
- The framework outperformed existing methods across various pruning ratios; at a 50% pruning rate in particular, it maintained balanced performance across all metrics [25][28].

Group 4: Efficiency Analysis
- FastDriveVLA's efficiency was analyzed in terms of FLOPs and CUDA latency, showing a significant reduction in computational requirements while maintaining high performance [26][27].
- At a 25% pruning rate, FastDriveVLA demonstrated the best performance across all evaluation metrics, indicating that focusing on foreground-related visual tokens is crucial for enhancing autonomous driving performance [28].
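The digest describes ReconPruner only at a high level (a reconstruction-trained scorer followed by token selection). Below is a minimal sketch of that idea, assuming an MAE-style lightweight decoder plus a per-token significance head and a simple top-k keep rule; all module names, dimensions, and interfaces are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ReconPrunerSketch(nn.Module):
    """Minimal sketch of a reconstruction-based token scorer.

    A small decoder reconstructs masked image patches from visual tokens,
    and a scoring head assigns each token a keep-score; names and sizes
    are illustrative, not taken from the FastDriveVLA code.
    """

    def __init__(self, token_dim: int = 1024, patch_pixels: int = 14 * 14 * 3):
        super().__init__()
        # MAE-style lightweight decoder: token -> reconstructed patch pixels
        self.decoder = nn.Sequential(
            nn.Linear(token_dim, token_dim // 2),
            nn.GELU(),
            nn.Linear(token_dim // 2, patch_pixels),
        )
        # Per-token significance head used at pruning time
        self.score_head = nn.Linear(token_dim, 1)

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, D) visual tokens from the frozen vision encoder
        recon = self.decoder(tokens)                   # (B, N, patch_pixels)
        scores = self.score_head(tokens).squeeze(-1)   # (B, N)
        return recon, scores


def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the top-k tokens by significance score (k = keep_ratio * N)."""
    b, n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    top_idx = scores.topk(k, dim=1).indices.sort(dim=1).values   # keep original order
    gather_idx = top_idx.unsqueeze(-1).expand(b, k, d)
    return tokens.gather(1, gather_idx)                          # (B, k, D)


if __name__ == "__main__":
    pruner = ReconPrunerSketch()
    vis_tokens = torch.randn(2, 3249, 1024)       # e.g. six cameras' worth of tokens
    _, sig = pruner(vis_tokens)
    kept = prune_tokens(vis_tokens, sig, keep_ratio=0.25)  # keep 25%: 3,249 -> 812
    print(kept.shape)                              # torch.Size([2, 812, 1024])
```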
AAAI 2026 | XPENG and Peking University Tailor a Visual Token Pruning Method for VLA Models, Making End-to-End Autonomous Driving More Efficient
机器之心· 2026-01-04 05:43
Core Insights
- The article discusses the increasing application of VLA models in end-to-end autonomous driving systems, highlighting the challenges posed by long visual token sequences that significantly raise computational costs [2][8].
- A new paradigm for efficient visual token pruning in autonomous driving VLA models is introduced through the paper "FastDriveVLA," co-authored by Xiaopeng Motors and Peking University [2][5].
- The research proposes that visual tokens related to foreground information are more valuable than those related to background content, leading to the development of a large-scale annotated dataset, nuScenes-FG, containing 241,000 images with foreground-area annotations [2][13].

Summary by Sections

Research Background and Issues
- End-to-end autonomous driving shows great potential to transform future transportation systems, learning the entire driving process within a unified framework [6].
- Existing VLA models convert visual inputs into numerous visual tokens, resulting in significant computational overhead and increased inference latency, posing challenges for real-world deployment [8].

Methodology and Innovations
- FastDriveVLA is a novel, reconstruction-based visual token pruning framework tailored for end-to-end autonomous driving VLA models [10].
- The framework includes a lightweight, plug-and-play pruner called ReconPruner, which identifies and selects meaningful foreground visual tokens using a masked-image-modeling approach [16][18].
- An adversarial foreground-background reconstruction strategy is introduced to enhance ReconPruner's ability to distinguish foreground tokens from background tokens (a hedged sketch of one possible form of this objective follows below) [19].

Experimental Results
- FastDriveVLA demonstrates state-of-the-art performance across various pruning ratios on the nuScenes open-loop planning benchmark [20][25].
- When the number of visual tokens is reduced from 3,249 to 812, FastDriveVLA cuts FLOPs by approximately 7.5x and significantly improves CUDA inference latency [26].
- The framework outperforms existing methods, particularly at a 50% pruning ratio, achieving balanced performance across all metrics [25].

Efficiency Analysis
- FastDriveVLA's efficiency is highlighted by its substantial reduction in FLOPs and CUDA latency, showcasing its potential for real-time applications in autonomous driving [26][27].
- At a 25% pruning rate, FastDriveVLA shows the best performance across all evaluation metrics, indicating that focusing on foreground-related visual tokens is crucial for enhancing autonomous driving performance [28].
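The articles name the adversarial foreground-background reconstruction strategy but do not give its loss. A hedged sketch of one plausible reading is below: foreground patches (from nuScenes-FG-style masks) are rewarded for accurate reconstruction while background patches are adversarially pushed toward poor reconstruction, so significance concentrates on foreground tokens. The function name, weighting, and exact formulation are assumptions; the real FastDriveVLA objective may differ.

```python
import torch
import torch.nn.functional as F

def adversarial_fg_bg_recon_loss(recon: torch.Tensor,
                                 target: torch.Tensor,
                                 fg_mask: torch.Tensor,
                                 bg_weight: float = 0.5) -> torch.Tensor:
    """Hedged sketch of a foreground-favoring reconstruction objective.

    recon, target: (B, N, P) reconstructed / ground-truth patch pixels
    fg_mask:       (B, N), 1.0 for patches overlapping annotated foreground,
                   0.0 otherwise (nuScenes-FG-style masks).
    """
    per_patch = F.mse_loss(recon, target, reduction="none").mean(dim=-1)  # (B, N)
    fg_loss = (per_patch * fg_mask).sum() / fg_mask.sum().clamp(min=1.0)
    bg_loss = (per_patch * (1.0 - fg_mask)).sum() / (1.0 - fg_mask).sum().clamp(min=1.0)
    # Minimize foreground error while maximizing background error (adversarial term)
    return fg_loss - bg_weight * bg_loss
```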
[Weekly View] XPENG and Peking University Release FastDriveVLA; Remaining Bullish on the Automotive Sector
Investment Highlights
- The automotive sector performed well this week, with the SW passenger vehicle and SW auto parts sub-sectors leading with gains of +3.3% each, followed by SW automotive (+2.7%) and SW commercial vehicles (+1.1%), while SW commercial passenger vehicles declined 2.2% [4][12].
- The top-performing stocks this week include Yatong Co., Hengshuai Co., Xusheng Group, Yinlun Co., and Shuanghuan Transmission [4][12].

Research Outcomes
- The team released its 2026 strategy report for automotive parts and a monthly report on buses [5][12].

Industry Core Changes
1. Xiaopeng Motors collaborated with Peking University on a paper published at the international AI conference AAAI 2026, addressing the core contradiction between high computational load and precise decision-making in autonomous driving and showcasing both technological breakthrough and commercial feasibility [6][12].
2. Horizon Robotics launched mass production of the Digua robot S100P and officially released the Vbot super robot dog [6][12].
3. The Zhiji brand is expected to achieve full-cost profitability for the first time by December 2025 [6][12].

Current Automotive Sector Configuration
- The automotive industry is entering a new crossroads phase: the electric vehicle (EV) dividend is ending, automotive intelligence is dawning, and robotics innovation is in the 0-to-1 stage. Three main investment lines are emerging during this transition [8][13].
- **AI Smart Vehicle Main Line**: Focus on Robotaxi/Robovan, with downstream application core targets including:
  - **Robotaxi perspective**: integrated players such as Tesla, Xiaopeng Motors, and Qianli Technology; technology providers with operation-sharing models such as Horizon Robotics and Baidu; transformation of ride-hailing/taxi services by companies such as Cao Cao Mobility and Didi [8][13].
  - **Robovan perspective**: key focus on Desay SV and others [8][13].
  - **C-end vehicle sales perspective**: whole vehicles from Xiaopeng Motors, Li Auto, Huawei, Xiaomi, etc. [8][13].
  - **Upstream supply chain core targets**: B-end vehicle manufacturing by companies such as BAIC Blue Valley, GAC Group, and SAIC Group; core suppliers in testing, chips, domain controllers, sensors, and more [8][13].
- **AI Robotics Main Line**: preferred component suppliers such as Top Group, Junsheng Electronics, and Xinquan Technology [8][13].
- **Dividend & Good Pattern Main Line**: focus on buses (Yutong Bus), heavy trucks (China National Heavy Duty Truck Group, Weichai Power), and two-wheelers (Chunfeng Power, Longxin General) [9][13].
Automotive Weekly View: XPENG and Peking University Release FastDriveVLA, Remaining Bullish on the Automotive Sector - 20251229
Soochow Securities· 2025-12-29 11:09
Investment Rating
- The report maintains a positive outlook on the automotive sector, particularly highlighting the potential of AI-driven vehicles and related technologies [1][3].

Core Insights
- The automotive industry is at a crossroads, transitioning from electric-vehicle dividends to a focus on intelligent vehicles and robotics innovation [3].
- Key developments include the collaboration between Xiaopeng Motors and Peking University on the FastDriveVLA model, which addresses significant challenges in autonomous driving [2][3].
- The report anticipates a significant increase in the penetration of L3 and L2+ intelligent driving technologies by 2025, with market growth driven by major players such as Tesla and Huawei [48][49].

Summary by Sections

Market Performance
- This week the automotive sector outperformed the market, with passenger vehicles and auto parts showing the best performance, each up 3.3% [2][3].
- The automotive sector ranked 11th among A-share sectors and 18th among Hong Kong sectors this week [7][9].

Investment Opportunities
- The report identifies three main investment themes: AI smart vehicles, robotics, and traditional vehicle segments benefiting from favorable market conditions [3].
- Key investment targets include:
  - **AI Smart Vehicles**: focus on Robotaxi and Robovan models, with companies such as Tesla, Xiaopeng, and Horizon Robotics highlighted [3].
  - **Robotics**: emphasis on component suppliers and companies involved in humanoid robots and related technologies [54][60].
  - **Traditional Vehicles**: companies such as Yutong Bus and China National Heavy Duty Truck are expected to benefit from ongoing demand and favorable policies [50][53].

Sales Forecasts
- Domestic retail sales of passenger vehicles are projected to reach 23.62 million units in 2025, a year-on-year increase of 3.8% [45].
- The penetration rate of new energy vehicles is expected to reach 55.4% in 2025, with significant growth in both domestic and export markets [49][53].

Key Companies and Developments
- Notable companies include Xiaopeng Motors, Li Auto, and Horizon Robotics, with significant advancements in technology and profitability reported [3][59].
- The report highlights the importance of strategic partnerships and technological advancements in driving future growth within the sector [60].
XPENG-Peking University Collaborative Research Accepted by AAAI 2026: Introducing a Novel Visual Token Pruning Framework for Autonomous Driving
Prnewswire· 2025-12-29 05:35
Core Insights
- XPENG, in collaboration with Peking University, has developed FastDriveVLA, a novel visual token pruning framework for autonomous driving AI; the work has been accepted at AAAI 2026, a prestigious AI conference with an acceptance rate of 17.6% [1][10].

Technology Development
- FastDriveVLA focuses on efficient visual token pruning, allowing the AI to prioritize essential visual information while filtering out irrelevant data, enabling autonomous driving systems to "drive like a human" [2][4].
- The framework employs an adversarial foreground-background reconstruction strategy to strengthen the model's ability to retain valuable tokens, achieving a significant reduction in computational load [5].

Performance Metrics
- On the nuScenes autonomous driving benchmark, FastDriveVLA demonstrated state-of-the-art performance, achieving a nearly 7.5x reduction in computational load when visual tokens were cut from 3,249 to 812, while maintaining high planning accuracy (a short arithmetic check of this reduction follows below) [5].

Industry Recognition
- This marks the second recognition for XPENG at a top-tier global AI conference in 2025, following its participation in CVPR WAD, where it presented advancements in autonomous driving foundation models [6].
- XPENG's commitment to achieving L4-level autonomous driving is underscored by its full-stack in-house capabilities, spanning model architecture design, training, and vehicle deployment [7].

Company Overview
- XPENG is positioned as a leader in the transformation of future mobility, with R&D centers across China and a global strategy for research, development, and sales, including a presence in the United States and Europe [8][9].
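As a quick sanity check on the quoted figures (an illustrative calculation, not from the paper): going from 3,249 to 812 visual tokens removes roughly three quarters of them, a 4x reduction in token count; the reported ~7.5x FLOPs reduction exceeding that raw ratio is plausible because self-attention cost grows superlinearly with sequence length, though the exact breakdown is not given in the source.

```python
# Token-count arithmetic behind the reported reduction (token counts are from
# the article; the reading of this as a ~75% pruning setting is an inference).
full_tokens, kept_tokens = 3249, 812
print(f"token ratio:   {full_tokens / kept_tokens:.2f}x")    # ~4.00x
print(f"tokens pruned: {1 - kept_tokens / full_tokens:.1%}")  # ~75.0%
```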
XPENG and Peking University Propose a Novel Visual Token Pruning Framework; He Xiaopeng: Another New Breakthrough on the Road to L4
Xin Lang Cai Jing· 2025-12-28 07:56
Core Insights
- The paper "FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning" has been accepted at AAAI 2026, presenting a novel visual token pruning framework designed specifically for end-to-end autonomous driving VLA models [1][8].
- FastDriveVLA introduces a plug-and-play visual token pruner called ReconPruner, which can be inserted directly into an autonomous driving VLA model at inference time without retraining the entire model (a hedged sketch of this integration follows below) [1][8].
- A large-scale dataset, nuScenes-FG, consisting of 241,000 image-mask pairs from six camera perspectives, was created to train the pruning mechanism and can be used broadly in future autonomous driving research [1][4].

Performance Metrics
- Testing on the nuScenes autonomous driving dataset showed that the pruning framework achieved state-of-the-art (SOTA) results at various pruning rates: at a 25% pruning rate, driving performance remained nearly unchanged, with L2 trajectory error and collision rate even better than the unpruned baseline model [2][9].
- At a 50% pruning rate, the model exhibited balanced performance across all metrics while also significantly improving the inference efficiency of the VLA model [2][9].

Technical Innovations
- The FastDriveVLA framework is inspired by human driving behavior, selectively processing information by discarding redundant visual tokens while retaining critical ones [3][11].
- The framework employs a foreground-background adversarial reconstruction strategy to differentiate between essential and non-essential visual tokens, thereby optimizing the model's performance [3][11].
- The visual token pruner, ReconPruner, has only 0.07 billion parameters and is adaptable to various VLA models, underscoring its efficiency [4][12].

Efficiency Improvements
- Comparative analysis of different pruning methods showed that FastDriveVLA outperformed existing techniques at pruning rates of 25%, 50%, and 75%, achieving SOTA results [4][13].
- When the initial number of input tokens was reduced from 3,249 to 812, FastDriveVLA's FLOPs decreased by approximately 7.5x, and CUDA inference latency improved significantly, with prefill accelerated by 3.7x and decoding by 1.3x [4][6][13].
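The "plug-and-play at inference time, no retraining" claim amounts to placing a small pre-trained scorer between the frozen vision encoder and the frozen LLM. The sketch below illustrates that wiring under stated assumptions; `vision_encoder`, `pruner`, and `llm` are generic stand-ins for whatever backbone the deployed VLA uses, and the interfaces are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class PrunedVLAPipeline(nn.Module):
    """Hedged sketch of plug-and-play token pruning at inference time."""

    def __init__(self, vision_encoder: nn.Module, pruner: nn.Module,
                 llm: nn.Module, keep_ratio: float = 0.5):
        super().__init__()
        self.vision_encoder = vision_encoder   # frozen, unchanged
        self.pruner = pruner                   # only new, lightweight component
        self.llm = llm                         # frozen, unchanged
        self.keep_ratio = keep_ratio

    @torch.no_grad()
    def forward(self, images: torch.Tensor, text_tokens: torch.Tensor):
        vis_tokens = self.vision_encoder(images)             # (B, N, D)
        scores = self.pruner(vis_tokens)                      # (B, N) significance
        k = max(1, int(vis_tokens.shape[1] * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices.sort(dim=1).values
        idx = idx.unsqueeze(-1).expand(-1, -1, vis_tokens.shape[-1])
        kept = vis_tokens.gather(1, idx)                      # (B, k, D)
        # Feed the reduced visual sequence plus the text tokens (B, M, D)
        # to the unchanged LLM; nothing upstream or downstream is retrained.
        return self.llm(torch.cat([kept, text_tokens], dim=1))
```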
Toward Production VLA! FastDriveVLA: A Plug-and-Play Pruning Module with Nearly 4x Inference Speedup
自动驾驶之心· 2025-08-23 16:03
Core Viewpoint
- The article discusses FastDriveVLA, a novel visual token pruning framework designed for autonomous driving, which achieves a 50% compression rate while maintaining 97.3% of performance [3][13][43].

Group 1: End-to-End Autonomous Driving
- Recent advances in end-to-end autonomous driving research have led to the adoption of vision-language-action (VLA) models, which outperform traditional modular approaches in complex scene understanding and decision-making [3][10].
- The VLA model integrates perception, action generation, and planning into a single framework, reducing information loss between modules [3][4].

Group 2: Visual Token Pruning Techniques
- Existing VLM/VLA models face high computational costs because they encode images into large numbers of visual tokens, prompting research into visual token pruning methods [4][11].
- The two main approaches to visual token pruning, attention-based methods and similarity-based methods, both have limitations in driving tasks [4][14].
- FastDriveVLA introduces a reconstruction-based visual token pruning framework that focuses on retaining tokens from foreground areas critical to driving decisions [5][13].

Group 3: FastDriveVLA Framework
- FastDriveVLA employs a plug-and-play pruner called ReconPruner, trained with a pixel-reconstruction task that emphasizes foreground information [6][17].
- The framework includes an adversarial foreground-background reconstruction strategy to enhance the model's ability to distinguish foreground tokens from background tokens [20][21].
- A large-scale dataset, nuScenes-FG, containing 241,000 image-mask pairs, was constructed to support effective foreground segmentation during ReconPruner training [6][12][13].

Group 4: Experimental Results
- FastDriveVLA achieved state-of-the-art results on the nuScenes open-loop planning benchmark, demonstrating its effectiveness and practicality (a hedged sketch of the benchmark's L2-error and collision-rate metrics follows below) [13][28].
- Evaluated under pruning ratios of 25%, 50%, and 75%, the framework consistently outperformed existing methods on key metrics such as L2 error and collision rate [30][34].
- Efficiency analysis showed that FastDriveVLA significantly reduces FLOPs and CUDA latency compared to other methods, improving real-time deployment capability [36][40].

Group 5: Contributions and Implications
- FastDriveVLA offers a new paradigm for efficient inference in VLA models and insights into task-specific token pruning strategies [43].
- The research highlights the importance of focusing on foreground information in autonomous driving tasks, which can improve performance while reducing computational cost [5][43].
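The summaries quote L2 error and collision rate as the key open-loop planning metrics without defining them. Below is a minimal sketch, assuming the common nuScenes protocol of a 3-second horizon reported at 1s/2s/3s; the collision check here is a simple distance threshold rather than the benchmark's box/occupancy-overlap test, so it is illustrative only.

```python
import numpy as np

def l2_error(pred_traj: np.ndarray, gt_traj: np.ndarray) -> np.ndarray:
    """Per-timestep L2 displacement error between predicted and ground-truth
    ego trajectories, each of shape (T, 2) in BEV meters."""
    return np.linalg.norm(pred_traj - gt_traj, axis=-1)   # (T,)

def collision_rate(pred_traj: np.ndarray, agents: np.ndarray,
                   radius: float = 1.0) -> float:
    """Simplified collision check: fraction of timesteps where the predicted
    ego position comes within `radius` meters of any agent center.
    agents: (T, K, 2) surrounding-agent positions."""
    dists = np.linalg.norm(agents - pred_traj[:, None, :], axis=-1)  # (T, K)
    return float((dists.min(axis=1) < radius).mean())

# Hypothetical usage on a 3-second, 6-step horizon (0.5 s per step)
pred = np.cumsum(np.full((6, 2), 0.5), axis=0)          # made-up prediction
gt = np.cumsum(np.full((6, 2), 0.55), axis=0)           # made-up ground truth
others = np.random.uniform(5.0, 20.0, size=(6, 4, 2))   # 4 nearby agents
print("L2 at 1s/2s/3s:", l2_error(pred, gt)[[1, 3, 5]])
print("collision rate:", collision_rate(pred, others))
```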
Autonomous Driving Paper Digest | Diffusion Models, Trajectory Prediction, TopoLiDM, VLA, and More
自动驾驶之心· 2025-08-05 03:09
Core Insights
- The article covers recent advances in trajectory prediction, including GALTraj, a generative active learning framework that applies controllable diffusion models to address long-tail issues in the data [1][2].

Group 1: GALTraj Framework
- GALTraj is the first framework to apply generative active learning to trajectory prediction, enhancing long-tail learning without modifying the model structure [2].
- The framework uses a tail-aware generation method that differentiates the diffusion guidance for tail, head, and related agents, producing realistic and diverse scenarios while preserving tail characteristics [2][3].

Group 2: Experimental Results
- In experiments on the WOMD and Argoverse 2 datasets, GALTraj significantly improved long-tail prediction performance, reducing the long-tail metric FPR₅ by 47.6% (from 0.42 to 0.22) and the overall prediction error minFDE₆ by 14.7% (from 0.654 to 0.558); these percentages follow directly from the before/after values, as the short check below shows [1][6].
- The results indicate that GALTraj outperforms traditional methods across various metrics, showcasing its effectiveness in improving prediction accuracy for rare scenarios [7][8].

Group 3: TopoLiDM Framework
- The article also highlights the TopoLiDM framework, developed by Shanghai Jiao Tong University and the University of Twente, which integrates topology-aware diffusion models for high-fidelity LiDAR point cloud generation [13][15].
- TopoLiDM achieved a 22.6% reduction in Fréchet Range Image Distance (FRID) and a 9.2% reduction in Minimum Matching Distance (MMD) on the KITTI-360 dataset while maintaining a real-time generation speed of 1.68 samples per second [13][15].

Group 4: FastDriveVLA Framework
- FastDriveVLA, developed by Peking University and Xiaopeng Motors, introduces a reconstruction-based visual token pruning framework that maintains 99.1% trajectory accuracy at a 50% pruning rate and reduces collision rates by 2.7% [21][22].
- The framework employs a novel adversarial foreground-background reconstruction strategy to improve the identification of valuable tokens, achieving state-of-the-art performance on the nuScenes open-loop planning benchmark [27][28].

Group 5: PLA Framework
- The article presents a unified Perception-Language-Action (PLA) framework proposed by TUM, which integrates multi-sensor fusion and GPT-4.1-enhanced vision-language-action reasoning for adaptive autonomous driving [34][35].
- In urban intersection scenarios, the framework achieved a mean absolute error (MAE) of 0.39 m/s in speed prediction and an average displacement error (ADE) of 1.013 meters in trajectory tracking [42].
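For the GALTraj numbers in Group 2, the percentage reductions are simple relative changes computed from the reported before/after values; the quick check below reproduces them.

```python
# Relative reduction = (before - after) / before, using the figures quoted above.
pairs = {
    "FPR5 (long-tail metric)": (0.42, 0.22),
    "minFDE6 (overall error)": (0.654, 0.558),
}
for name, (before, after) in pairs.items():
    print(f"{name}: {(before - after) / before:.1%} reduction")
# -> 47.6% and 14.7%, matching the reported figures.
```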
Toward a Production VLA Solution! FastDriveVLA: A Plug-and-Play Pruning Module with Nearly 4x Inference Speedup (Peking University & XPENG)
自动驾驶之心· 2025-08-04 23:33
Core Viewpoint
- The article discusses FastDriveVLA, a novel framework for visual token pruning in autonomous driving, which achieves a 50% compression rate while maintaining 97.3% of performance [2][3][43].

Group 1: End-to-End Autonomous Driving
- Recent advances in end-to-end autonomous driving have led to methods that complete perception through planning within a single model, reducing information loss between modules [3].
- The introduction of vision-language-action (VLA) models enhances decision-making in complex scenarios, making them increasingly popular in autonomous driving systems [3][10].

Group 2: Visual Token Pruning
- Existing VLM/VLA models encode images into numerous visual tokens, resulting in high computational costs; current research explores two main directions for visual token pruning, attention-based methods and similarity-based methods [4][14].
- FastDriveVLA proposes a reconstruction-based visual token pruning framework that focuses on retaining tokens related to foreground information, significantly reducing computational costs while maintaining performance [5][13].

Group 3: FastDriveVLA Framework
- FastDriveVLA includes a plug-and-play pruner called ReconPruner, trained with a pixel-reconstruction task so that it focuses on foreground areas and assigns higher significance scores to key tokens [6][17].
- The framework uses the large-scale nuScenes-FG dataset, containing 241,000 image-mask pairs, for training, enhancing the model's ability to distinguish foreground from background [6][12].

Group 4: Experimental Results
- FastDriveVLA achieved state-of-the-art results on the nuScenes open-loop planning benchmark, demonstrating its effectiveness and practicality [13][34].
- The framework shows superior performance compared to existing methods, with improvements in L2 error and collision rate at various pruning ratios [30][34].

Group 5: Efficiency Analysis
- FastDriveVLA reduces FLOPs by approximately 7.5x and lowers prefill and decode latencies, enhancing inference efficiency for real-time deployment (a hedged latency-measurement sketch follows below) [36][40].
- The lightweight design of ReconPruner yields lower CUDA latency than several comparable methods, making it suitable for practical applications [36][40].
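The digest repeatedly quotes prefill/decode latency speedups; one common way such numbers are measured is with CUDA events around the prefill pass and the per-token decode loop. The helper below is a generic measurement sketch, not the authors' benchmarking code, and the commented-out usage (`model.prefill`, the token tensors) is a hypothetical stand-in for whatever interface a deployed VLA exposes.

```python
import torch

def time_cuda(fn, *args, warmup: int = 3, iters: int = 10) -> float:
    """Average CUDA wall time of fn(*args) in milliseconds, using CUDA events."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Hypothetical usage: compare prefill latency with full vs. pruned visual tokens.
# `model.prefill` and the token tensors are assumptions, not FastDriveVLA code.
# full_ms   = time_cuda(model.prefill, full_tokens)    # e.g. 3,249 visual tokens
# pruned_ms = time_cuda(model.prefill, pruned_tokens)  # e.g. 812 visual tokens
# print(f"prefill speedup: {full_ms / pruned_ms:.1f}x")
```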