FastDriveVLA

Toward Production VLA! FastDriveVLA: A Plug-and-Play Pruning Module with Nearly 4x Inference Speedup
自动驾驶之心· 2025-08-23 16:03
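Today 自动驾驶之心 shares the latest work from Peking University and XPeng Motors: FastDriveVLA, an adversarial visual token pruning method that retains 97.3% of performance at a 50% compression rate. Paper authors | Jiajun Cao et al. Editor | 自动驾驶之心
Preface and the author's understanding: In recent years, end-to-end autonomous driving research has advanced rapidly, and companies are promoting their own end-to-end solutions in full swing. Unlike the traditional modular pipeline (perception → prediction → planning), end-to-end methods carry out the whole process from perception to planning within a single model, which reduces information loss between modules and, in a sense, simplifies the system architecture. Progress has not stopped there. As vision-language models (VLMs) have shown striking reasoning ability on visual question answering tasks, many researchers and algorithm teams have begun extending them to embodied intelligence and autonomous driving, adding action generation to form vision-language-action (VLA) models. Compared with the traditional modular pipeline, VLA models in complex scene understanding and ...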
Autonomous Driving Paper Digest | Diffusion Models, Trajectory Prediction, TopoLiDM, VLA, and More
自动驾驶之心· 2025-08-05 03:09
Core Insights
- The article discusses advances in trajectory prediction through GALTraj, a generative active learning framework that applies controllable diffusion models to the long-tail problem in driving data [1][2].
Group 1: GALTraj Framework
- GALTraj is the first framework to apply generative active learning to trajectory prediction, improving long-tail learning without modifying the model structure [2].
- The framework uses a tail-aware generation method that applies different diffusion guidance to tail, head, and related agents, producing realistic and diverse scenarios while preserving tail characteristics [2][3].
Group 2: Experimental Results
- On the WOMD and Argoverse2 datasets, GALTraj improved prediction on long-tail samples, reducing the long-tail metric FPR₅ by 47.6% (from 0.42 to 0.22) and the overall prediction error minFDE₆ by 14.7% (from 0.654 to 0.558) [1][6].
- GALTraj outperforms traditional methods across the reported metrics, indicating better prediction accuracy in rare scenarios [7][8].
Group 3: TopoLiDM Framework
- The article also highlights TopoLiDM, developed by Shanghai Jiao Tong University and the University of Twente, which integrates topology-aware diffusion models for high-fidelity LiDAR point cloud generation [13][15].
- On the KITTI-360 dataset, TopoLiDM reduced the Fréchet Range Image Distance (FRID) by 22.6% and the Minimum Matching Distance (MMD) by 9.2% while maintaining a real-time generation speed of 1.68 samples per second [13][15].
Group 4: FastDriveVLA Framework
- FastDriveVLA, developed by Peking University and XPeng Motors, is a reconstruction-based visual token pruning framework that maintains 99.1% trajectory accuracy at a 50% pruning rate and reduces collision rates by 2.7% [21][22].
- The framework uses a novel adversarial foreground-background reconstruction strategy to better identify valuable tokens, achieving state-of-the-art performance on the nuScenes open-loop planning benchmark [27][28].
Group 5: PLA Framework
- The article presents a unified Perception-Language-Action (PLA) framework proposed by TUM, which combines multi-sensor fusion with GPT-4.1-enhanced vision-language-action reasoning for adaptive autonomous driving [34][35].
- In urban intersection scenarios, the framework reported a mean absolute error (MAE) of 0.39 m/s in speed prediction and an average displacement error (ADE) of 1.013 meters in trajectory tracking [42].
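The displacement metrics and percentage reductions quoted in this summary can be reproduced with a few lines of arithmetic. The sketch below is a minimal illustration that assumes the common textbook definitions (ADE as the mean Euclidean distance over waypoints, FDE as the distance at the final waypoint, minFDE as the best FDE over K candidate trajectories). The cited papers may differ in detail, and the function names are hypothetical rather than taken from any of the works above.

```python
import math


def displacement_errors(pred, gt):
    """Return (ADE, FDE) in meters for one predicted trajectory.

    pred, gt: equal-length, non-empty lists of (x, y) waypoints.
    Assumed definitions: ADE is the mean point-wise Euclidean distance,
    FDE is the distance at the last waypoint.
    """
    if not pred or len(pred) != len(gt):
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]


def min_fde(candidates, gt):
    """Smallest FDE among K candidate trajectories (minFDE_K)."""
    return min(displacement_errors(c, gt)[1] for c in candidates)


def relative_reduction(before, after):
    """Fractional reduction of a lower-is-better metric."""
    return (before - after) / before


if __name__ == "__main__":
    gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    print(displacement_errors(pred, gt))
    # Reductions quoted in the summary for FPR5 and minFDE6:
    print(round(100 * relative_reduction(0.42, 0.22), 1))
    print(round(100 * relative_reduction(0.654, 0.558), 1))
```

The two `relative_reduction` calls check that the quoted before/after values are consistent with the stated 47.6% and 14.7% reductions.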
A Production-Oriented VLA Solution! FastDriveVLA: A Plug-and-Play Pruning Module with Nearly 4x Inference Speedup (PKU & XPeng)
自动驾驶之心· 2025-08-04 23:33
Core Viewpoint
- The article discusses FastDriveVLA, a novel visual token pruning framework for autonomous driving that retains 97.3% of performance at a 50% compression rate [2][3][43].
Group 1: End-to-End Autonomous Driving
- Recent end-to-end autonomous driving research completes the whole process from perception to planning in a single model, reducing information loss between modules [3].
- Vision-Language-Action (VLA) models improve decision-making in complex scenarios and are becoming increasingly popular in autonomous driving systems [3][10].
Group 2: Visual Token Pruning
- Existing VLM/VLA models encode images into a large number of visual tokens, which makes computation expensive. Current work on visual token pruning follows two main directions: attention-based methods and similarity-based methods [4][14].
- FastDriveVLA proposes a reconstruction-based visual token pruning framework that retains the tokens carrying foreground information, significantly reducing computational cost while maintaining performance [5][13].
Group 3: FastDriveVLA Framework
- FastDriveVLA includes a plug-and-play pruner called ReconPruner, trained on a pixel reconstruction task so that it focuses on foreground areas and assigns higher saliency scores to key tokens [6][17].
- Training uses a large-scale dataset, nuScenes-FG, containing 241,000 image-mask pairs, which improves the model's ability to distinguish foreground from background [6][12].
Group 4: Experimental Results
- FastDriveVLA achieved state-of-the-art results on the nuScenes open-loop planning benchmark, demonstrating its effectiveness and practicality [13][34].
- The framework outperforms existing methods, with better L2 error and collision rates at various pruning ratios [30][34].
Group 5: Efficiency Analysis
- FastDriveVLA reduces FLOPs by approximately 7.5 times and lowers prefill and decode latencies, improving inference efficiency for real-time deployment [36][40].
- The lightweight design of ReconPruner gives it lower CUDA latency than several comparable methods, making it suitable for practical applications [36][40].
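The selection step that score-based visual token pruning relies on can be sketched as follows. This is a minimal illustration, not the ReconPruner implementation: it assumes that some scorer has already assigned one saliency score per visual token, and the function name `prune_tokens`, the `keep_ratio` parameter, and the toy scores are all hypothetical. Tokens are represented here as strings for readability; in a real model they would be embedding vectors.

```python
def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of tokens, preserving input order.

    tokens: list of visual tokens (any objects).
    scores: one saliency score per token (assumed to come from a pruner).
    keep_ratio: fraction of tokens to retain, in (0, 1].
    Returns (kept_tokens, kept_indices).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore the original order so positional information stays consistent.
    kept = sorted(top)
    return [tokens[i] for i in kept], kept


if __name__ == "__main__":
    # Toy example: foreground-like tokens receive higher (made-up) scores.
    tokens = ["sky", "car", "road", "pedestrian", "building", "cyclist"]
    scores = [0.05, 0.90, 0.40, 0.95, 0.10, 0.80]
    print(prune_tokens(tokens, scores, keep_ratio=0.5))
```

With `keep_ratio=0.5` the language model processes half as many visual tokens, which is where the compute savings described above would come from. How the scores are produced is the part the summarized work contributes and is not reproduced here.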