LightVLA
Beyond training-free pruning: LightVLA introduces differentiable token pruning, achieving the first simultaneous breakthrough in both performance and efficiency for VLA models
机器之心· 2025-09-23 04:08
Core Insights
- The article introduces LightVLA, a framework designed to improve the inference efficiency and performance of Vision-Language-Action (VLA) models, addressing the high computational cost and inference latency that limit their deployment in applications such as home robotics [5][9][33]
- LightVLA employs two core innovations: a differentiable visual token pruning framework and a learnable query-based token selection mechanism, allowing the model to adaptively focus on the most informative visual tokens [5][8][33]
Innovation Highlights
- LightVLA identifies and prunes redundant visual tokens in VLA models, using a Gumbel-softmax-guided process for token selection, which strengthens the model's ability to pick out critical visual tokens and accelerates inference [5][6][8]
- The framework achieves state-of-the-art (SOTA) performance on the LIBERO benchmark, surpassing conventional VLA models while delivering efficient inference acceleration [6][29]
Research Motivation and Challenges
- The research is motivated by the inherent redundancy of visual tokens in VLA models, which creates computational bottlenecks and degrades performance [9][33]
- Traditional pruning methods typically face a trade-off between efficiency and performance, calling for smarter pruning techniques that let the model focus on relevant information [9][33]
Methodology Overview
- LightVLA uses a set of query tokens to assess the importance of visual tokens, employing a differentiable pruning algorithm that lets the model learn which tokens to retain based on their contribution to task performance (a minimal PyTorch sketch follows this summary) [16][19][30]
- The architecture eliminates the need for heuristic hyperparameter settings, enabling adaptive token selection during fine-tuning [15][19]
Experimental Results
- LightVLA achieves an average success rate of 97.4% across all tasks in the LIBERO benchmark, outperforming a range of strong baseline models while retaining a much smaller number of visual tokens (78 on average) [29][30]
- The framework reduces FLOPs and latency by 59.1% and 38.2%, respectively, while simultaneously improving performance, making it the only acceleration method that improves both efficiency and effectiveness [29][30]
Conclusion
- The research presents LightVLA as a novel solution to the visual redundancy problem in VLA models, achieving superior performance at lower computational cost and latency, paving the way for lightweight, deployable VLA models in practical applications [33]
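The query-based, Gumbel-softmax-guided selection summarized above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumptions, not the paper's implementation: the tensor shapes, the fixed number of queries, and the helper name select_visual_tokens are hypothetical, and the actual framework reportedly decides how many tokens to retain adaptively rather than fixing the count up front.

```python
import torch
import torch.nn.functional as F

def select_visual_tokens(visual_tokens, query_tokens, tau=1.0, training=True):
    """Hypothetical sketch of differentiable, query-based visual token selection.

    visual_tokens: (B, N, D) visual token embeddings from the vision encoder
    query_tokens:  (Q, D)    learnable queries, with Q much smaller than N
    Returns a (B, Q, D) tensor of retained tokens.
    """
    # Each query scores every visual token: (B, Q, N) importance logits.
    scores = torch.einsum("qd,bnd->bqn", query_tokens, visual_tokens)

    if training:
        # Gumbel-softmax turns the per-query choice into a hard one-hot sample
        # that still lets gradients flow into the queries and the backbone,
        # so the selection is learned directly from the task loss.
        weights = F.gumbel_softmax(scores, tau=tau, hard=True, dim=-1)
    else:
        # At inference time, selection collapses to a plain arg-max per query.
        weights = F.one_hot(scores.argmax(dim=-1), scores.size(-1)).to(scores.dtype)

    # Gather the selected tokens: (B, Q, D).
    return torch.einsum("bqn,bnd->bqd", weights, visual_tokens)
```

In use, the retained (B, Q, D) tokens would stand in for the full visual token sequence fed to the language backbone, which is where the attention-cost savings come from.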
Li Auto releases a VLA model optimization framework for robotics
理想TOP2· 2025-09-21 15:08
On September 16, 2025, Li Auto released the paper The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning. The corresponding author is Xianpeng Lang of Li Auto; Titong Jiang and Xuefeng Jiang are co-first authors, and Li Auto's Yuan Ma is the project lead. Li Auto is the first affiliation, the School of Vehicle and Mobility at Tsinghua University is the second, and the Institute of Computing Technology, Chinese Academy of Sciences is the third.

The paper presents LightVLA, the first adaptive visual token pruning framework that simultaneously improves both the task success rate and the runtime efficiency of robotic VLA models.

Li Auto's key move is to recast token pruning from a compression task that trades away performance into a purely performance-driven optimization task. To maximize task success during learning, the model spontaneously learns to prune visual tokens that contribute nothing to the task or even act as distracting noise, improving performance while naturally achieving large gains in computational efficiency (see the training-step sketch after this excerpt).

Token Selection: during training, with the help of Gumbel-softmax ...
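The "purely performance-driven" framing above can be made concrete with a fine-tuning step in which nothing but the task loss shapes the selection. The sketch below assumes the select_visual_tokens helper from the earlier snippet; vla_backbone, action_head, and the batch keys are hypothetical stand-ins, since the excerpt does not specify Li Auto's actual modules or loss.

```python
import torch
import torch.nn.functional as F

def training_step(vla_backbone, action_head, query_tokens, batch, optimizer):
    # Encode camera input into visual tokens (shapes as in the earlier sketch).
    visual_tokens = vla_backbone.encode_vision(batch["images"])        # (B, N, D)

    # Differentiable, query-based pruning; no fixed keep-ratio hyperparameter.
    kept_tokens = select_visual_tokens(visual_tokens, query_tokens, training=True)

    # Run the shortened visual sequence together with the language instruction.
    hidden = vla_backbone(kept_tokens, batch["instruction_ids"])
    pred_actions = action_head(hidden)

    # The only objective is the task loss: the queries are rewarded for keeping
    # tokens that help the policy and for dropping useless or noisy ones, so
    # efficiency emerges as a by-product of chasing task success.
    loss = F.mse_loss(pred_actions, batch["actions"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```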
Tsinghua and Li Auto jointly propose LightVLA: pruning redundant tokens for a 38% inference speedup!
具身智能之心· 2025-09-18 00:03
Author: Titong Jiang et al.

Research Background and Core Challenges

Vision-Language-Action (VLA) models are a core technology for robotic embodied intelligence: they translate visual information and language instructions directly into executable robot actions and have shown strong capability in complex manipulation such as object grasping and long-horizon planning. But these models face a key bottleneck: computational redundancy in the visual tokens. A VLA model typically has to process hundreds of visual tokens (OpenVLA-OFT, for example, uses 512), and the computational complexity of the attention mechanism grows quadratically with the number of tokens, making real-time deployment on edge devices (such as home robots and autonomous-driving platforms) difficult (a back-of-envelope cost sketch follows this excerpt).

Existing optimization approaches have clear limitations:
1. Efficiency-performance trade-off: most token pruning methods (e.g., EfficientVLA, VLA-Cache) keep a fixed number of tokens to gain efficiency, which discards key semantic information and ultimately sacrifices performance;
2. VLM pruning approaches are not ...
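To make the quadratic-cost claim above tangible, here is a back-of-envelope calculation, not taken from the paper: it counts only the self-attention multiply-accumulates over the visual tokens in a single layer with an assumed hidden size, ignores language and action tokens, and therefore illustrates the scaling rather than reproducing the reported 59.1% end-to-end FLOPs reduction.

```python
# Rough per-layer self-attention cost: QKV + output projections (4*N*d^2 MACs)
# plus the two N x N matmuls (2*N^2*d MACs); multiply-adds counted once.
def attention_macs(n_tokens: int, d_model: int) -> int:
    return 4 * n_tokens * d_model**2 + 2 * n_tokens**2 * d_model

d = 1024                           # assumed hidden size, for illustration only
full = attention_macs(512, d)      # OpenVLA-OFT-scale visual token budget
pruned = attention_macs(78, d)     # LightVLA's reported average of 78 tokens

print(f"per-layer attention cost: {full / 1e9:.2f}G -> {pruned / 1e9:.2f}G MACs "
      f"({100 * (1 - pruned / full):.1f}% lower on this term alone)")
```

Because the N^2 term dominates at hundreds of tokens, shrinking the visual sequence from 512 to about 78 tokens cuts this attention term far more than linearly, which is why token pruning is such an effective lever for VLA inference cost.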