Beyond training-free pruning: LightVLA introduces differentiable token pruning, achieving the first dual breakthrough in VLA model performance and efficiency
机器之心 · 2025-09-23 04:08
Core Insights
- The article introduces LightVLA, a framework designed to improve the inference efficiency and performance of Vision-Language-Action (VLA) models, addressing the high computational cost and inference latency that limit their deployment in applications such as home robotics [5][9][33]
- LightVLA rests on two core innovations: a differentiable visual-token pruning framework and a learnable query-based token selection mechanism, which together let the model adaptively focus on the most informative visual tokens [5][8][33]

Innovation Highlights
- LightVLA identifies and prunes redundant visual tokens in VLA models, using a Gumbel-softmax-guided selection process that sharpens the model's ability to pick out critical visual tokens while accelerating inference [5][6][8]
- The framework achieves state-of-the-art (SOTA) performance on the LIBERO benchmark, surpassing conventional VLA models while delivering efficient inference acceleration [6][29]

Research Motivation and Challenges
- The work is motivated by the inherent redundancy of visual tokens in VLA models, which creates computational bottlenecks and degrades performance [9][33]
- Traditional pruning methods face a trade-off between efficiency and performance, which calls for smarter pruning techniques that let the model concentrate on relevant information [9][33]

Methodology Overview
- LightVLA uses a set of query tokens to assess the importance of visual tokens, with a differentiable pruning algorithm through which the model learns which tokens to retain based on their contribution to task performance (a minimal code sketch follows the Conclusion) [16][19][30]
- The architecture removes the need for heuristic hyperparameter settings, enabling adaptive token selection during the fine-tuning process [15][19]

Experimental Results
- LightVLA achieves an average success rate of 97.4% across all tasks in the LIBERO benchmark, outperforming a range of strong baselines while retaining far fewer visual tokens (78 on average) [29][30]
- The framework reduces FLOPs by 59.1% and latency by 38.2% while simultaneously improving performance, making it the only acceleration method in the comparison that improves both efficiency and effectiveness [29][30]

Conclusion
- The research presents LightVLA as a novel answer to visual redundancy in VLA models, achieving superior performance at reduced computational cost and latency, and paving the way for lightweight, deployable VLA models in practical applications [33]
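To make the selection mechanism concrete, below is a minimal PyTorch sketch of query-based visual-token selection with a Gumbel-softmax, written only from the description above. The class and parameter names (`TokenPruner`, `num_queries`, `tau`) and the mask-instead-of-gather simplification are illustrative assumptions, not LightVLA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenPruner(nn.Module):
    """Scores visual tokens with learnable queries and keeps a subset via a
    Gumbel-softmax, so the keep/drop decision stays differentiable.
    This is a sketch of the general technique, not LightVLA's code."""

    def __init__(self, dim: int, num_queries: int = 64, tau: float = 1.0):
        super().__init__()
        # Learnable query tokens; each query "votes" for one visual token.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * dim ** -0.5)
        self.tau = tau  # Gumbel-softmax temperature

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, dim) visual tokens from the vision encoder.
        logits = self.queries @ tokens.transpose(1, 2)  # (B, Q, N) importance scores
        if self.training:
            # Hard one-hot sample per query with straight-through gradients,
            # so token selection is trained end to end by the task loss.
            sample = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=-1)
        else:
            # At inference, each query deterministically picks its top token.
            index = logits.argmax(dim=-1, keepdim=True)
            sample = torch.zeros_like(logits).scatter_(-1, index, 1.0)
        # A token is kept if at least one query selected it.
        keep_mask = sample.amax(dim=1)  # (B, N), entries in {0, 1}
        # Masking keeps the sketch simple; a real implementation would gather
        # the kept tokens into a shorter sequence to realize the speedup.
        return tokens * keep_mask.unsqueeze(-1), keep_mask


pruner = TokenPruner(dim=768, num_queries=64)
visual_tokens = torch.randn(2, 256, 768)
pruned, mask = pruner(visual_tokens)
print(mask.sum(dim=-1))  # retained tokens per sample (at most 64 in this sketch)
```

Because several queries can converge on the same token, the number of retained tokens varies with the input rather than being a preset ratio, which is consistent with the article's point about avoiding heuristic hyperparameters; note, though, that in this simplified sketch `num_queries` still caps the token budget.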