Differentiable Token Pruning
LightVLA: Your VLA Really Can Be Both Strong and Fast!
具身智能之心· 2025-10-14 00:02
Core Insights
- LightVLA is an innovative differentiable token pruning framework designed for vision-language-action (VLA) models, enabling them to focus on critical visual information while significantly reducing computational cost and improving performance [2][8].

Group 1: LightVLA Overview
- LightVLA addresses the computational challenges that VLA models face on resource-constrained platforms by implementing adaptive, performance-driven visual token pruning [2].
- The framework generates dynamic queries to assess the importance of visual tokens and employs Gumbel softmax for differentiable token selection, retaining the most informative tokens while discarding irrelevant ones [2][3].

Group 2: Performance Metrics
- Experimental results indicate that LightVLA outperforms various VLA models and existing token pruning methods across multiple tasks in the LIBERO benchmark, achieving a 59.1% reduction in computation (FLOPs) and a 38.2% decrease in latency while increasing the task success rate by 2.6% [3][8].
- LightVLA reaches a 97.4% success rate, marking a significant improvement in both efficiency and performance [8].

Group 3: Research Significance
- LightVLA is the first framework to apply adaptive visual token pruning to VLA tasks while simultaneously optimizing efficiency and performance, a critical step toward efficient, powerful, and practical real-time robotic systems [3].
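The Gumbel-softmax selection mentioned above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name, the toy importance logits, and the single-token selection are all illustrative. The idea is that adding Gumbel noise to importance logits and softmaxing yields a differentiable relaxation of a discrete choice, while a hard one-hot argmax can be used in the forward pass (the "straight-through" style that an autodiff framework would pair with soft gradients).

```python
import numpy as np

def gumbel_softmax_select(logits, tau=1.0, rng=None):
    """Differentiable hard selection via the Gumbel-softmax trick.

    Adds Gumbel(0, 1) noise to importance logits, applies a
    temperature-scaled softmax, then takes a hard one-hot argmax.
    In an autodiff framework, gradients would flow through the soft
    probabilities while the forward pass uses the hard choice.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y_soft = np.exp((logits + gumbel) / tau)
    y_soft /= y_soft.sum()          # softmax over all candidate tokens
    # Hard one-hot selection for the forward pass
    y_hard = np.zeros_like(y_soft)
    y_hard[np.argmax(y_soft)] = 1.0
    return y_hard, y_soft

# Toy importance logits for 6 visual tokens: selection is stochastic
# but biased toward tokens with higher logits.
logits = np.array([0.1, 2.5, 0.3, 1.8, 0.2, 0.1])
hard, soft = gumbel_softmax_select(logits, tau=0.5)
print(hard)   # one-hot mask: exactly one token selected
```

Lowering `tau` sharpens the soft distribution toward the hard choice; raising it encourages exploration during training.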
Beyond Training-Free Pruning: LightVLA Introduces Differentiable Token Pruning, Achieving the First Simultaneous Breakthrough in VLA Model Performance and Efficiency
机器之心· 2025-09-23 04:08
Core Insights
- The article introduces LightVLA, a framework designed to improve the inference efficiency and performance of vision-language-action (VLA) models, addressing the high computational costs and inference delays that limit their deployment in applications such as home robotics [5][9][33].
- LightVLA rests on two core innovations: a differentiable visual token pruning framework and a learnable query-based token selection mechanism, which together let the model adaptively focus on the most informative visual tokens [5][8][33].

Innovation Highlights
- LightVLA identifies and prunes redundant visual tokens in VLA models using a Gumbel-softmax-guided selection process, which improves the model's ability to choose critical visual tokens and accelerates inference [5][6][8].
- The framework achieves state-of-the-art (SOTA) performance on the LIBERO benchmark, surpassing traditional VLA models while delivering efficient inference acceleration [6][29].

Research Motivation and Challenges
- The work is motivated by the inherent redundancy of visual tokens in VLA models, which creates computational bottlenecks and degrades performance [9][33].
- Traditional pruning methods face a trade-off between efficiency and performance, motivating smarter pruning techniques that let the model focus on relevant information [9][33].

Methodology Overview
- LightVLA uses a series of query tokens to assess the importance of visual tokens, with a differentiable pruning algorithm that lets the model learn which tokens to retain based on their contribution to task performance [16][19][30].
- The architecture eliminates heuristic hyperparameter settings, enabling adaptive token selection during the fine-tuning process [15][19].

Experimental Results
- LightVLA achieves an average success rate of 97.4% across all tasks in the LIBERO benchmark, outperforming several strong baselines while keeping a significantly lower number of visual tokens (78 on average) [29][30].
- The framework reduces FLOPs by 59.1% and latency by 38.2% while simultaneously improving performance, making it the only acceleration method that improves both efficiency and effectiveness [29][30].

Conclusion
- The research presents LightVLA as a novel solution to visual redundancy in VLA models, achieving superior performance with reduced computational cost and latency, and paving the way for lightweight, deployable VLA models in practical applications [33].
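The query-based scoring described in the methodology can be sketched as follows. This is a simplified stand-in, not the paper's code: the dot-product attention scoring, the max-over-queries importance rule, and the fixed `keep=78` budget (LightVLA learns the retained count adaptively; 78 is only the average reported on LIBERO) are all assumptions made for illustration.

```python
import numpy as np

def score_and_prune(queries, tokens, keep=78):
    """Score visual tokens with query tokens, then retain the top-k.

    Each query attends over all visual tokens via scaled dot-product
    attention; a token's importance is its maximum attention weight
    across queries. Only the `keep` highest-scoring tokens survive,
    preserving their original order.
    """
    d = queries.shape[-1]
    attn = queries @ tokens.T / np.sqrt(d)            # (n_query, n_token)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # softmax per query
    importance = attn.max(axis=0)                     # (n_token,)
    kept = np.argsort(importance)[-keep:]             # indices of top-k
    return tokens[np.sort(kept)], importance

rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 64))   # e.g. 256 visual tokens, dim 64
queries = rng.standard_normal((16, 64))   # 16 query tokens (illustrative)
pruned, imp = score_and_prune(queries, tokens, keep=78)
print(pruned.shape)   # (78, 64): roughly a 70% reduction in tokens
```

Dropping 256 tokens to 78 shrinks the quadratic attention cost over visual tokens by roughly an order of magnitude, which is consistent in spirit with the FLOPs and latency reductions reported above.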