LightVLA
LightVLA: Your VLA really can be both powerful and fast!
具身智能之心· 2025-10-14 00:02
Core Insights
- LightVLA is an innovative differentiable token pruning framework designed for vision-language-action (VLA) models, enabling them to focus on critical visual information while significantly reducing computational costs and improving performance [2][8].

Group 1: LightVLA Overview
- LightVLA addresses the computational challenges faced by VLA models on resource-constrained platforms by implementing adaptive, performance-driven visual token pruning [2].
- The framework generates dynamic queries to assess the importance of visual tokens and employs Gumbel softmax for differentiable token selection, retaining the most informative tokens while discarding irrelevant ones (the generic form of this relaxation is written out below) [2][3].

Group 2: Performance Metrics
- Experimental results indicate that LightVLA outperforms various VLA models and existing token pruning methods across multiple tasks in the LIBERO benchmark, achieving a 59.1% reduction in computational load (FLOPs) and a 38.2% decrease in latency while increasing the task success rate by 2.6% [3][8].
- LightVLA reaches a 97.4% success rate, a significant simultaneous gain in efficiency and performance [8].

Group 3: Research Significance
- LightVLA is the first framework to apply adaptive visual token pruning to VLA tasks while simultaneously optimizing efficiency and performance, a critical advance toward efficient, powerful, and practical real-time robotic systems [3].
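For reference, the generic Gumbel-softmax relaxation that this kind of differentiable selection builds on is shown below; this is the standard textbook form, not necessarily the paper's exact variant:

```latex
y_i = \frac{\exp\!\big((s_i + g_i)/\tau\big)}{\sum_{j=1}^{N} \exp\!\big((s_j + g_j)/\tau\big)},
\qquad
g_i = -\log(-\log u_i), \quad u_i \sim \mathrm{Uniform}(0, 1)
```

Here $s_i$ is the importance score of visual token $i$, $g_i$ is i.i.d. Gumbel(0,1) noise, and $\tau$ is a temperature: as $\tau \to 0$ the weights $y_i$ approach a hard one-hot choice, while for $\tau > 0$ they remain differentiable, which is what allows the pruning decision to be trained end to end.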
Beyond training-free pruning: LightVLA introduces differentiable token pruning, achieving the first dual breakthrough in VLA model performance and efficiency
机器之心· 2025-09-23 04:08
Core Insights
- The article introduces LightVLA, a framework designed to enhance the inference efficiency and performance of Vision-Language-Action (VLA) models, addressing the high computational costs and inference delays that limit their deployment in applications like home robotics [5][9][33]
- LightVLA employs two core innovations: a differentiable visual token pruning framework and a learnable query-based token selection mechanism, allowing the model to adaptively focus on the most informative visual tokens [5][8][33]

Innovation Highlights
- LightVLA identifies and prunes redundant visual tokens in VLA models, using a Gumbel-softmax-guided process for token selection, which improves the model's ability to pick out critical visual tokens and accelerates inference (a minimal implementation sketch follows this summary) [5][6][8]
- The framework demonstrates state-of-the-art (SOTA) performance on the LIBERO benchmark, surpassing traditional VLA models while achieving efficient inference acceleration [6][29]

Research Motivation and Challenges
- The motivation for the research stems from the inherent redundancy of visual tokens in VLA models, which creates computational bottlenecks and degrades performance [9][33]
- Traditional pruning methods often face a trade-off between efficiency and performance, necessitating smarter pruning techniques that let the model focus on relevant information [9][33]

Methodology Overview
- LightVLA uses a series of query tokens to assess the importance of visual tokens, employing a differentiable pruning algorithm that lets the model learn which tokens to retain based on their contribution to task performance [16][19][30]
- The architecture eliminates the need for heuristic hyperparameter settings, enabling adaptive token selection during fine-tuning [15][19]

Experimental Results
- LightVLA achieves an average success rate of 97.4% across all tasks in the LIBERO benchmark, outperforming various strong baseline models while keeping a significantly lower number of visual tokens (78 on average) [29][30]
- The framework reduces FLOPs and latency by 59.1% and 38.2%, respectively, while simultaneously improving performance, making it the only acceleration method that enhances both efficiency and effectiveness [29][30]

Conclusion
- The research presents LightVLA as a novel solution to the visual redundancy challenge in VLA models, achieving superior performance with reduced computational cost and delay, and paving the way for lightweight, deployable VLA models in practical applications [33]
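As a concrete illustration of the query-score-select mechanism described above, here is a minimal PyTorch-style sketch using a straight-through Gumbel-softmax. All names and design details (mean-pooled parameter-free queries, dot-product scoring, a fixed keep count) are illustrative assumptions, not the authors' code; LightVLA's actual implementation selects an adaptive number of tokens and may generate queries differently.

```python
import torch
import torch.nn.functional as F

def select_tokens(visual_tokens: torch.Tensor, keep: int, tau: float = 1.0):
    """Differentiably keep the `keep` highest-scoring visual tokens.

    visual_tokens: (B, N, D) visual token embeddings.
    Assumption: queries come parameter-free from mean-pooling the tokens
    themselves; the paper's query generation may differ.
    """
    # 1) Query generation: one pooled query per sample, shape (B, D).
    query = visual_tokens.mean(dim=1)

    # 2) Token scoring: dot product between the query and every token, (B, N).
    scores = torch.einsum("bd,bnd->bn", query, visual_tokens)

    # 3) Token selection: perturb scores with Gumbel noise, take a hard
    #    top-k in the forward pass, and route gradients through the soft
    #    weights (straight-through estimator).
    u = torch.rand_like(scores).clamp_(1e-9, 1.0 - 1e-9)
    gumbel = -torch.log(-torch.log(u))
    soft = F.softmax((scores + gumbel) / tau, dim=-1)        # (B, N)
    idx = soft.topk(keep, dim=-1).indices                    # (B, keep)
    hard = torch.zeros_like(soft).scatter(-1, idx, 1.0)
    mask = hard + soft - soft.detach()                       # hard fwd, soft bwd

    # Weight tokens by the mask, then gather only the kept ones for the
    # downstream LLM backbone.
    weighted = visual_tokens * mask.unsqueeze(-1)
    kept = torch.gather(
        weighted, 1, idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    )
    return kept, idx

# Example: prune 512 visual tokens down to 78, as in the reported setup.
tokens = torch.randn(2, 512, 768, requires_grad=True)
kept, idx = select_tokens(tokens, keep=78)
print(kept.shape)  # torch.Size([2, 78, 768])
```

The straight-through trick (`hard + soft - soft.detach()`) yields a discrete keep/drop decision in the forward pass while backpropagating through the soft weights, so the selector can be trained by the ordinary task loss rather than a separate pruning objective.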
Li Auto releases a VLA model optimization framework for robotics
理想TOP2· 2025-09-21 15:08
Core Viewpoint - The article discusses the introduction of LightVLA, a novel adaptive visual token pruning framework that improves both the success rate and the operational efficiency of robot VLA models, addressing the challenges traditional models face in real-world applications [2][3].

Group 1: Technology Framework
- LightVLA operates through three main steps: Query Generation, Token Scoring, and Token Selection, allowing dynamic, parameter-free generation of token queries based on the importance of visual information [5].
- The framework uses Gumbel-softmax sampling to make the token selection process differentiable, enabling end-to-end learning and optimization [5].
- In benchmark tests, LightVLA improved the average task success rate from 94.5% to 97.4%, reduced floating-point operations (FLOPs) by 59.1%, and cut end-to-end latency by 38.2% (from 34 ms to 21 ms) compared with OpenVLA-OFT (a quick consistency check on these figures follows below) [5].

Group 2: Performance and Efficiency
- LightVLA achieves a high compression rate, retaining approximately 78 visual tokens on average while the baseline model processes 512, indicating significant redundancy in the visual input [6].
- It is the only VLA acceleration method that improves model performance while also delivering a speedup, surpassing all other existing acceleration methods [7].
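As a quick back-of-the-envelope check, the reported numbers are internally consistent:

```latex
\frac{34\,\mathrm{ms} - 21\,\mathrm{ms}}{34\,\mathrm{ms}} \approx 38.2\%,
\qquad
\frac{78}{512} \approx 15.2\%
```

The latency drop matches the raw timings, and roughly 85% of the visual tokens are discarded, consistent with the claim of heavy redundancy in the visual input.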
Tsinghua and Li Auto jointly propose LightVLA: prune redundant tokens and boost inference speed by 38%!
具身智能之心· 2025-09-18 00:03
Core Insights
- The article discusses the development of the LightVLA framework, which aims to enhance the efficiency and performance of Vision-Language-Action (VLA) models in robotics by addressing the computational redundancy associated with visual tokens [2][3].

Research Background and Core Challenges
- VLA models are essential for embodied intelligence, converting visual information and language instructions into executable robot actions. They face a significant bottleneck, however, because computational complexity grows quadratically with the number of visual tokens (the standard cost estimate after this summary makes the scaling concrete) [2].
- Existing optimization methods often trade performance for efficiency, losing critical semantic information [3].

Limitations of Existing Optimizations
- Trade-off between efficiency and performance: many token pruning methods sacrifice performance by retaining a fixed number of tokens [3].
- Incompatible pruning schemes: current vision-language model pruning methods focus on global semantics, which does not transfer well to VLA models that need attention to local semantics [3].
- Poor deployment compatibility: pruning methods based on attention scores do not adapt to mainstream inference frameworks, limiting their practical application [3].

LightVLA Framework Design
- LightVLA lets the model autonomously learn to select task-relevant visual tokens during fine-tuning, rather than relying on manually set pruning ratios [4].
- The framework consists of three modules: a visual encoder, an LLM backbone, and an action head; pruning applies only to visual tokens, and the [CLS] token is retained for global information [4].

Core Methodology: Three-Stage Pruning Process
1. **Query Generation**: task-oriented queries are generated to identify relevant visual tokens without introducing additional parameters [6].
2. **Token Scoring**: each visual token is scored by its relevance to the task, with higher scores indicating stronger associations [10].
3. **Token Selection**: a modified Gumbel-softmax enables differentiable selection, allowing end-to-end training of the pruning process [12].

Experimental Validation and Results Analysis
- LightVLA demonstrated superior performance across tasks in the LIBERO benchmark, achieving an average success rate of 97.4%, a 2.9-percentage-point improvement over the baseline model OpenVLA-OFT [16].
- The framework significantly reduces computational cost, with a 59.1% reduction in FLOPs and a 38.2% decrease in latency, while maintaining high performance [18].

Ablation Studies and Qualitative Validation
- Ablation studies confirmed the effectiveness of the key design choices, showing that the pruning process is task-oriented and adapts dynamically to the requirements of different tasks [20][24].
- LightVLA's pruning strategy retains tokens critical to the task while discarding redundant background tokens [24].

Comparison with MoE
- LightVLA differs fundamentally from the Mixture-of-Experts (MoE) approach: it prioritizes task performance by selecting semantically relevant visual tokens, whereas MoE focuses on balancing expert load without emphasizing semantic relevance [28].
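To make the quadratic-bottleneck claim concrete: up to constant factors, the per-layer self-attention cost for a sequence of $N$ tokens with hidden width $d$ follows the standard estimate below (a textbook approximation, not a figure from the paper):

```latex
\mathrm{cost}(N) \;\approx\; \underbrace{4Nd^{2}}_{\text{Q, K, V, output projections}} \;+\; \underbrace{2N^{2}d}_{\text{attention scores and weighted values}}
```

Cutting the visual tokens from 512 to 78 shrinks the quadratic term over those tokens by $1 - (78/512)^{2} \approx 97.7\%$; the end-to-end reduction is the smaller reported 59.1% because language and action tokens, the linear projection terms, and the action head are unaffected by pruning.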