Tsinghua and Li Auto propose LightVLA: prune redundant tokens for a 38% inference speedup!
具身智能之心·2025-09-18 00:03

Core Insights
- The article discusses LightVLA, a framework designed to improve both the efficiency and the performance of Vision-Language-Action (VLA) models in robotics by eliminating the computational redundancy of visual tokens [2][3].

Research Background and Core Challenges
- VLA models are central to embodied intelligence: they convert visual observations and language instructions into executable robot actions. Their main bottleneck is that computational cost grows quadratically with the number of visual tokens (see the back-of-the-envelope sketch after this summary) [2].
- Existing optimization methods typically trade performance for efficiency, discarding critical semantic information in the process [3].

Existing Optimization Limitations
- Efficiency-performance trade-off: many token pruning methods retain a fixed number of tokens and sacrifice performance as a result [3].
- Incompatible pruning schemes: pruning methods designed for vision-language models target global semantics, which transfers poorly to VLA models that depend on fine-grained local semantics [3].
- Poor deployment compatibility: pruning based on attention scores does not adapt well to mainstream inference frameworks, limiting practical use [3].

LightVLA Framework Design
- LightVLA lets the model learn, through fine-tuning, to select task-relevant visual tokens on its own, instead of relying on a manually set pruning ratio [4].
- The framework keeps the standard three modules (visual encoder, LLM backbone, and action head) and prunes only visual tokens, retaining the [CLS] token for global information; a pipeline sketch follows this summary [4].

Core Methodology: Three-Stage Pruning Process
1. Query generation: task-oriented queries are derived to locate relevant visual tokens, without introducing additional parameters [6].
2. Token scoring: each visual token is scored by its relevance to the task, with higher scores indicating stronger task association [10].
3. Token selection: a modified Gumbel-softmax makes the selection differentiable, so the pruning process can be trained end to end (a minimal sketch of these three stages follows this summary) [12].

Experimental Validation and Results Analysis
- On the LIBERO benchmark, LightVLA achieves an average success rate of 97.4% across tasks, a 2.9% improvement over the baseline model OpenVLA-OFT [16].
- It also cuts computational cost substantially, reducing FLOPs by 59.1% and latency by 38.2% while maintaining high performance [18].

Ablation Studies and Qualitative Validation
- Ablation studies confirm the key design choices and show that the pruning is task-oriented, adapting dynamically to the demands of different tasks [20][24].
- The pruning strategy retains tokens critical to the task while discarding redundant background tokens [24].

Comparison with MoE
- LightVLA differs fundamentally from the Mixture-of-Experts (MoE) approach: it selects visual tokens by task relevance to preserve task performance, whereas MoE routing balances expert load without regard to semantic relevance [28].
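To make the quadratic-cost claim concrete, here is a rough back-of-the-envelope sketch, not the authors' accounting: it isolates the self-attention term that token pruning attacks directly. The token counts and hidden size are illustrative placeholders, and the overall 59.1% FLOPs reduction reported above is smaller than this single term suggests because projection and MLP costs scale only linearly with the token count.

```python
# Illustrative only: the attention-score and value-weighting FLOPs scale as
# roughly 2 * N^2 * d, which is the quadratic term that visual-token pruning
# reduces. Numbers below are placeholders, not figures from the paper.
def attention_flops(num_tokens: int, hidden_dim: int = 4096) -> int:
    # QK^T scores plus attention-weighted values: ~2 * N^2 * d multiply-adds
    return 2 * num_tokens ** 2 * hidden_dim

full = attention_flops(512)    # e.g. all visual tokens kept
pruned = attention_flops(160)  # e.g. after discarding redundant tokens
print(f"quadratic attention term reduced by {1 - pruned / full:.1%}")  # ~90.2%
```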
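Below is a minimal PyTorch sketch of the three-stage pruning process described above. The summary does not spell out the exact query-generation mechanism or the authors' modified Gumbel-softmax, so the pooled queries, the fixed number of queries, and the function name `prune_visual_tokens` are assumptions made for illustration, not the LightVLA implementation.

```python
# A sketch under stated assumptions, not the LightVLA implementation:
# query generation here is simple pooling (parameter-free, as the summary requires),
# scoring is scaled dot product, and selection uses the standard straight-through
# Gumbel-softmax rather than the authors' modified variant.
import torch
import torch.nn.functional as F


def prune_visual_tokens(visual_tokens: torch.Tensor, num_queries: int = 16,
                        tau: float = 1.0) -> torch.Tensor:
    """visual_tokens: (B, N, D) patch embeddings from the visual encoder."""
    B, N, D = visual_tokens.shape

    # Stage 1 - query generation: pool groups of tokens into task queries
    # (assumes N is divisible by num_queries; purely an illustrative stand-in).
    queries = visual_tokens.reshape(B, num_queries, N // num_queries, D).mean(dim=2)

    # Stage 2 - token scoring: relevance of every token to every query.
    scores = torch.einsum("bqd,bnd->bqn", queries, visual_tokens) / D ** 0.5

    # Stage 3 - differentiable selection: each query picks one token via a
    # hard (straight-through) Gumbel-softmax, so selection trains end to end.
    select = F.gumbel_softmax(scores, tau=tau, hard=True, dim=-1)  # (B, Q, N)

    # Gather the selected tokens; duplicate picks simply repeat in this simplification.
    return torch.einsum("bqn,bnd->bqd", select, visual_tokens)     # (B, Q, D)


# Example: 256 patch tokens of width 1024 reduced to 16 kept tokens.
tokens = torch.randn(2, 256, 1024)
print(prune_visual_tokens(tokens).shape)  # torch.Size([2, 16, 1024])
```

Note that the actual method does not rely on a manually fixed retention count; the fixed `num_queries` here only keeps the sketch compact.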
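Finally, here is where such pruning would sit in the three-module layout described above, reusing `prune_visual_tokens` from the previous sketch. `VLAWithTokenPruning` and the encoder, LLM, and action-head arguments are hypothetical placeholders; only the data flow (prune patch tokens, keep [CLS], fuse with the instruction, decode actions) is taken from the summary.

```python
# Placeholder modules; only the data flow reflects the summary above.
import torch
import torch.nn as nn


class VLAWithTokenPruning(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module, action_head: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        self.action_head = action_head

    def forward(self, images: torch.Tensor, instruction_tokens: torch.Tensor) -> torch.Tensor:
        vis = self.vision_encoder(images)              # (B, 1 + N, D), [CLS] token first
        cls_tok, patches = vis[:, :1], vis[:, 1:]
        kept = prune_visual_tokens(patches)            # prune only the patch tokens
        vis_in = torch.cat([cls_tok, kept], dim=1)     # [CLS] retained for global context
        hidden = self.llm(vis_in, instruction_tokens)  # fuse vision with the instruction
        return self.action_head(hidden)                # decode executable robot actions
```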