VLA-Pruner: Temporal-Aware Visual Token Pruning for Efficient VLA Inference
具身智能之心 · 2025-11-21 16:03

Group 1
- The core challenge of VLA models is that they must integrate visual scene perception, natural language understanding, and action execution; because visual tokens far outnumber text tokens, this integration incurs significant computational overhead [2][4].
- Existing visual-token pruning methods rank tokens mainly by semantic relevance, neglecting that high-level semantic understanding and low-level action execution need different tokens, so performance drops sharply at high pruning rates [3][4].
- A key observation is that the temporal continuity of robot operation allows the visual tokens needed for the current action to be estimated from historical attention trends, which provides a way past the limitations of existing methods [5].

Group 2
- VLA-Pruner is designed to retain the tokens needed for both semantic understanding and action execution under a given computational budget, achieving efficient inference without performance loss through a dual-level criterion and a matching selection strategy [6][10].
- The dual-level importance criterion combines semantic relevance, measured by prefill attention scores, with action-level importance, estimated by temporally smoothing attention from past action decoding; a minimal sketch of both steps appears after Group 4 [7][9].
- A "merge-filter" mechanism maximizes relevance while minimizing redundancy, ensuring that every token critical to either semantic understanding or action execution is preserved [10][11].

Group 3
- At a 50% pruning rate, VLA-Pruner not only maintains performance but improves success rates, with OpenVLA gaining 2.45% on average [16].
- The method is robust across scenarios, reaching a 96.8% success rate in the SIMPLER environment at a 75% pruning rate and significantly outperforming baseline methods [19][20].
- Efficiency gains are substantial: at a 50% pruning rate, FLOPs fall to roughly 60% of the original model's, and inference runs up to 1.8x faster [26][27].

Group 4
- The core contributions are a dual-level pruning criterion that fixes the inherent flaw of existing methods and a plug-and-play pruning framework that improves inference efficiency without altering the model architecture [31].
- One limitation is that action-attention estimation can become inaccurate in dynamic scenarios with rapid viewpoint shifts or target changes, marking an area for future optimization [31].
- Future directions include adaptive prediction modules and combining the method with techniques such as quantization and layer pruning to further improve deployment efficiency [31].
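To make the dual-level criterion in Group 2 concrete, here is a minimal Python sketch under stated assumptions: action-level importance is tracked with an exponential moving average (EMA) over attention from recent action decoding, and the selection step merges the top tokens under each criterion, filters duplicates, and refills the freed budget. The function names, the EMA form of temporal smoothing, the even budget split, and the combined-score refill are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def update_action_importance(ema, attn_to_vision, alpha=0.9):
    """Temporal smoothing (here an EMA, as an assumed form) of the
    attention mass each visual token received from action tokens at
    recent decode steps. `ema` is None on the first control step."""
    if ema is None:
        return attn_to_vision
    return alpha * ema + (1.0 - alpha) * attn_to_vision

def select_tokens(sem_score, act_score, budget, sem_ratio=0.5):
    """Illustrative merge-filter selection under a fixed token budget.

    sem_score: prefill attention scores (semantic relevance), shape (V,)
    act_score: temporally smoothed action-level importance,   shape (V,)
    Merge the top tokens under each criterion, filter duplicates, then
    refill any freed slots by a combined score so no part of the budget
    is wasted on redundant picks."""
    k_sem = int(budget * sem_ratio)
    top_sem = np.argsort(-sem_score)[:k_sem]
    top_act = np.argsort(-act_score)[:budget - k_sem]
    # Filter: de-duplicate the union while keeping selection order.
    keep = list(dict.fromkeys(np.concatenate([top_sem, top_act]).tolist()))
    if len(keep) < budget:
        combined = sem_score / sem_score.sum() + act_score / act_score.sum()
        for idx in np.argsort(-combined):
            if idx not in keep:
                keep.append(int(idx))
            if len(keep) == budget:
                break
    return np.sort(np.array(keep[:budget]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, budget = 256, 128                  # 50% pruning rate
    sem = rng.random(V)                   # stand-in for prefill attention
    ema = None
    for _ in range(5):                    # a few simulated control steps
        ema = update_action_importance(ema, rng.random(V))
    print(len(select_tokens(sem, ema, budget)), "visual tokens kept")
```

The EMA is one simple way to exploit the temporal continuity noted in Group 1: attention gathered over recent action decoding stands in for the attention that the current, not-yet-decoded action would pay to the visual tokens.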