Core Insights
- LightVLA is a differentiable token pruning framework designed for vision-language-action (VLA) models, enabling them to focus on critical visual information while significantly reducing computational cost and improving performance [2][8].

Group 1: LightVLA Overview
- LightVLA addresses the computational challenges VLA models face on resource-constrained platforms by implementing adaptive, performance-driven visual token pruning [2].
- The framework generates dynamic queries to assess the importance of visual tokens and employs Gumbel softmax for differentiable token selection, retaining the most informative tokens while discarding irrelevant ones [2][3].

Group 2: Performance Metrics
- Experimental results indicate that LightVLA outperforms various VLA models and existing token pruning methods across multiple tasks in the LIBERO benchmark, achieving a 59.1% reduction in computational load (FLOPs) and a 38.2% decrease in latency, while increasing the task success rate by 2.6% [3][8].
- LightVLA reaches a reported success rate of 97.4%, a significant improvement in both efficiency and performance [8].

Group 3: Research Significance
- LightVLA is the first framework to apply adaptive visual token pruning to VLA tasks while simultaneously optimizing efficiency and performance, representing a critical advance toward efficient, powerful, and practical real-time robotic systems [3].
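The query-driven selection described above can be illustrated with a minimal numpy sketch. This is an assumption-laden illustration, not LightVLA's actual implementation: the function name `gumbel_softmax_select`, the score matrix, and the temperature value are hypothetical, and real systems would use an autograd framework so the soft weights carry gradients (the straight-through trick) rather than plain numpy.

```python
import numpy as np

def gumbel_softmax_select(scores, tau=1.0, rng=None):
    """Sketch of differentiable token selection via Gumbel softmax.

    scores: (num_queries, num_tokens) importance logits, one row per
    dynamic query scoring every visual token.
    Returns the soft selection weights and the hard per-query picks.
    """
    rng = np.random.default_rng(rng)
    # Gumbel(0, 1) noise: argmax of (logits + noise) is a sample from
    # softmax(logits), which is what makes the hard choice trainable.
    u = rng.uniform(size=scores.shape)
    gumbel = -np.log(-np.log(u + 1e-10) + 1e-10)
    y = (scores + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)  # numerical stability
    soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    # In a straight-through setup the forward pass uses this hard pick
    # while gradients flow through the soft weights.
    hard = soft.argmax(axis=-1)
    return soft, hard

# Hypothetical example: 2 queries scoring 3 visual tokens; pruning keeps
# the union of picked token indices and drops the rest.
scores = np.array([[2.0, 0.1, -1.0], [0.0, 3.0, 0.5]])
soft, hard = gumbel_softmax_select(scores, tau=0.5, rng=0)
kept = np.unique(hard)  # indices of visual tokens retained
```

A lower temperature `tau` makes the soft weights closer to one-hot, trading gradient smoothness for a sharper match between the soft and hard selections.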
LightVLA: Your VLA Really Can Be Both Strong and Fast!
具身智能之心·2025-10-14 00:02