Li Auto Releases an Optimization Framework for VLA Models in Robotics

Core Viewpoint
- The article introduces LightVLA, a new adaptive visual token pruning framework that improves both the task success rate and the runtime efficiency of robot VLA models, addressing the difficulties that conventional models face in real-world deployment [2][3].

Group 1: Technology Framework
- LightVLA works in three main steps: Query Generation, Token Scoring, and Token Selection. Token queries are generated dynamically and without additional parameters, based on the importance of the visual information [5].
- The framework uses Gumbel-softmax sampling to make token selection differentiable, which allows end-to-end learning and optimization [5].
- In benchmark tests against OpenVLA-OFT, LightVLA raised the average task success rate from 94.5% to 97.4%, cut floating-point operations (FLOPs) by 59.1%, and reduced end-to-end latency by 38.2% (from 34 ms to 21 ms) [5].

Group 2: Performance and Efficiency
- LightVLA retains approximately 78 visual tokens on average, while the baseline model processes 512 tokens (about 15% of the baseline's visual input), indicating significant redundancy in the visual input [6].
- The article describes it as the only VLA acceleration method that improves model performance while also achieving acceleration, outperforming the other existing acceleration methods [7].
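
The three steps named above (Query Generation, Token Scoring, Token Selection) can be illustrated with a minimal sketch. The article does not give the exact formulas, so the details here are assumptions: queries are formed by letting each visual token attend over the instruction tokens (which needs no learned parameters), scores are scaled dot products, and each query picks one token through a Gumbel-softmax sample. The function names (`generate_queries`, `score_tokens`, `select_tokens`, `prune_visual_tokens`) are hypothetical, and the code uses plain Python lists instead of a tensor library, so it shows the selection logic only, not gradient flow.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(logits, tau=1.0):
    # Subtract the max before exponentiating for numerical stability.
    scaled = [l / tau for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate_queries(visual, text):
    # Step 1 (assumed form): each visual token attends over the instruction
    # tokens, so the queries are built without any learned parameters.
    scale = math.sqrt(len(visual[0]))
    queries = []
    for v in visual:
        weights = softmax([dot(v, t) / scale for t in text])
        queries.append([sum(w * t[d] for w, t in zip(weights, text))
                        for d in range(len(v))])
    return queries


def score_tokens(queries, visual):
    # Step 2 (assumed form): every query scores every visual token.
    scale = math.sqrt(len(visual[0]))
    return [[dot(q, v) / scale for v in visual] for q in queries]


def gumbel_softmax(logits, tau, rng):
    # Add Gumbel(0, 1) noise, -log(-log(U)), then apply a tempered softmax.
    noisy = []
    for l in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        noisy.append(l - math.log(-math.log(u)))
    return softmax(noisy, tau)


def select_tokens(scores, tau=1.0, rng=None, training=True):
    # Step 3: each query votes for one token; the union of votes is kept,
    # so the number of retained tokens adapts to the input.
    rng = rng or random.Random()
    keep = set()
    for row in scores:
        probs = gumbel_softmax(row, tau, rng) if training else softmax(row)
        keep.add(max(range(len(probs)), key=probs.__getitem__))
    return sorted(keep)


def prune_visual_tokens(visual, text, tau=1.0, rng=None, training=True):
    queries = generate_queries(visual, text)
    scores = score_tokens(queries, visual)
    return select_tokens(scores, tau, rng, training)


if __name__ == "__main__":
    rng = random.Random(0)
    visual = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
    text = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
    kept = prune_visual_tokens(visual, text, rng=random.Random(1))
    print(f"kept {len(kept)} of {len(visual)} visual tokens: {kept}")
```

In a real training setup the hard argmax would typically be paired with a straight-through estimator so that gradients flow through the soft Gumbel-softmax weights; with `training=False` the sketch falls back to a deterministic argmax over the raw scores.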