Latest from EmbodyX! VOTE: A General Framework for Ensemble Voting and Optimized Acceleration of VLA Models, with 35x Higher Throughput!

Core Insights
- The article discusses how existing vision-language-action (VLA) models generalize poorly to new objects and unfamiliar environments, which motivates VOTE, a more efficient action prediction method [4][6][9].

Group 1: Background and Motivation
- Building a generalist robot policy that can handle diverse tasks and real-world interactions has been a core focus of robotics research [6].
- VLA models perform well in familiar environments but struggle to generalize to unseen scenarios, which has led researchers to explore ways of improving robustness [7][8].

Group 2: VOTE Methodology
- VOTE is introduced as a lightweight VLA model that optimizes trajectories using an ensemble voting strategy, significantly improving inference speed and reducing computational cost [9][14].
- The model needs no additional visual modules or diffusion techniques. It relies solely on the VLM backbone and introduces a special token to streamline action prediction [9][18].
- At action sampling time, an ensemble voting mechanism aggregates the predictions made at previous steps, which improves stability and robustness [22][23].

Group 3: Performance and Evaluation
- Experimental results indicate that VOTE achieves state-of-the-art performance, with a 20% increase in average success rate on the LIBERO task suite and a 3% improvement over CogACT on the SimplerEnv WidowX robot [9][28].
- The model delivers a 35-fold increase in throughput on edge devices such as NVIDIA Jetson Orin, which makes it suitable for real-time applications [9][31].
- VOTE outperforms existing models, reaching a throughput of 42 Hz on edge platforms while keeping memory overhead minimal [31][32].
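
The ensemble voting idea from Group 2 can be illustrated with a small sketch. The summary only says that predictions from previous steps are aggregated, so everything else here is an assumption made for illustration: the cosine-similarity test against the most recent prediction, the `threshold` value, the majority rule, and the function names are hypothetical and may differ from what VOTE actually does.

```python
import math
from typing import List, Sequence

Action = Sequence[float]


def cosine_similarity(a: Action, b: Action) -> float:
    """Cosine similarity of two action vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_action(actions: List[Action]) -> List[float]:
    """Element-wise mean of a non-empty list of equally sized action vectors."""
    count = len(actions)
    return [sum(column) / count for column in zip(*actions)]


def vote_action(candidates: List[Action], threshold: float = 0.5) -> List[float]:
    """Aggregate several predictions of the action for the same timestep.

    `candidates` holds the predictions made for this timestep at earlier
    inference steps, oldest first, with the most recent prediction last.
    Assumed rule (not taken from the source): split the candidates by whether
    they point in a similar direction to the latest prediction, keep the
    larger group, and average it.
    """
    if not candidates:
        raise ValueError("at least one candidate action is required")
    latest = candidates[-1]
    agree = [c for c in candidates if cosine_similarity(c, latest) >= threshold]
    disagree = [c for c in candidates if cosine_similarity(c, latest) < threshold]
    winners = agree if len(agree) >= len(disagree) else disagree
    return mean_action(winners)


if __name__ == "__main__":
    # Three earlier predictions plus the latest one, for a toy 2-D action.
    history = [[1.0, 0.0], [0.8, 0.2], [-1.0, 0.0], [1.0, 0.2]]
    print(vote_action(history))
```

In this toy run the single outlier pointing in the opposite direction is outvoted, so the executed action is the average of the three mutually consistent predictions. Averaging only the majority group, rather than all candidates, is one simple way to keep a stale or erratic prediction from dragging the action off course.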