Core Insights

The article discusses the challenges faced by Vision-Language-Action (VLA) models in multimodal robotic control, focusing on action representation efficiency and data dependency bottlenecks [3][4].

Group 1: Challenges in VLA Models
- Action representation efficiency is low: traditional continuous-action discretization methods struggle to capture complex spatiotemporal dynamics, leading to growing cumulative errors in long-horizon tasks [4].
- The high cost of real-robot data collection limits model generalization, creating a data dependency bottleneck [4].

Group 2: Proposed Solutions
- A universal action tokenizer framework based on a Convolutional Residual VQ-VAE is proposed to replace traditional discretization methods [4].
- The article shows that the gap between synthetic-domain and real-domain action trajectories is minimal, allowing the tokenizer to be trained on synthetic data at roughly 100 times the scale of prior work [4].
- The VLA model is optimized across three core metrics, with the success rate on long-horizon tasks improving by up to 30% in real-robot experiments [4].

Group 3: Key Technical Solutions
- The Convolutional Residual VQ-VAE architecture uses 2D temporal convolution layers instead of traditional MLPs, yielding a 6.6% improvement in success rate on the LIBERO-10 task [7].
- Action execution frequency improved from 4.16 Hz to 11.84 Hz, speeding up inference [9][18].
- Multi-step action prediction reduces cumulative errors, contributing to long-horizon robustness [9].

Group 4: Experimental Findings
- In simulation, the VQ model achieved an 80.98% success rate on LIBERO-90, surpassing the baseline by 7.45% [17].
- On short-horizon tasks, the VQ model reached a 60.0% success rate on "flip the pot" versus a 30.0% baseline [17].
- On long-horizon tasks, the VQ model achieved a 30.0% success rate on "putting toys in a drawer" versus 5.0% for the baseline, and 50.0% on "putting all cups in a basket" versus 15.0% for the baseline [17].

Group 5: Future Directions
- Expand the dataset by integrating larger-scale synthetic datasets, such as RLBench [19].
- Lighten the model through distillation and quantization techniques to further accelerate inference [19].
- Explore enhanced designs, such as action-frequency conditional encoding, for architectural improvements [19].
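The core of the tokenizer described above is residual vector quantization: each stage quantizes the residual left by the previous stage, so the discrete codes approximate the continuous action trajectory progressively. Below is a minimal NumPy sketch of that quantization step only; the codebook sizes (3 stages, 16 codes, 4-dimensional actions) are illustrative assumptions, not values from the paper, and the convolutional encoder/decoder is omitted.

```python
import numpy as np

def residual_vq_encode(x, codebooks):
    """Quantize vector x with a stack of codebooks.

    Each stage snaps the current residual to its nearest code word,
    then passes the leftover residual to the next stage, so the
    reconstruction error shrinks (or stays equal) stage by stage.
    """
    residual = x.astype(float)
    codes, quantized = [], np.zeros_like(residual)
    for cb in codebooks:                             # cb: (K, D) codebook
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))                  # nearest code word
        codes.append(idx)
        quantized += cb[idx]
        residual = residual - cb[idx]
    return codes, quantized

rng = np.random.default_rng(0)
# Each codebook includes a zero code word, so a stage can leave the
# residual unchanged; refinement is then monotone by construction.
codebooks = [np.vstack([np.zeros((1, 4)), rng.normal(size=(15, 4))])
             for _ in range(3)]
x = rng.normal(size=4)                               # a toy action vector
codes, xq = residual_vq_encode(x, codebooks)
err1 = np.linalg.norm(x - codebooks[0][codes[0]])    # 1-stage error
err3 = np.linalg.norm(x - xq)                        # 3-stage error
```

In the real tokenizer, the per-stage code indices (not the continuous actions) become the discrete tokens the VLA model predicts, which is what lets a shared vocabulary cover both synthetic and real action trajectories.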
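The reported jump from 4.16 Hz to 11.84 Hz (a factor of about 2.85, the "nearly 3x" in the title) follows from multi-step prediction: one forward pass emits a chunk of actions, amortizing model latency over several control steps. A back-of-envelope sketch, where the ~240 ms forward-pass latency and the 4-step chunk are illustrative assumptions (only the 4.16/11.84 Hz figures come from the article):

```python
def control_frequency(forward_latency_s, chunk_len):
    """Actions emitted per second when one forward pass yields chunk_len actions."""
    return chunk_len / forward_latency_s

# One action per pass at an assumed ~240 ms forward pass gives ~4.17 Hz,
# consistent with the reported 4.16 Hz baseline.
base = control_frequency(0.24, 1)
# Emitting an assumed 4-step chunk amortizes the same latency over 4 steps.
chunked = control_frequency(0.24, 4)
# Reported overall speedup:
speedup = 11.84 / 4.16          # ≈ 2.85x
```

The trade-off is that later actions in a chunk act on a stale observation, which is why chunk length is a tuning knob rather than something to maximize.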
VQ-VLA: a large-scale synthetic-data-driven action tokenizer with nearly 3x faster inference
具身智能之心·2025-07-02 10:18