Embodied VLA Post-Training: TeleAI Proposes a Latent-Space-Guided Cross-Embodiment Generalization Method for VLA Models
具身智能之心 · 2025-09-16 00:03

Core Insights
- The article discusses challenges and solutions for Vision-Language-Action (VLA) models in cross-embodiment adaptation, highlighting the limitations of existing models and introducing a new framework, "Align then Steer" (ATE), to improve performance in post-training scenarios [1][2][10].

Group 1: Challenges in VLA Models
- Current VLA models require extensive target-domain data for post-training, often dozens to hundreds of hours, and suffer from significant mismatches between the action distributions of the pre-training and post-training phases [1][10].
- The marginal returns of simply stacking more data during post-training diminish rapidly, making this an ineffective way to fit the action distribution of target scenarios [1][11].

Group 2: ATE Framework Introduction
- The ATE framework, proposed by the TeleAI team, aligns action distributions in latent space, enabling efficient adaptation of VLA models without altering their core architecture [2][10].
- ATE shifts the focus from adjusting model architecture to adjusting distributions, significantly reducing the data required for cross-embodiment adaptation [2][15].

Group 3: ATE Framework Mechanism
- The ATE framework consists of two phases: aligning action distributions in latent space, then using a classifier to guide policy updates during post-training [14][19].
- In the alignment phase, two small Variational Autoencoders (VAEs) embed action data into a unified latent space, ensuring that adapted actions stay close to the pre-trained distribution (a minimal sketch follows this summary) [18][19].
- The steering phase integrates a classifier-guidance function that measures the gap between generated actions and the target distribution, steering model outputs toward the desired action distribution (see the guidance sketch after this summary) [21][22].

Group 4: Experimental Results
- In simulation evaluations, the ATE algorithm improved multi-task success rates by an average of 9.8% over direct post-training, with a maximum success-rate gain of 32% in real-world scenarios [23][24].
- The framework remained robust under varied conditions, including lighting changes and external disturbances, maintaining task-relevant focus and the ability to recover [29][30].

Group 5: Conclusion and Future Directions
- The ATE framework offers a viable answer to data scarcity and cross-embodiment adaptation in VLA models, enabling efficient and robust training without large-scale data collection or full model retraining [30].
- The framework can serve as a plug-and-play module compatible with mainstream VLA models, enhancing their cross-embodiment generalization after post-training [30].
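To make the alignment phase concrete, below is a minimal sketch of how two small action VAEs could embed different embodiments' action chunks into one latent space. Everything here is an illustrative assumption: the class name ActionVAE, the layer sizes, the 16-step action chunks, and the use of a shared standard-normal prior as the "unified" latent space; the article does not publish ATE's exact architecture or objective.

```python
# Minimal sketch of ATE's "align" phase (illustrative assumptions throughout:
# names, sizes, and the shared N(0, I) prior are not from the article). Two
# small VAEs embed action chunks from the pre-training embodiment and the
# target embodiment into one latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionVAE(nn.Module):
    """Small VAE over flattened action chunks (horizon * action_dim)."""
    def __init__(self, action_dim: int, latent_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(action_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim)
        )

    def encode(self, a: torch.Tensor):
        h = self.enc(a)
        return self.mu(h), self.logvar(h)

    def forward(self, a: torch.Tensor):
        mu, logvar = self.encode(a)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims, averaged over batch
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()

def vae_loss(vae: ActionVAE, a: torch.Tensor, beta: float = 1e-3) -> torch.Tensor:
    """Reconstruction + KL toward the shared prior. Regularizing BOTH VAEs
    toward the same N(0, I) prior is one simple way to obtain a latent space
    that is 'unified' across embodiments with different action dimensions."""
    recon, mu, logvar = vae(a)
    return F.mse_loss(recon, a) + beta * kl_to_standard_normal(mu, logvar)

# Different embodiments may have different action dimensions; the latent
# dimension is shared. Sizes below are made-up examples.
src_vae = ActionVAE(action_dim=7 * 16)    # e.g., single-arm pre-training data
tgt_vae = ActionVAE(action_dim=14 * 16)   # e.g., bimanual target embodiment
```

Anchoring both encoders to the same prior is what lets adapted actions be compared against, and constrained toward, the pre-trained action distribution in one coordinate system.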
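The steering phase can be sketched in the same spirit. For a diffusion-style action head, classifier guidance shifts the predicted noise by the gradient of a guidance energy at each denoising step, nudging sampled actions toward the target distribution while the backbone stays frozen. The energy function, the guidance scale, and the model signature below are all illustrative assumptions, not TeleAI's implementation.

```python
# Illustrative classifier-guidance step for a diffusion-style action head
# (a generic sketch of the technique, not TeleAI's released code). Assumes
# `tgt_vae` is the ActionVAE from the previous sketch and `model(x_t, t)`
# predicts the diffusion noise for a flattened action chunk.
import torch

def guidance_energy(actions: torch.Tensor, tgt_vae) -> torch.Tensor:
    """Hypothetical guidance function: penalizes action chunks whose latent
    posterior drifts away from the shared latent region."""
    mu, logvar = tgt_vae.encode(actions)
    return 0.5 * (mu.pow(2) + logvar.exp()).sum(-1).mean()

@torch.enable_grad()
def guided_denoise_step(model, actions_t: torch.Tensor, t: torch.Tensor,
                        tgt_vae, scale: float = 0.1) -> torch.Tensor:
    """One reverse-diffusion step with classifier guidance. With an energy
    E = -log p(target | actions), the guided noise is eps + scale * grad(E):
    since eps enters the denoising update with a negative sign, this steers
    samples toward lower energy, i.e. toward the target action distribution."""
    actions_t = actions_t.detach().requires_grad_(True)
    eps = model(actions_t, t)                       # backbone stays frozen
    grad = torch.autograd.grad(guidance_energy(actions_t, tgt_vae), actions_t)[0]
    return (eps + scale * grad).detach()            # steer, don't retrain
```

Because the steering happens in the sampling loop rather than in the weights, the pre-trained VLA backbone and its architecture are left untouched, which matches the plug-and-play claim in Group 5.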