Facing VLA's "Achilles' Heel": TeleAI Improves Embodied Inference Stability
具身智能之心·2025-12-25 01:41

Core Insights
- The article discusses the rapid development of Vision-Language-Action (VLA) models in embodied intelligence and highlights instability at inference time as a challenge that hinders real-world deployment [1][3]
- A new framework called TACO (Test-time Anti-exploration via pseudo-COunts) is introduced to address this instability; experiments show significant improvements in task success rates [1][4]

Group 1: VLA Model Challenges
- VLA models are extremely sensitive to the initial noise sampled during inference: even after fine-tuning, success rates on the same task can range from 0% to 80% depending on that noise [4][5]
- The instability is attributed to two main factors: redundant action patterns retained from diverse training data, and the multimodal nature of fine-tuning datasets, which may encode suboptimal strategies [6][8]

Group 2: TACO Framework
- TACO applies the "anti-exploration" principle from offline reinforcement learning to constrain generated actions to the successful patterns in the fine-tuning dataset and steer away from irrelevant action patterns [10][12]
- The framework includes a Coupled Pseudo-Count Estimator that uses the VLA model's internal representation to validate actions without requiring additional training resources [12][13]

Group 3: Performance Improvements
- TACO raises the average success rate of the π0 model from 32.2% to 41.3% in simulated environments, with notable improvements on challenging tasks [24][26]
- In real-world robot experiments, TACO increased the average success rate from 40% to 56%, with specific tasks seeing improvements of up to 25% [34][32]

Group 4: Technical Mechanisms
- TACO's two-stage inference process first generates diverse action candidates and then validates them through pseudo-counts, ensuring high fidelity in action representation [18][19]
- A shared observation key-value cache significantly reduces computational cost, allowing efficient real-time operation [21][22]

Group 5: Future Directions
- TACO not only addresses practical issues but also opens new perspectives for VLA research, with plans to extend it to more complex multi-task scenarios and to enhance long-term planning capabilities [39][38]
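
The two-stage process described under Group 4 (generate several action candidates, then keep the one with the highest pseudo-count) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration and not TACO's actual implementation: `toy_policy` stands in for a noise-conditioned VLA action head, `featurize` stands in for the model's internal representation, and `kernel_pseudo_count` is a simple Gaussian-kernel substitute for the Coupled Pseudo-Count Estimator, whose details the summary does not give.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def kernel_pseudo_count(feature: Vector,
                        dataset_features: List[Vector],
                        bandwidth: float = 1.0) -> float:
    """Stand-in pseudo-count (assumption): sum of Gaussian kernel
    similarities between a feature and the fine-tuning dataset features.
    A high value means the feature lies close to many dataset examples."""
    total = 0.0
    for ref in dataset_features:
        sq_dist = sum((a - b) ** 2 for a, b in zip(feature, ref))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total


def sample_candidates(policy: Callable[[float], Vector],
                      num_candidates: int,
                      rng: random.Random) -> List[Vector]:
    """Stage 1: draw several actions, each from a different initial noise."""
    return [policy(rng.gauss(0.0, 1.0)) for _ in range(num_candidates)]


def select_by_pseudo_count(candidates: List[Vector],
                           featurize: Callable[[Vector], Vector],
                           dataset_features: List[Vector]
                           ) -> Tuple[int, List[float]]:
    """Stage 2: score every candidate and return the index of the one
    with the highest pseudo-count, together with all scores."""
    counts = [kernel_pseudo_count(featurize(c), dataset_features)
              for c in candidates]
    best = max(range(len(candidates)), key=counts.__getitem__)
    return best, counts


def toy_policy(noise: float) -> Vector:
    """Hypothetical bimodal policy: positive noise yields actions near 0.0
    (the pattern present in the fine-tuning data), negative noise yields
    actions near 3.0 (a redundant pattern left over from pretraining)."""
    return [0.1 * noise] if noise > 0 else [3.0 + 0.1 * noise]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical features of successful fine-tuning actions.
    dataset = [[0.0], [0.1], [-0.05]]
    candidates = sample_candidates(toy_policy, num_candidates=8, rng=rng)
    best, counts = select_by_pseudo_count(candidates, lambda a: a, dataset)
    for cand, count in zip(candidates, counts):
        print(f"candidate={cand[0]:+.3f}  pseudo_count={count:.3f}")
    print("selected:", candidates[best])
```

In this toy setup a single sampled action can land in either mode depending on the noise, which mirrors the noise sensitivity described under Group 1. Selecting by pseudo-count favors candidates near the dataset whenever at least one such candidate was sampled. The sketch does not model the shared observation key-value cache; in a real VLA the observation would presumably be encoded once and reused across all candidates.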