Confronting VLA's "Achilles' Heel": TeleAI Uses "Anti-Exploration" to Improve Embodied Inference Stability

Core Insights
- Vision-Language-Action (VLA) models are developing rapidly in embodied intelligence and show unprecedented generalization, but they suffer from a critical instability at inference time [2][3][4].
- A new framework, TACO (Test-time Anti-exploration via pseudo-Counts), targets this inference instability and offers both a theoretical foundation and a practical solution [2][8].

Group 1: VLA Model Challenges
- Despite strong average performance, VLA models are extremely sensitive to the initial noise sampled during inference: the success rate of the same model can swing between 0% and 80% [4][6].
- The instability is attributed to two factors: redundant action patterns retained from diverse pre-training data, and the multimodal nature of fine-tuning datasets, which may include suboptimal strategies [7][6].

Group 2: TACO Framework
- TACO draws on the "anti-exploration" principle from offline reinforcement learning and aims to constrain generated actions to the successful patterns found in the fine-tuning dataset [9][11].
- The framework has three key components. The one described here is a Coupled Pseudo-Count Estimator, which reuses the VLA model's internal representations so that candidate actions can be validated efficiently without additional training of the VLA model itself [11][12].

Group 3: Implementation and Results
- TACO uses a two-stage inference process: it first generates diverse action candidates, then validates them with pseudo-counts computed by a trained CFN [17][18].
- A Shared Observation Key-Value Cache significantly reduces computational cost, allowing real-time operation with minimal added latency [20][21].
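The two-stage process described under Group 3 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not TACO's implementation: the function names are hypothetical, the policy is a toy bimodal sampler standing in for a VLA action head, the pseudo-count is a simple neighbor count over dataset actions rather than the trained CFN, and picking the candidate with the highest pseudo-count is an assumed selection rule.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Action = Tuple[float, ...]


def neighbor_pseudo_count(
    action: Action, dataset_actions: Sequence[Action], radius: float = 0.25
) -> int:
    """Hypothetical stand-in for a learned pseudo-count.

    Counts how many fine-tuning actions lie within `radius` of the candidate.
    A high count means the candidate resembles well-supported dataset behavior.
    """
    return sum(1 for a in dataset_actions if math.dist(action, a) <= radius)


def select_action(
    sample_action: Callable[[], Action],
    score: Callable[[Action], float],
    num_candidates: int = 8,
) -> Tuple[Action, float]:
    """Sample several candidates, then keep the best-supported one."""
    # Stage 1: draw candidates; each call uses fresh initial noise.
    candidates: List[Action] = [sample_action() for _ in range(num_candidates)]
    # Stage 2: score every candidate and keep the highest-scoring one
    # (ties resolve to the earliest candidate).
    scores = [score(c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]


def make_bimodal_policy(rng: random.Random) -> Callable[[], Action]:
    """Toy policy with two action modes; only the mode near +1.0 is in the dataset."""

    def sample_action() -> Action:
        center = 1.0 if rng.random() < 0.5 else -1.0
        return (center + rng.gauss(0.0, 0.05),)

    return sample_action


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "fine-tuning dataset": successful actions cluster around +1.0.
    dataset = [(1.0 + rng.gauss(0.0, 0.05),) for _ in range(50)]
    policy = make_bimodal_policy(rng)
    action, count = select_action(
        policy, lambda a: neighbor_pseudo_count(a, dataset)
    )
    print("selected action:", action, "pseudo-count:", count)
```

A single sample from the toy policy lands in the unsupported mode about half the time, which mirrors the noise sensitivity described under Group 1. Scoring several candidates and keeping the best-supported one makes that outcome unlikely, at the cost of extra forward passes, which is the overhead the shared key-value cache is meant to reduce.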
Group 4: Experimental Validation
- Comprehensive evaluations across multiple simulated benchmarks and a dual-arm robot platform demonstrate TACO's effectiveness, with average success rates improving by 16% in real-world tasks [22][32].
- Specific tasks, such as "organizing paper and pens," showed a remarkable 25% increase in success rates, highlighting TACO's ability to filter out suboptimal behaviors [32][33].

Group 5: Future Directions
- TACO not only addresses practical challenges but also opens new perspectives for VLA research, suggesting potential expansions into more complex multi-task scenarios and integration with world models for enhanced long-term planning capabilities [35].