TACO Framework
Confronting the "Achilles' Heel" of VLA: TeleAI Improves the Stability of Embodied Inference
具身智能之心· 2025-12-25 01:41
Core Insights
- The article discusses the rapid development of Vision-Language-Action (VLA) models in embodied intelligence and highlights instability at inference time as an obstacle to real-world deployment [1][3]
- A new framework called TACO (Test-time Anti-exploration via pseudo-COunts) is introduced to address this instability; experiments show significant gains in task success rates [1][4]

Group 1: VLA Model Challenges
- VLA models are extremely sensitive to the initial noise used during inference: even after fine-tuning, success rates can range from 0% to 80% [4][5]
- The instability is attributed to two main factors: redundant action patterns retained from diverse training data, and the multimodal nature of fine-tuning datasets, which may encode suboptimal strategies [6][8]

Group 2: TACO Framework
- TACO applies the "anti-exploration" principle from offline reinforcement learning to keep generated actions within the successful patterns of the fine-tuning dataset and away from irrelevant action patterns [10][12]
- The framework includes a Coupled Pseudo-Count Estimator that uses the VLA model's internal representation to validate actions without requiring additional training resources [12][13]

Group 3: Performance Improvements
- TACO raises the average success rate of the π0 model from 32.2% to 41.3% in simulated environments, with notable improvements on challenging tasks [24][26]
- In real-world robot experiments, TACO raised the average success rate from 40% to 56%, with specific tasks improving by up to 25% [34][32]

Group 4: Technical Mechanisms
- TACO's two-stage inference process first generates diverse action candidates and then validates them with pseudo-counts, ensuring high fidelity in action representation [18][19]
- A shared observation key-value cache significantly reduces computational cost, allowing efficient real-time operation [21][22]

Group 5: Future Directions
- TACO not only addresses practical issues but also opens new perspectives for VLA research, with plans to extend it to more complex multi-task scenarios and to enhance long-term planning capabilities [39][38]
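
The two-stage process described under Group 4 (sample several candidates, then keep the one the pseudo-count rates as most familiar) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_action`, `toy_policy`, `toy_pseudo_count`, and `DATA_MODE` are hypothetical names, the policy is a stand-in that simply adds the initial noise to the observation, and the pseudo-count is a toy function that peaks at one assumed data mode.

```python
import numpy as np


def select_action(obs, sample_action, pseudo_count, num_candidates=8, seed=0):
    """Sample candidate actions, then keep the one with the highest pseudo-count."""
    rng = np.random.default_rng(seed)
    # Stage 1: one candidate per draw of initial noise.
    candidates = [
        sample_action(obs, rng.standard_normal(obs.shape))
        for _ in range(num_candidates)
    ]
    # Stage 2: score every candidate and pick the most "familiar" one.
    counts = np.array([pseudo_count(obs, action) for action in candidates])
    best = int(np.argmax(counts))
    return candidates[best], counts


# Hypothetical stand-ins, used only to make the sketch runnable.
DATA_MODE = np.array([1.0, 0.0])  # assumed location of a successful action mode


def toy_policy(obs, noise):
    # Stand-in for a noise-conditioned policy: the output depends on the noise.
    return obs + noise


def toy_pseudo_count(obs, action):
    # Stand-in score: large near DATA_MODE, small far away from it.
    return float(np.exp(-np.sum((action - DATA_MODE) ** 2)))


obs = np.zeros(2)
chosen, counts = select_action(obs, toy_policy, toy_pseudo_count)
```

Because the selection happens only at test time, the policy itself is left unchanged; the cost is the extra candidate samples, which is the overhead the shared observation cache is said to reduce.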
Confronting the "Achilles' Heel" of VLA: TeleAI Uses "Anti-Exploration" to Improve the Stability of Embodied Inference
机器之心· 2025-12-24 07:40
Core Insights
- The article discusses the rapid development of Vision-Language-Action (VLA) models in embodied intelligence, noting their unprecedented generalization capabilities while addressing the critical issue of instability at inference time [2][3][4].
- A novel framework named TACO (Test-time Anti-exploration via pseudo-Counts) is introduced to tackle this inference instability, offering both a solid theoretical foundation and a practical solution [2][8].

Group 1: VLA Model Challenges
- Despite impressive average performance, VLA models are extremely sensitive to the initial noise used during inference, so the success rate of a single model can fluctuate between 0% and 80% [4][6].
- The instability is attributed to two main factors: redundant action patterns retained from diverse pre-training data, and the multimodal nature of fine-tuning datasets, which may include suboptimal strategies [7][6].

Group 2: TACO Framework
- TACO draws on the "anti-exploration" principle from offline reinforcement learning, aiming to constrain generated actions to the successful patterns in the fine-tuning dataset [9][11].
- The framework has three key components, the first of which is a Coupled Pseudo-Count Estimator that uses the VLA model's internal representation, allowing efficient validation without additional training [11][12].

Group 3: Implementation and Results
- TACO uses a two-stage inference process: it generates diverse action candidates and then validates them with pseudo-counts, which are computed by a trained CFN (Coin Flipping Network) [17][18].
- A Shared Observation Key-Value Cache significantly reduces computational cost, allowing efficient real-time operation with minimal latency [20][21].

Group 4: Experimental Validation
- Comprehensive evaluations across multiple simulated benchmarks and a dual-arm robot platform demonstrate TACO's effectiveness, with the average success rate on real-world tasks improving by 16 percentage points [22][32].
- Specific tasks, such as "organizing paper and pens," showed a 25% increase in success rate, highlighting TACO's ability to filter out suboptimal behaviors [32][33].

Group 5: Future Directions
- TACO not only addresses practical challenges but also opens new perspectives for VLA research, suggesting potential expansion into more complex multi-task scenarios and integration with world models for enhanced long-term planning [35].
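
The summary states only that the pseudo-counts are computed by a trained CFN. Assuming this refers to a Coin Flipping Network, the counting idea behind it can be illustrated without any network at all: each visit to an input is paired with a random vector of ±1 "coin flips", a squared-error regressor converges to the mean of those vectors, and the mean of n such vectors has an expected squared coordinate of 1/n. The sketch below computes that mean directly instead of training a model; `coin_flip_pseudo_count`, `dim`, and `seed` are hypothetical names and parameters, not taken from the article.

```python
import numpy as np


def coin_flip_pseudo_count(num_visits, dim=4096, seed=0):
    """Estimate a visit count from the average of random +/-1 vectors."""
    rng = np.random.default_rng(seed)
    # One random vector in {-1, +1}^dim per visit to the same input.
    flips = rng.choice([-1.0, 1.0], size=(num_visits, dim))
    # A regressor trained with squared error on these targets would converge
    # to their mean; the mean is computed directly here (no training).
    prediction = flips.mean(axis=0)
    # Each coordinate of the mean has expected square 1 / num_visits,
    # so inverting the mean squared coordinate gives a count estimate.
    return 1.0 / np.mean(prediction ** 2)


estimates = {n: coin_flip_pseudo_count(n) for n in (1, 5, 50)}
```

The estimate is noisy for finite `dim` and tightens as `dim` grows. In a learned version the prediction comes from a network applied to features of the input, so inputs resembling frequently seen data receive large pseudo-counts and unfamiliar ones receive small pseudo-counts.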