Stop VLA from "losing focus": a plug-and-play boost to visual generalization, an 18% improvement over Pi0.5
量子位·2026-03-24 23:52

Core Insights
- The article presents DeepVision-VLA, a visual enhancement framework for robot manipulation that addresses the degradation of visual information in the deeper layers of action-prediction models [6][7][24].

Group 1: Research Findings
- The research team found that reliance on key visual tokens decreases as VLA layers deepen, so action prediction becomes less sensitive to critical visual information (see the attention probe sketched after this summary) [4][11][21].
- DeepVision-VLA combines a Vision-Language Mixture-of-Transformers (VL-MoT) framework with an Action-Guided Visual Pruning (AGVP) strategy to keep the model focused on task-relevant visual regions (sketches of both components follow the summary) [8][24][26].

Group 2: Performance Metrics
- In the RLBench simulator, DeepVision-VLA achieved an average success rate of 83%, an 18% improvement over the baseline model Pi0.5 [8][35].
- On real-world tasks, DeepVision-VLA reached a 91.7% average success rate, demonstrating improved precision and stability on complex manipulation [43].

Group 3: Experimental Validation
- The model was tested under varied conditions, including unseen backgrounds and lighting, and maintained stable performance, indicating robust visual modeling [46][48].
- Even when a large fraction of visual tokens was removed in deeper layers, the impact on action prediction was limited, confirming that the model uses visual information more efficiently [25][30].
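The layer-depth finding can be probed with a simple diagnostic: measure the fraction of attention mass each layer assigns to visual tokens and check whether it falls with depth. Below is a minimal sketch in PyTorch; the model interface (returning per-layer attention maps) and the mask name are assumptions for illustration, not the paper's code.

```python
# Sketch: fraction of attention mass on visual tokens, per layer.
# Assumes a transformer that can return per-layer attention maps
# (e.g. HuggingFace-style output_attentions=True); names are hypothetical.
import torch

@torch.no_grad()
def visual_attention_mass(attentions, visual_token_mask):
    """attentions: per-layer tensors of shape [batch, heads, seq, seq].
    visual_token_mask: [seq] bool tensor marking visual-token positions.
    Returns, per layer, the attention probability landing on visual keys."""
    per_layer = []
    for attn in attentions:
        # Average over batch, heads, and query positions -> [seq] over keys,
        # then sum the mass that falls on visual-token keys.
        mass = attn.mean(dim=(0, 1, 2))[visual_token_mask].sum().item()
        per_layer.append(mass)
    return per_layer

# Usage sketch:
# outputs = vla_model(**inputs, output_attentions=True)
# print(visual_attention_mass(outputs.attentions, visual_mask))
```

A curve from this probe that falls with layer index would match the reported observation that deep layers attend less to key visual tokens.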

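The VL-MoT component can be read as a modality-routed transformer block: vision and language tokens share self-attention but pass through separate expert feed-forward weights, so visual features keep dedicated capacity at every depth. The sketch below illustrates that reading; the module names, sizes, and pre-norm layout are assumptions, not the paper's architecture.

```python
# Sketch of a modality-routed block in the spirit of VL-MoT:
# shared self-attention, per-modality expert FFNs. Illustrative only.
import torch
import torch.nn as nn

class ModalityRoutedBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One FFN expert per modality instead of a single shared FFN.
        self.vision_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.text_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x, visual_mask):
        # x: [batch, seq, d_model]; visual_mask: [batch, seq] bool.
        # Shared self-attention over the full multimodal sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Route each token through its modality's expert FFN.
        h = self.norm2(x)
        routed = torch.where(visual_mask.unsqueeze(-1),
                             self.vision_ffn(h), self.text_ffn(h))
        return x + routed
```

For clarity the sketch computes both experts on every token and selects per position; a real implementation would gather tokens per modality first.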
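AGVP, by name, guides pruning with the action prediction itself. One plausible reading, sketched below: score each visual token by the attention it receives from action queries, then keep only the top-k at deeper layers, which is consistent with the ablation showing that removing many deep-layer visual tokens barely affects action prediction. The scoring rule, keep_ratio, and tensor layout are assumptions, not the paper's specification.

```python
# Sketch of action-guided visual pruning: keep the visual tokens that
# action queries attend to most. Single-sample layout for clarity.
import torch

def prune_visual_tokens(hidden, attn, visual_idx, action_idx, keep_ratio=0.5):
    """hidden: [seq, dim]; attn: [heads, seq, seq] from the current layer.
    visual_idx / action_idx: 1-D index tensors for the two token groups.
    Returns the sequence with low-scoring visual tokens dropped."""
    # Attention from action queries to each visual key, averaged over
    # heads and action queries -> one score per visual token.
    scores = attn[:, action_idx][:, :, visual_idx].mean(dim=(0, 1))
    k = max(1, int(keep_ratio * visual_idx.numel()))
    kept = visual_idx[scores.topk(k).indices]
    # Rebuild the sequence: all non-visual tokens plus surviving visual ones.
    keep_mask = torch.ones(hidden.size(0), dtype=torch.bool)
    keep_mask[visual_idx] = False
    keep_mask[kept] = True
    return hidden[keep_mask]
```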