Core Insights
- The article discusses challenges and advances in Vision-Language-Action (VLA) models for robotics, focusing on a limitation of existing models: low-dimensional, sparse action signals are used to supervise high-dimensional, dense visual inputs, which caps overall performance [6][9].

Research Background
- VLA models have made significant progress but still suffer from the mismatch between action supervision signals and visual inputs, leaving the model's representation capacity underused [6].
- Adding a visual prediction mechanism that forecasts future visual states is proposed to strengthen action generation; however, high-dimensional visual states contain substantial redundant information that complicates training [8].

Proposed Solutions
- Decoupled Visual Forecasting (DVF) offloads future-frame prediction from the backbone network, automatically capturing implicit actions and thereby improving explicit action generation (a minimal sketch appears after this summary) [7].
- A progressive pre-training scheme gradually integrates the different modalities, introducing language supervision to preserve the VLA backbone's understanding and reasoning capabilities (see the schedule sketch below) [7].
- Adaptive Temporal Ensemble (ATE) dynamically adjusts the integration strength at inference time, reducing computational cost while keeping actions stable (see the weighting sketch below) [14].

Architecture Design
- The DVF method adds implicit action queries and a separate diffusion-based DVF head, letting the model predict frame-to-frame differences rather than reconstruct complete future frames [10].
- The progressive training scheme introduces visual, language, and action information in phases, avoiding competition between modalities and yielding stable optimization [10].

Experimental Analysis
- Mantis, the proposed model, outperforms existing baselines on three of the four LIBERO task suites and achieves the highest average success rate, 96.7% [16][18].
- Mantis converges significantly faster than conventional visual prediction methods such as UnifiedVLA [20].
- Experiments confirm that language supervision preserves the backbone's capabilities: Mantis leads on both in-domain and out-of-domain instruction tasks [20].

Team Introduction
- The research team, SJTU Deng Lab, focuses on generative models and large language models, collaborates with renowned institutions, and maintains a strong publication record in top-tier journals and conferences [23].
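Below is a minimal PyTorch sketch of the DVF idea as described in the summary, not the authors' code. All module names, dimensions, and the query/attention wiring are assumptions, and the paper's diffusion DVF head is simplified here to an MLP that regresses the latent frame-to-frame difference. The point is the decoupling: a lightweight side head, fed by implicit action queries, forecasts the visual delta so the backbone is not burdened with reconstructing full future frames.

```python
# Hedged sketch of Decoupled Visual Forecasting (DVF); names are illustrative.
import torch
import torch.nn as nn

class DVFModel(nn.Module):
    def __init__(self, d_model=512, n_queries=8, act_dim=7):
        super().__init__()
        # Stand-in for the VLA backbone that fuses vision-language tokens.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Learnable implicit action queries that absorb motion information.
        self.implicit_queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        # Separate lightweight head forecasts the frame-to-frame *difference*
        # of visual latents (the paper uses a diffusion head; an MLP is used
        # here purely for brevity).
        self.dvf_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.action_head = nn.Linear(d_model, act_dim)

    def forward(self, obs_tokens):
        # obs_tokens: (B, T, d_model) fused vision-language tokens
        ctx = self.backbone(obs_tokens)
        q = self.implicit_queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        motion, _ = self.cross_attn(q, ctx, ctx)        # implicit actions
        pooled = motion.mean(dim=1)
        delta_pred = self.dvf_head(pooled)              # forecast latent delta
        action = self.action_head(pooled)               # explicit action
        return action, delta_pred

if __name__ == "__main__":
    model = DVFModel()
    obs = torch.randn(2, 16, 512)        # (batch, tokens, d_model)
    action, delta = model(obs)
    print(action.shape, delta.shape)     # (2, 7) and (2, 512)
    # Training target (assumed): mse(delta_pred, z_next - z_curr) + action loss.
```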
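The phased training idea can be expressed as a simple loss-weight schedule. This is an illustrative sketch only: the phase boundaries, the three loss terms, and the choice of hard (rather than ramped) transitions are assumptions, not values from the paper.

```python
# Hedged sketch of a progressive pre-training schedule: modalities are
# introduced in stages rather than jointly, to avoid gradient competition.
def loss_weights(step, phase1_end=10_000, phase2_end=30_000):
    """Return (w_visual, w_language, w_action) for the current step."""
    if step < phase1_end:          # Phase 1: visual forecasting only
        return 1.0, 0.0, 0.0
    if step < phase2_end:          # Phase 2: add language supervision to
        return 1.0, 1.0, 0.0       # preserve understanding/reasoning
    return 1.0, 1.0, 1.0           # Phase 3: full objective incl. actions

# Assumed usage: total = w_v * loss_dvf + w_l * loss_lang + w_a * loss_action
```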
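For ATE, the summary only states that the temporal ensemble's integration strength is adjusted dynamically at inference. The sketch below applies a common temporal-ensembling pattern (fusing the predictions that overlapping action chunks made for the same timestep) and scales the exponential weighting by prediction disagreement; that specific adaptation rule is an assumption, not necessarily the paper's criterion.

```python
import numpy as np

# Hedged sketch in the spirit of Adaptive Temporal Ensemble (ATE).
def adaptive_temporal_ensemble(preds, base_k=0.1):
    """preds: list of action vectors predicted for the *same* timestep by
    successively older chunks (index 0 = oldest). Returns the fused action."""
    preds = np.asarray(preds)               # (n, act_dim)
    # When past and present predictions agree, average smoothly (small k);
    # when they disagree (e.g. after contact), trust recent ones more.
    disagreement = preds.std(axis=0).mean()
    k = base_k * (1.0 + disagreement)       # adaptive integration strength
    w = np.exp(k * np.arange(len(preds)))   # newer predictions weighted higher
    w /= w.sum()
    return (w[:, None] * preds).sum(axis=0)
```

When predictions agree, the fused action changes smoothly; when they diverge, recent chunks dominate, which is one way to trade stability against responsiveness without extra forward passes.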
Don't let vision drag down action in VLA!
具身智能之心·2025-12-20 01:02