WorldVLA: A World Model Enabling Bidirectional Vision-Action Enhancement, with Significantly Improved Grasping Accuracy
自动驾驶之心 · 2025-07-01 04:04

Core Viewpoint
- WorldVLA is introduced as an autoregressive action world model that unifies action and image understanding and generation; the two components enhance each other, and the combined model outperforms standalone action models and standalone world models [4][7][9].

Group 1: Model Definition and Components
- WorldVLA merges a vision-language-action (VLA) model with a world model that predicts future images from actions and visual understanding [4][6].
- The model uses three separate tokenizers for images, text, and actions, all mapping into one shared vocabulary so that cross-modal understanding and generation happen in a single unified framework [7][14] (see the shared-vocabulary sketch below).
- The action model generates subsequent actions from image observations, while the world model predicts future visual states; the learned dynamics in turn improve the action model's decisions [6][29].

Group 2: Performance and Evaluation
- In experiments, WorldVLA achieves a 4% higher grasping success rate than a standalone action model and a 10% lower Fréchet Video Distance (FVD) than a standalone world model [8][27].
- The attention mask strategy markedly mitigates the performance degradation that appears when generating sequences of actions, improving grasping success rates by 4% to 23% [8][32] (see the attention-mask sketch below).
- Performance correlates positively with image resolution, indicating that higher resolution supplies richer visual information for robotic tasks [27].

Group 3: Training Strategy and Data
- WorldVLA is trained on a mix of action model data and world model data; learning the environment's physics through the world model objective improves action generation [16][22].
- In training, the action model produces actions from text instructions and image observations, while the world model predicts the next image frame from the current observation and action [17][18] (see the sample-layout sketch below).
- The loss function rebalances the contributions of action data and world model data so that training stays effective despite the large disparity in token counts between the two [22] (see the weighted-loss sketch below).

Group 4: Contributions and Innovations
- The attention mask strategy lets each action in a chunk be generated independently of previously generated actions, reducing error propagation in sequential action generation [19][20].
- WorldVLA generates longer video sequences better than a pure world model, highlighting the benefit of integrating an action model [31].
- The architecture and training strategy point to further gains from pre-training with world model data [36].
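The shared vocabulary can be pictured as offsetting each tokenizer's ids into disjoint ranges of one id space. The sketch below is a minimal illustration of that idea only; the vocabulary sizes, the helper names (`image_token`, `action_token`), and the uniform action binning are assumptions for illustration, not WorldVLA's reported configuration.

```python
# Minimal sketch of one vocabulary shared by three tokenizers.
# All sizes and names below are hypothetical.

TEXT_VOCAB = 32_000     # text subword ids:    [0, 32000)
IMAGE_CODES = 8_192     # VQ image codes:      [32000, 40192)
ACTION_BINS = 256       # discretized actions: [40192, 40448)

IMAGE_OFFSET = TEXT_VOCAB
ACTION_OFFSET = TEXT_VOCAB + IMAGE_CODES
VOCAB_SIZE = TEXT_VOCAB + IMAGE_CODES + ACTION_BINS

def image_token(code: int) -> int:
    """Map a VQ codebook index into the shared vocabulary."""
    return IMAGE_OFFSET + code

def action_token(value: float, low: float, high: float) -> int:
    """Uniformly discretize one action dimension into ACTION_BINS bins,
    then map the bin index into the shared vocabulary."""
    t = (value - low) / (high - low)
    bin_id = min(ACTION_BINS - 1, max(0, int(t * ACTION_BINS)))
    return ACTION_OFFSET + bin_id
```

Because every modality lands in the same id space, a single autoregressive transformer head can predict text, image, and action tokens interchangeably.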
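To make the error-propagation fix concrete, here is a minimal sketch of an attention mask in which each action in a chunk attends to the observation/text prefix (and to itself, causally) but not to earlier predicted actions. The layout (a `prefix_len` conditioning block followed by fixed-length action slots) and the function name `action_chunk_mask` are assumptions; the paper's exact masking may differ.

```python
import torch

def action_chunk_mask(prefix_len: int, n_actions: int, act_len: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed) for a sequence of
    prefix_len conditioning tokens followed by n_actions action slots of
    act_len tokens each."""
    total = prefix_len + n_actions * act_len
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))
    for i in range(n_actions):
        s_i = prefix_len + i * act_len
        for j in range(i):
            s_j = prefix_len + j * act_len
            # Block attention from action i's tokens to action j's tokens.
            mask[s_i:s_i + act_len, s_j:s_j + act_len] = False
    return mask
```

With this mask, decoding stays token-autoregressive, but each action is conditioned only on the observation, so an error in an early action cannot corrupt the actions generated after it.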
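Assuming both objectives are trained as next-token prediction over the shared vocabulary, the two data types could be laid out as below, with labels set to -100 (the usual ignore index) on the conditioning prefix so the loss covers only the generated tokens. The helper names and layout are illustrative, not the paper's exact format.

```python
IGNORE = -100  # label value skipped by the cross-entropy loss

def action_sample(text_ids, image_ids, action_ids):
    """Action-model data: (instruction, observation) -> action."""
    inputs = text_ids + image_ids + action_ids
    labels = [IGNORE] * (len(text_ids) + len(image_ids)) + action_ids
    return inputs, labels

def world_sample(image_ids, action_ids, next_image_ids):
    """World-model data: (observation, action) -> next image frame."""
    inputs = image_ids + action_ids + next_image_ids
    labels = [IGNORE] * (len(image_ids) + len(action_ids)) + next_image_ids
    return inputs, labels
```

Mixing both sample types in each batch is what lets the world model objective teach environment dynamics that the action head can exploit.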
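One way to read the loss balancing: an image frame decodes into thousands of tokens while an action is only a handful, so a naive average over a mixed batch would be dominated by world model tokens. The sketch below takes a per-type mean and reweights with a coefficient `alpha`; the coefficient and its value are hypothetical, not a number reported for WorldVLA.

```python
import torch.nn.functional as F

def mixed_loss(action_logits, action_labels, world_logits, world_labels, alpha=0.1):
    """action_logits/world_logits: (N, V) flattened logits;
    action_labels/world_labels: (N,) long tensors, -100 on the prefix."""
    # Mean cross-entropy per token *within* each data type.
    l_action = F.cross_entropy(action_logits, action_labels, ignore_index=-100)
    l_world = F.cross_entropy(world_logits, world_labels, ignore_index=-100)
    # alpha sets the relative emphasis between the two objectives.
    return l_action + alpha * l_world
```

Taking the mean within each data type already removes the raw token-count imbalance; `alpha` then controls how strongly the world model objective shapes the shared backbone.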