NUS Proposes VLA-4D: A 4D-Aware VLA Model for Spatiotemporally Coherent Robotic Manipulation

Core Concept
- The article introduces VLA-4D, a 4D-aware vision-language-action (VLA) model that aims to make robotic manipulation spatially and temporally coherent by integrating spatial and temporal information into both visual reasoning and action planning [2][4].

Group 1: Model Design and Technical Details
- VLA-4D's main innovation is a dual spatial-temporal fusion: 4D (3D space + 1D time) information is embedded into the visual representation used for reasoning, and a time variable is added to the action representation used for planning [5].
- 2D VLA models rely on single-frame image input, which leads to coarse visual reasoning and spatial inaccuracy, while 3D VLA models lack explicit temporal modeling, which results in stuttering motion [6].
- A "4D embedding + cross-attention fusion" representation is designed to address the lack of spatial-temporal precision in visual reasoning [7][10].

Group 2: Dataset and Training Process
- Existing VLA datasets lack temporal action annotations, prompting an extension of the LIBERO dataset that comprises 40 sub-tasks and 150,000 vision-language-action samples [15][16].
- A two-stage training process significantly improves task success rates and reduces execution times compared with single-stage fine-tuning [17][18].

Group 3: Experimental Validation and Key Findings
- On the LIBERO benchmark, VLA-4D outperforms state-of-the-art models, with a success rate of 97.4% and an average completion time of 5.8 seconds across tasks [19][21].
- The model generalizes better in zero-shot tasks, maintaining higher success rates and shorter execution times [20].
- Ablation studies confirm that the visual representation modules are necessary: combining spatial and temporal embeddings raises success rates and reduces completion times [24][27].
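
The "4D embedding + cross-attention fusion" idea described under Group 1 can be illustrated with a small NumPy sketch. This is not the authors' implementation: the sinusoidal encoding, the single-head attention, the residual connection, the random projection matrices, and every function name below (`fourier_embed`, `cross_attention_fuse`) are assumptions chosen only to show how per-token (x, y, z, t) coordinates might be injected into visual features.

```python
import numpy as np


def fourier_embed(coords, num_freqs=4):
    """Encode (N, 4) coordinates (x, y, z, t) as sinusoidal features.

    Hypothetical encoding for illustration; returns shape (N, 4 * 2 * num_freqs).
    """
    freqs = 2.0 ** np.arange(num_freqs)                  # (F,)
    angles = coords[:, :, None] * freqs[None, None, :]   # (N, 4, F)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(coords.shape[0], -1)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(visual, emb4d, w_q, w_k, w_v):
    """Single-head cross-attention: visual tokens query the 4D embeddings.

    Returns the fused tokens (residual added) and the attention weights.
    """
    q = visual @ w_q                                     # (N, d)
    k = emb4d @ w_k                                      # (M, d)
    v = emb4d @ w_v                                      # (M, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))       # (N, M)
    return visual + attn @ v, attn


# Toy inputs: 6 visual tokens, each with an assumed 3D position and a shared time.
rng = np.random.default_rng(0)
n_tokens, d_model, num_freqs = 6, 16, 4
visual = rng.normal(size=(n_tokens, d_model))
xyz = rng.uniform(-1.0, 1.0, size=(n_tokens, 3))
t = np.full((n_tokens, 1), 0.5)
coords = np.concatenate([xyz, t], axis=1)                # (6, 4)

emb4d = fourier_embed(coords, num_freqs)                 # (6, 32)
d_emb = emb4d.shape[1]

# Random matrices stand in for learned projection weights.
w_q = rng.normal(size=(d_model, d_model)) * 0.1
w_k = rng.normal(size=(d_emb, d_model)) * 0.1
w_v = rng.normal(size=(d_emb, d_model)) * 0.1

fused, attn = cross_attention_fuse(visual, emb4d, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch the visual tokens act as queries and the 4D embeddings as keys and values, so each fused token keeps its original feature and adds a position-and-time-aware correction. A real model would learn the projections end to end and would likely use multi-head attention.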