ReconVLA: A Robot Perception Method Based on a Reconstructive VLA Model
具身智能之心·2025-08-29 16:03

Core Viewpoint

The article reviews the rapid development of Vision-Language-Action (VLA) models and introduces ReconVLA, a new model that improves the precision of robotic actions by sharpening the model's visual attention on target objects [2][3][27].

Summary by Sections

Introduction

Existing VLA models struggle to allocate visual attention in cluttered scenes, which leads to errors in object manipulation. Traditional approaches that add explicit visual-localization objectives have not meaningfully improved the models' attention distribution [6].

Model Overview

ReconVLA takes a reconstructive approach to visual grounding: before predicting actions, the model must first reconstruct the gaze region, i.e., the image area around the target object. This implicit supervision forces the model to attend to the correct object, which in turn improves action precision [8][11][14]. (A hedged sketch of this training objective is given after the Conclusion below.)

Methodology

The framework consists of two branches: visual reconstruction and action prediction. A frozen visual tokenizer encodes the gaze region into latent tokens, and a diffusion transformer denoises and reconstructs them [13][16]. (The forward-noising step this implies is spelled out in a second sketch below.) To pre-train the model, a large-scale dataset of over 100,000 trajectories and 2 million samples was assembled, strengthening its visual generalization and implicit grounding capabilities [19].

Performance Results

In simulation, ReconVLA reached a success rate of nearly 95% on long-horizon tasks, outperforming existing methods. It also transferred well to unseen objects, maintaining success rates above 40% on novel items [9][26]. On real-world tasks such as stacking bowls and placing fruits, it improved markedly over previous models, reaching up to 90% success on specific tasks [25].

Contributions

ReconVLA is the first model to adopt a gaze-region reconstruction paradigm, which substantially improves visual attention and action-prediction accuracy. Extensive pre-training on diverse data provides a solid foundation for its performance across varied tasks [14][27].

Conclusion

The study highlights the limitations of current VLA models in visual focus and presents ReconVLA as a solution that effectively directs attention to key objects, paving the way for more reliable multi-modal robotic control [27].
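As a rough illustration of the "reconstruct the gaze region, then act" objective described in the Model Overview, here is a minimal PyTorch-style sketch. Everything in it is an assumption rather than the paper's actual code: the module names (backbone, gaze_tokenizer, dit, action_head), the number of reconstruction tokens, the loss weight lambda_rec, and the DDPM-style noise-prediction loss are hypothetical stand-ins for the two-branch design the article describes.

```python
import torch
import torch.nn.functional as F

NUM_REC_TOKENS = 64  # hypothetical: hidden positions reserved for reconstruction

def training_step(batch, backbone, gaze_tokenizer, dit, action_head, lambda_rec=1.0):
    # The multimodal backbone consumes the full image plus the instruction and
    # emits a hidden-state sequence; the first NUM_REC_TOKENS positions act as
    # "reconstruction tokens" that must summarize the gaze region.
    hidden = backbone(batch["image"], batch["instruction"])
    rec_tokens = hidden[:, :NUM_REC_TOKENS]   # conditioning for the visual branch
    act_tokens = hidden[:, NUM_REC_TOKENS:]   # input to the action branch

    # Visual-reconstruction branch: the frozen visual tokenizer encodes the
    # ground-truth gaze crop into clean latents (no gradients flow into it).
    with torch.no_grad():
        z0 = gaze_tokenizer(batch["gaze_crop"])

    # Noise the latents and train the diffusion transformer to predict the
    # noise, conditioned on the reconstruction tokens. This is the implicit
    # supervision that forces attention onto the target object.
    t = torch.randint(0, dit.num_steps, (z0.size(0),), device=z0.device)
    noise = torch.randn_like(z0)
    zt = dit.add_noise(z0, noise, t)          # see the forward-noising sketch below
    loss_rec = F.mse_loss(dit(zt, t, cond=rec_tokens), noise)

    # Action branch: ordinary behavior cloning on continuous actions.
    loss_act = F.mse_loss(action_head(act_tokens), batch["action"])
    return loss_act + lambda_rec * loss_rec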
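```

The design point worth noting in the sketch: because the gaze tokenizer is frozen, gradients from the reconstruction loss flow only through the conditioning tokens, so the backbone is pushed to encode the target region rather than the tokenizer drifting to match the backbone.

For completeness, the dit.add_noise call above can be read as the standard DDPM forward-noising step. The article does not state which diffusion formulation or schedule ReconVLA uses, so the linear beta schedule below is purely illustrative.

```python
import torch

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Linear variance schedule; alpha_bar_t = prod_{s<=t} (1 - beta_s).
    betas = torch.linspace(beta_start, beta_end, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)

def add_noise(z0, noise, t, alpha_bars):
    # z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps
    a = alpha_bars[t].view(-1, *([1] * (z0.dim() - 1)))  # broadcast over latent dims
    return a.sqrt() * z0 + (1.0 - a).sqrt() * noise
```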