Core Insights

- The article presents a new approach to Vision-Language-Action (VLA) models that leverages pre-trained vision-language models (VLMs) for efficient robot trajectory prediction, addressing the high training costs and data limitations of traditional VLA systems [2][3].

Group 1: Introduction and Background

- VLA models integrate visual, language, and interaction data to enable fine-grained perception and action generation, but face challenges such as high computational cost, data scarcity, and limited evaluation benchmarks [3].
- The proposed method trains lightweight VLA systems on controllable synthetic datasets, an approach applicable across domains, particularly robotics [3].

Group 2: Technical Methodology

- The foundational model is based on the pre-trained VLM PaliGemma2, which predicts key poses of the robot's end effector from real-time images, robot states, and task descriptions [6].
- The system uses single-step prediction to improve training efficiency, predicting two key trajectory poses rather than full trajectories [6][8].
- The method extends to few-shot imitation learning: the model infers tasks from demonstration image-trajectory pairs without requiring fine-tuning on new scene images [8].

Group 3: Data Generation and Evaluation

- The training dataset is generated with the ManiSkill simulator, which creates diverse environments and tasks, improving the model's ability to generalize to real-world scenarios [9][10].
- Real-world evaluation uses the DROID dataset, which covers varied scenes and actions and allows a comprehensive assessment of the model's performance [11].

Group 4: Experimental Results

- Experiments demonstrate that incorporating depth information significantly improves simulation success rates and reduces failure cases [12].
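The key-pose representation described in Group 2 — having a VLM emit end-effector poses as a short sequence of discrete tokens — can be sketched as a simple binning scheme. The bin count, value ranges, and function names below are illustrative assumptions, not the paper's actual tokenization:

```python
# Hedged sketch: discretizing a 7-DoF end-effector keypose (xyz position +
# quaternion) into integer bins so a VLM can emit it as a short token
# sequence. The bin count and value ranges are illustrative assumptions,
# not the paper's actual scheme.

N_BINS = 256

def discretize_keypose(pos, quat, pos_range=(-1.0, 1.0)):
    """Map continuous pose values to integer bins in [0, N_BINS - 1]."""
    lo, hi = pos_range

    def to_bin(v, v_lo, v_hi):
        b = int((v - v_lo) / (v_hi - v_lo) * N_BINS)
        return max(0, min(N_BINS - 1, b))

    # Positions use pos_range; quaternion components lie in [-1, 1].
    return [to_bin(v, lo, hi) for v in pos] + [to_bin(v, -1.0, 1.0) for v in quat]

def undiscretize_keypose(bins, pos_range=(-1.0, 1.0)):
    """Invert the binning, returning bin-center values as (pos, quat)."""
    lo, hi = pos_range
    centers = [(b + 0.5) / N_BINS for b in bins]
    pos = [c * (hi - lo) + lo for c in centers[:3]]
    quat = [c * 2.0 - 1.0 for c in centers[3:]]
    return pos, quat
```

With 256 bins over a 2 m workspace, the round-trip quantization error is under half a centimeter per axis, which is why such coarse token vocabularies can still support manipulation.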
- Performance is evaluated across datasets, with success rates of 70% on the easy version and 28% on the hard version of the CLEVR dataset [16][17].
- The article highlights the importance of camera and scene randomization for achieving robustness in real-world applications [16].

Group 5: Inference Strategies

- Input image cropping affects performance, indicating that precise target localization is crucial for successful robot operation [18].
- Several decoding strategies are evaluated; the proposed beam-search-NMS method outperforms traditional approaches in both accuracy and diversity of predicted trajectories [20][23].
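The beam-search-NMS idea above — keep high-likelihood beam candidates but suppress near-duplicates so the surviving trajectory set stays diverse — can be sketched as follows. The distance metric, threshold, and function names are assumptions for illustration; the paper's actual suppression criterion may differ:

```python
import math

def traj_distance(a, b):
    """Mean Euclidean distance between two keypose sequences of equal length."""
    dists = [math.dist(p, q) for p, q in zip(a, b)]
    return sum(dists) / len(dists)

def beam_search_nms(candidates, scores, dist_thresh=0.1, k=3):
    """Pick up to k high-scoring, mutually diverse trajectory candidates.

    Hedged sketch of non-maximum suppression over beam-search outputs:
      candidates: list of decoded keypose sequences, each a list of (x, y, z)
      scores:     per-candidate log-probabilities from beam search
    A candidate is suppressed when it lies within dist_thresh of an
    already-selected candidate, so near-duplicate beams collapse to one.
    """
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    selected = []
    for i in order:
        if all(traj_distance(candidates[i], candidates[j]) >= dist_thresh
               for j in selected):
            selected.append(i)
        if len(selected) == k:
            break
    return selected
```

Plain beam search tends to return many small variations of the same trajectory; greedily selecting by score while rejecting anything within the suppression radius is what trades a little likelihood for the diversity the article reports.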
cVLA: A Key-Pose Prediction Method for Efficient Camera-Space VLA Models
具身智能之心·2025-07-06 11:54