New from Yao Mu's team! Discrete diffusion comes to VLA, supporting precise action modeling and consistency training
具身智能之心· 2025-09-02 00:03
Core Viewpoint
- The article introduces the Discrete Diffusion VLA model, which brings discrete diffusion into the Vision-Language-Action (VLA) framework to improve both the efficiency and the accuracy of robotic action decoding [4][7].

Group 1: Background and Problem Statement
- VLA models enable robots to understand visual and language inputs and execute corresponding action sequences. Current VLA frameworks typically adapt large pre-trained vision-language models (VLMs) by adding an action generation head [4].
- Existing decoding methods fall into two categories: autoregressive (AR) methods, which generate actions sequentially, and continuous diffusion methods, which treat action trajectories as continuous signals [4][6].

Group 2: Proposed Solution
- Discrete Diffusion VLA incorporates discrete diffusion into action decoding, using a single Transformer to unify the visual, language, and action modalities without any additional training modules [6][12].
- The model employs an "easy first, hard later" adaptive decoding strategy that decodes action tokens in parallel and can revisit and correct low-confidence predictions, significantly improving accuracy (a minimal sketch follows this summary) [12][18].

Group 3: Performance Metrics
- On the LIBERO benchmark with the Franka Panda robotic arm, the model achieved a 96.3% success rate, outperforming traditional AR and continuous diffusion models [2][12].
- On the Google robot it reached a 71.2% visual-matching success rate, and on the WidowX robot a 49.3% overall success rate in real-to-sim transfer scenarios, demonstrating the model's robustness [2][25].

Group 4: Experimental Results
- Discrete Diffusion VLA consistently outperformed the baselines, averaging a 96.3% success rate across tasks and surpassing the closest competitor, OpenVLA-OFT, by 0.8% [21][22].
- Its performance under visual matching and variant aggregation was also superior, with an overall average success rate of 64.1% across diverse scenarios [23][24].

Group 5: Ablation Studies
- Ablation studies indicate that the adaptive decoding strategy drives much of the gain: the "max confidence" unmasking rule reached a 97.4% success rate, outperforming the other strategies [27].
- The temperature-scheduling scheme proved similarly effective, also reaching 97.4% and confirming the synergy between temperature annealing and adaptive decoding [28].
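To make the "easy first, hard later" adaptive decoding concrete, below is a minimal sketch of confidence-based parallel unmasking with an annealed temperature, in the spirit of what the article describes. This is not the authors' implementation: the `model_logits` callable, the `MASK` placeholder, and the schedule constants are illustrative assumptions.

```python
import numpy as np

MASK = -1  # hypothetical placeholder id for not-yet-decoded action tokens


def softmax(logits, temperature):
    z = logits / temperature
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def adaptive_decode(model_logits, seq_len, n_steps=4, t_start=1.0, t_end=0.1):
    """Decode all action tokens in parallel, committing the most confident
    ("easy") positions first; every step re-predicts the whole sequence, so
    earlier low-confidence choices can still be corrected."""
    tokens = np.full(seq_len, MASK, dtype=int)
    for step in range(n_steps):
        # Linear temperature anneal: explore early, commit late.
        t = t_start + (t_end - t_start) * step / max(n_steps - 1, 1)
        probs = softmax(model_logits(tokens), t)  # (seq_len, vocab)
        preds = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        # "Max confidence" rule: keep only the top fraction this round.
        k = int(np.ceil(seq_len * (step + 1) / n_steps))
        keep = np.argsort(-conf)[:k]
        tokens[:] = MASK
        tokens[keep] = preds[keep]
    return tokens


# Toy usage with a random stand-in for the VLA backbone (hypothetical):
rng = np.random.default_rng(0)
dummy_backbone = lambda toks: rng.normal(size=(toks.shape[0], 256))
print(adaptive_decode(dummy_backbone, seq_len=7))
```

In the full model, these per-position token distributions would come from the unified Transformer conditioned on vision and language; the random backbone here only demonstrates the control flow of parallel decoding with revision.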
cVLA: A key-pose prediction method for efficient camera-space VLA models
具身智能之心· 2025-07-06 11:54
Core Insights
- The article presents cVLA, a Visual-Language-Action (VLA) approach that leverages vision-language models (VLMs) for efficient robot trajectory prediction, addressing the high training costs and data limitations of traditional VLA systems [2][3].

Group 1: Introduction and Background
- VLA models integrate visual, language, and interaction data to enable fine-grained perception and action generation, but they face challenges such as high computational cost, data scarcity, and limited evaluation benchmarks [3].
- The proposed method trains lightweight VLA systems on controllable synthetic datasets, making the approach applicable across domains, particularly in robotics [3].

Group 2: Technical Methodology
- The foundational model builds on the pre-trained VLM PaliGemma2, which predicts key poses of the robot's end effector from real-time images, robot states, and task descriptions [6].
- The system uses single-step prediction to improve training efficiency, predicting two key trajectory poses rather than full trajectories [6][8].
- The method extends to few-shot imitation learning, allowing the model to infer tasks from demonstration image-trajectory pairs without fine-tuning on new scene images [8].

Group 3: Data Generation and Evaluation
- The training dataset is generated with the ManiSkill simulator, which produces diverse environments and tasks and improves the model's generalization to real-world scenarios [9][10].
- Real-world evaluation uses the DROID dataset, whose variety of scenes and actions allows a comprehensive assessment of the model's performance [11].

Group 4: Experimental Results
- Experiments demonstrate that incorporating depth information significantly improves simulation success rates and reduces failure cases [12].
- Across datasets, the model achieves a 70% success rate on the easy version and 28% on the hard version of the CLEVR dataset [16][17].
- The article highlights camera and scene randomization as key to robustness in real-world applications [16].

Group 5: Inference Strategies
- Cropping the input image affects performance, indicating that precise target localization is crucial for successful robot operation [18].
- Among the decoding strategies evaluated, the proposed beam-search-NMS method outperforms traditional approaches in both accuracy and diversity of predicted trajectories (see the sketch below) [20][23].
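As a rough illustration of the beam-search-NMS idea, the sketch below ranks beam candidates by score and greedily suppresses near-duplicate poses, so the surviving predictions stay both high-scoring and diverse. The pose representation (3-D end-effector positions), the distance threshold, and all names are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np


def beam_search_nms(candidates, scores, dist_thresh=0.05, top_k=3):
    """Greedy NMS over beam-search outputs: repeatedly keep the best-scoring
    remaining candidate and drop any candidate within dist_thresh of it,
    yielding accurate *and* diverse pose predictions."""
    candidates = np.asarray(candidates, dtype=float)
    kept = []
    for i in np.argsort(-np.asarray(scores)):  # best score first
        if len(kept) == top_k:
            break
        if all(np.linalg.norm(candidates[i] - candidates[j]) > dist_thresh
               for j in kept):
            kept.append(i)
    return kept


# Toy usage: five beam candidates as (x, y, z) end-effector positions.
beams = [(0.10, 0.20, 0.30), (0.11, 0.20, 0.30),  # near-duplicates
         (0.40, 0.10, 0.25), (0.40, 0.11, 0.25),  # near-duplicates
         (0.05, 0.50, 0.20)]
log_probs = [-0.1, -0.2, -0.3, -0.35, -0.9]
print(beam_search_nms(beams, log_probs))  # -> [0, 2, 4]
```

The design point is that plain beam search tends to return many near-identical trajectories; the suppression step trades a small amount of score for coverage of distinct candidate poses.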