Latest from Mu Yao's (穆尧) team: discrete diffusion comes to VLA, supporting precise action modeling and consistent training

Core Viewpoint
- The article introduces Discrete Diffusion VLA, a model that brings discrete diffusion into the Vision-Language-Action (VLA) framework to make robotic action decoding more efficient and more accurate [4][7].

Group 1: Background and Problem Statement
- VLA models let robots interpret visual and language inputs and execute the corresponding action sequences. Current VLA frameworks typically adapt a large pre-trained vision-language model (VLM) by adding an action generation head [4].
- Existing decoding methods fall into two categories: autoregressive (AR) methods, which generate actions sequentially, and continuous diffusion methods, which treat action trajectories as continuous signals [4][6].

Group 2: Proposed Solution
- Discrete Diffusion VLA incorporates discrete diffusion into action decoding, using a single Transformer to unify the visual, language, and action modalities without additional training modules [6][12].
- The model uses an "easy first, hard later" adaptive decoding strategy that decodes actions in parallel and allows error correction, significantly improving accuracy [12][18].

Group 3: Performance Metrics
- On the LIBERO tasks with the Franka Panda robotic arm, the model achieved a 96.3% success rate, outperforming AR and continuous diffusion baselines [2][12].
- With the Google robot, the model scored 71.2% on visual matching, and with the WidowX robot it achieved a 49.3% overall success rate in real-to-sim transfer scenarios, which the article presents as evidence of robustness [2][25].

Group 4: Experimental Results
- Discrete Diffusion VLA outperformed the baselines, with an average success rate of 96.3% across tasks, ahead of the closest model, OpenVLA-OFT, by 0.8 percentage points [21][22].
- It also led on visual matching and variant aggregation, with an overall average success rate of 64.1% across diverse scenarios [23][24].

Group 5: Ablation Studies
- Ablation studies indicate that the adaptive decoding strategy significantly improves performance: the "max confidence" approach reached a 97.4% success rate, ahead of the other strategies tested [27].
- The model's temperature scheduling was also effective, reaching a 97.4% success rate, which supports the synergy between temperature adjustment and adaptive decoding [28].
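
To make the "easy first, hard later" idea concrete, the following is a minimal Python sketch of max-confidence parallel decoding with a decaying temperature. It is an illustration under stated assumptions, not the authors' implementation: the `MASK` sentinel, the `adaptive_decode` and `toy_model` functions, the linear temperature decay, the cosine unmasking schedule, and the way committed tokens are re-scored for possible re-masking are all hypothetical choices made for this example.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a masked action token


def softmax(logits, temperature):
    """Temperature-scaled softmax over one position's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_decode(model, length, steps=4, t_start=1.0, t_end=0.1, seed=0):
    """Iteratively unmask `length` action tokens, most confident first.

    `model(tokens)` is a hypothetical stand-in for the Transformer: it takes
    the partially masked token list and returns one logit list per position.
    """
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        # Assumed schedule: temperature decays linearly from t_start to t_end.
        frac = step / max(steps - 1, 1)
        temperature = t_start + (t_end - t_start) * frac
        logits = model(tokens)

        proposals, confidence = [], []
        for pos in range(length):
            probs = softmax(logits[pos], temperature)
            if tokens[pos] == MASK:
                # Sample a candidate token for a still-masked position.
                tok = rng.choices(range(len(probs)), weights=probs)[0]
            else:
                # Re-score a committed token so a weak early guess can be revisited.
                tok = tokens[pos]
            proposals.append(tok)
            confidence.append(probs[tok])

        # Assumed cosine schedule: few tokens are committed early, all by the end.
        if step == steps - 1:
            keep = length
        else:
            ratio = 1 - math.cos(math.pi / 2 * (step + 1) / steps)
            keep = max(1, math.floor(length * ratio))

        # Max-confidence rule: keep the top `keep` positions, re-mask the rest.
        ranked = sorted(range(length), key=lambda p: confidence[p], reverse=True)
        committed = set(ranked[:keep])
        tokens = [proposals[p] if p in committed else MASK for p in range(length)]
    return tokens


def toy_model(tokens):
    """Hypothetical model: position p strongly prefers token p % 4."""
    return [[8.0 if v == p % 4 else 0.0 for v in range(4)]
            for p in range(len(tokens))]


actions = adaptive_decode(toy_model, length=7)
print(actions)
```

With the toy model, each round commits only the positions the model is most sure about and leaves the rest masked for a later, lower-temperature pass. A real system would replace `toy_model` with the unified Transformer's action-token logits, and the schedules here would need to be tuned or replaced with the ones reported in the paper.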