Diffusion Large Language Model (DLLM)
The first open-source diffusion VLA: Unified DVLA! SOTA performance plus a 4x speedup
具身智能之心 · 2025-11-07 00:05
Core Insights
- The article discusses the Unified Diffusion VLA (UD-VLA) architecture, which integrates image generation and action prediction within a unified framework, leveraging the strengths of Diffusion Large Language Models (DLLMs) [3][19].

Group 1: Unified VLA Model
- The motivation behind the Unified VLA model is to exploit DLLMs' strengths in both generation and understanding, so that image generation and action prediction reinforce each other [3].
- The Joint Discrete Denoising Diffusion Process (JD3P) is introduced, allowing actions and future images to be generated simultaneously within a single denoising process [9][10].

Group 2: Technical Mechanisms
- Unified tokenization converts text, images, and actions into a single multimodal token sequence, with special tokens marking modality boundaries (see the first sketch at the end of this summary) [7].
- A hybrid attention mechanism maintains causal ordering across modalities while ensuring that actions benefit from the progressive denoising of images [7].

Group 3: Training and Inference
- Training consists of two phases: post-training on large video datasets to inject future-image generation capability, then jointly optimizing image and action generation (a sketch of the joint objective follows below) [10].
- Inference uses parallel decoding with adaptive masking: all target positions are initialized as masks and refined over a few iterations (sketched below) [11][12].

Group 4: Performance Evaluation
- UD-VLA achieves state-of-the-art performance, demonstrating a fourfold speedup over autoregressive models while maintaining high action quality [3][19].
- Comprehensive evaluations on benchmarks such as CALVIN, LIBERO, and SIMPLER show UD-VLA's superior performance on long-horizon robotic manipulation tasks [15][16].
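To make the unified tokenization and hybrid attention concrete, here is a minimal PyTorch sketch. The marker tokens (BOT/BOI/BOA), the block layout, and the choice of a causal text block with bidirectional image/action blocks are illustrative assumptions; the article does not specify the exact scheme.

```python
import torch

# Hypothetical special-token ids marking modality boundaries
# (assumed for illustration; not taken from the paper).
BOT, BOI, BOA = 0, 1, 2  # begin-of-text / -image / -action

def build_sequence(text_ids, image_ids, action_ids):
    """Concatenate the three modalities into one multimodal token
    sequence, separated by special marker tokens."""
    return torch.cat([
        torch.tensor([BOT]), text_ids,
        torch.tensor([BOI]), image_ids,
        torch.tensor([BOA]), action_ids,
    ])

def hybrid_attention_mask(n_text, n_image, n_action):
    """Hybrid mask: causal ordering across modality blocks
    (text -> image -> action), bidirectional attention inside the
    image and action blocks so they can be denoised in parallel.
    Returns True where attention is allowed."""
    sizes = [n_text + 1, n_image + 1, n_action + 1]  # +1 for markers
    total = sum(sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for i, size in enumerate(sizes):
        end = start + size
        if i == 0:  # text: standard causal self-attention
            mask[start:end, start:end] = torch.tril(
                torch.ones(size, size, dtype=torch.bool))
        else:       # image/action: fully bidirectional within the block
            mask[start:end, start:end] = True
        mask[start:end, :start] = True  # every block sees earlier blocks
        start = end
    return mask

seq = build_sequence(torch.arange(10, 14),
                     torch.arange(100, 104),
                     torch.arange(200, 203))
mask = hybrid_attention_mask(4, 4, 3)
print(seq.shape, mask.shape)  # torch.Size([14]) torch.Size([14, 14])
```

Because actions sit last in the block ordering, every action position attends to the image tokens being denoised, which is how the design lets action prediction benefit from image generation.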
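The phase-two joint optimization of image and action generation can be pictured as a standard absorbing-state discrete-diffusion objective: corrupt a random fraction of the image and action tokens with a mask token and train the model to recover them. This is a hedged sketch, not the paper's loss: `model`, `mask_id`, and `gen_slice` are placeholders, and the exact corruption schedule and loss weighting are not given in the article.

```python
import torch
import torch.nn.functional as F

def joint_denoising_loss(model, seq, gen_slice, mask_id=3):
    """Masked-token cross-entropy over the image+action span (gen_slice).
    A per-sample mask rate plays the role of the diffusion timestep.
    Assumes at least one token ends up masked per batch."""
    target = seq[:, gen_slice].clone()                     # (B, n_gen)
    rate = torch.rand(seq.size(0), 1)                      # mask rate ~ U(0,1)
    corrupt = torch.rand_like(target, dtype=torch.float) < rate
    noisy = seq.clone()
    noisy[:, gen_slice][corrupt] = mask_id                 # absorb into [MASK]
    logits = model(noisy)[:, gen_slice]                    # (B, n_gen, vocab)
    return F.cross_entropy(logits[corrupt], target[corrupt])
```

Training both spans under one denoising loss is what makes the single model serve as image generator and action head at once.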
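Parallel decoding with adaptive masking is described only at a high level; one plausible instantiation is MaskGIT-style confidence-based unmasking, sketched below. `model` is any network returning per-position logits, and the schedule (commit a growing fraction of the most confident tokens each step) is an assumption, not necessarily the paper's exact rule.

```python
import torch

MASK_ID = 3  # hypothetical [MASK] token id

@torch.no_grad()
def parallel_decode(model, prefix, n_gen, steps=4):
    """Initialize every target position as [MASK], then refine over a few
    iterations, committing the most confident predictions each step."""
    seq = torch.cat([prefix, torch.full((n_gen,), MASK_ID)])
    gen = slice(len(prefix), len(seq))
    for t in range(steps):
        logits = model(seq.unsqueeze(0))[0, gen]          # (n_gen, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        masked = seq[gen] == MASK_ID
        # already-committed tokens get infinite confidence so they survive
        conf = torch.where(masked, conf,
                           torch.full_like(conf, float("inf")))
        k = int(n_gen * (t + 1) / steps)                  # growing keep-share
        keep = conf.topk(k).indices
        out = seq[gen].clone()
        out[keep] = torch.where(masked[keep], pred[keep], out[keep])
        seq[gen] = out
    return seq
```

Refining the whole image+action block in a handful of such iterations, rather than emitting tokens one at a time, is the mechanism behind the reported fourfold speedup over autoregressive decoding.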