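Unifying Discrete and Continuous Diffusion: Renmin University and Ant Group Propose LLaDA-o for Efficient Multimodal Understanding and Generation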

Core Insights
- The article discusses the development of LLaDA-o, an efficient and length-adaptive omni diffusion model that addresses the challenge of integrating discrete text diffusion and continuous image diffusion into a unified framework [3][19].

Group 1: Model Performance
- LLaDA-o achieves state-of-the-art (SOTA) performance in both multimodal understanding and text-to-image generation, a significant advance for multimodal diffusion models [3][19].
- On multimodal understanding benchmarks, LLaDA-o outperforms existing diffusion models, scoring 66.1 on MathVista and 87.9 on ChartQA, which makes it the leading diffusion model in this category [7][9].
- The model also excels in fine-grained generation, scoring 87.04 on DPG-Bench and surpassing previous strong models such as SD3-Medium and Lumina-DiMOO [9][11].

Group 2: Technical Innovations
- LLaDA-o employs a Mixture of Diffusion (MoD) framework with two specialized diffusion experts: an Understanding Expert for discrete masked diffusion and a Generation Expert for continuous diffusion. This lets each modality be optimized with the diffusion process that suits it [12][14].
- The model uses intra-modality bidirectional attention, which reduces redundant computation during inference and thereby improves efficiency [15].
- An adaptive length augmentation strategy lets the model adjust its output length dynamically based on context, addressing variable-length text generation without altering the underlying architecture [17].

Group 3: Future Implications
- The successful integration of discrete language understanding and continuous visual generation within the MoD framework positions LLaDA-o as a strong contender against autoregressive models and paves the way for future non-autoregressive architectures [19][20].
- The continued evolution of large language diffusion models suggests that unified diffusion-based models will play a crucial role in the pursuit of artificial general intelligence [20].
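
To make the Mixture of Diffusion idea from Group 2 concrete, the following is a minimal sketch of the two forward (corruption) processes that the two experts would learn to invert. It is an illustration under stated assumptions, not the LLaDA-o implementation: the `[MASK]` token string, the function names, and the linear interpolation toward Gaussian noise used for the continuous branch are all hypothetical choices, since the summary does not specify them.

```python
import random

MASK = "[MASK]"  # hypothetical mask token; the real vocabulary entry is not given in the source


def mask_text(tokens, t, rng):
    """Discrete masked diffusion: replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def noise_latent(x0, t, rng):
    """Continuous diffusion (assumed schedule): interpolate linearly between data and Gaussian noise."""
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in x0]


def corrupt(sample, t, rng):
    """Apply to each modality the corruption that its own expert is trained to undo."""
    return {
        "text": mask_text(sample["text"], t, rng),      # handled by the Understanding Expert
        "image": noise_latent(sample["image"], t, rng),  # handled by the Generation Expert
    }


if __name__ == "__main__":
    rng = random.Random(0)
    sample = {"text": ["a", "red", "cube"], "image": [0.5, -1.0, 2.0]}
    print(corrupt(sample, 0.5, rng))
```

At `t = 0` both branches return the clean input, and at `t = 1` every text token is masked while the image latent is pure noise. The point of the sketch is only that text and image share a noise level interface but use different corruption types, which is what allows separate experts to be trained for each.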
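
The intra-modality bidirectional attention mentioned in Group 2 can be sketched as an attention mask. The summary does not give the exact mask, so this is one plausible reading: every position attends bidirectionally within its own modality block and to all earlier blocks, but never to later ones. The function name and the `segment_ids` encoding are hypothetical.

```python
def block_attention_mask(segment_ids):
    """Build a boolean mask where mask[i][j] is True if query position i may attend to key position j.

    segment_ids assigns each position to a block (for example 0 for the image
    prompt, 1 for the text response) and is assumed to be non-decreasing.
    Attention is bidirectional inside a block and causal across blocks.
    """
    return [[key_seg <= query_seg for key_seg in segment_ids] for query_seg in segment_ids]


if __name__ == "__main__":
    # Two prompt positions (block 0) followed by three response positions (block 1).
    for row in block_attention_mask([0, 0, 1, 1, 1]):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this reading, positions in an earlier block never see later blocks, so their keys and values do not change while a later block is being denoised and could be computed once and reused. That is one way the reported reduction in redundant inference computation could arise.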