Core Insights

- The main focus of the news is on the emerging diffusion architecture for language models, which offers advantages over traditional autoregressive models in terms of speed and computational efficiency [1][4][20].

Group 1: Diffusion Architecture Advantages

- Diffusion architecture allows for direct modification and control of tokens during inference, eliminating the need to regenerate entire segments of content as required by autoregressive models [1][5].
- The newly released LLaDA 2.0 model has reached a scale of 100 billion parameters, marking a significant milestone in the development of diffusion language models [1][20].
- Diffusion models are described as "data-hungry," requiring larger datasets for training than autoregressive models, but they can absorb data more quickly [5][8].

Group 2: Technical Developments

- The LLaDA model employs a "fill-in-the-blank" prediction method, which contrasts with the sequential token generation of autoregressive models [6][8].
- The architecture combines global and causal attention mechanisms to improve computational efficiency while maintaining coherence in generated sequences [16].
- The research team has made significant strides in addressing architectural challenges, including the integration of mixture of experts (MoE) within the diffusion framework [19].

Group 3: Industry Impact and Future Directions

- Major tech companies, including Google and ByteDance, are actively exploring diffusion models, indicating growing interest in this technology [1][19].
- The development of a new inference engine, dInfer, is expected to enhance the performance of diffusion models, with potential for significant speed improvements in key applications [24][25].
- The community is encouraged to collaborate in building the ecosystem around diffusion language models, which are still in the early stages of development [27].
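The "fill-in-the-blank" decoding described above can be illustrated with a toy sketch. This is not the actual LLaDA algorithm; it is a minimal illustration of the masked-diffusion idea, with a random stand-in denoiser instead of a trained model. Each step predicts all masked positions in parallel and keeps only the most confident predictions, and already-filled tokens can in principle be revised directly rather than regenerated left to right.

```python
import random

MASK = "_"
VOCAB = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (illustrative only)

def toy_denoiser(tokens):
    """Stand-in for a trained denoiser: returns (prediction, confidence)
    for every masked position. A real model would attend bidirectionally
    over the whole sequence to make these predictions."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length=5, steps=3, seed=0):
    """Iteratively unmask a fully masked sequence over a few parallel steps."""
    random.seed(seed)
    tokens = [MASK] * length                      # start fully masked
    for step in range(steps):
        preds = toy_denoiser(tokens)
        if not preds:
            break
        # Unmask a growing fraction of positions, highest confidence first.
        k = max(1, len(preds) // (steps - step))
        for i, (tok, _) in sorted(preds.items(),
                                  key=lambda kv: -kv[1][1])[:k]:
            tokens[i] = tok
        # Note: a diffusion decoder could also remask and revise tokens
        # here -- the "direct token modification" the article refers to.
    return tokens

print(diffusion_decode())
```

The contrast with autoregressive decoding is that all masked positions are scored in one forward pass per step, so the number of model calls depends on the step count rather than the sequence length.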
Skipping "token-by-token generation": Ant Group's Zhao Junbo says diffusion models let us modify tokens directly