LLaDA
Diffusion language models get an MoE version too! Ant Group & Renmin University train LLaDA-MoE from scratch, full open-source release coming soon
机器之心· 2025-09-12 11:31
机器之心 report. Editor: 张倩. Diffusion language models, the challengers to autoregression, have just reached a new milestone: a joint team from Ant Group and Renmin University of China has trained LLaDA-MoE, the industry's first diffusion language model with a native MoE architecture, from scratch on 20T tokens of data. Although the model activates only 1.4B parameters, its performance rivals Qwen2.5-3B, an autoregressive dense model with more parameters, while delivering faster inference. This provides key validation of the technical feasibility of diffusion language models.

Who would have guessed that models capable of winning gold medals on olympiad math problems cannot "recite a poem backwards." Saying they cannot do it at all would be imprecise: if the model is allowed to "think deeply," number every character of the poem, and then sort the characters in reverse, it can indeed produce the reversed poem. But this is not how humans recite text backwards. When memorizing poetry, people rarely memorize character by character; they store it in units of lines, imagery, and rhythm, and when reciting backwards they "recall" those units in reverse order.

Researchers described this phenomenon in a 2023 paper and named it the "Reversal Curse." A related symptom is that after learning "A is B" (e.g., Paris is the capital of France), a model cannot naturally derive "B is A" (e.g., what is the capital of France). The issue gets singled out because it hurts performance in scenarios that require the model to understand preceding and following context at the same time, or to reason in reverse.

Two years on, large AI ...
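To make the contrast concrete, here is a toy, framework-free sketch (not from the article; token ids and the mask value are made up) of the two conditioning patterns: an autoregressive model predicts each token from its left prefix only, while a masked diffusion model in the LLaDA family samples a mask ratio, hides tokens anywhere in the sequence, and predicts all hidden positions in parallel from both directions, which is why this model family is less tied to a strictly left-to-right view of text.

```python
# Toy, framework-free sketch (token ids and mask value are made up) contrasting
# the conditioning pattern of autoregressive decoding with masked diffusion.
import random

MASK = -1
poem = [11, 22, 33, 44, 55, 66]                 # a "poem" as toy token ids

# Autoregressive: the context for predicting position t is only the left prefix.
ar_context_for_pos_4 = poem[:4]

# Masked diffusion (LLaDA-style): sample a mask ratio, hide tokens anywhere,
# then predict every hidden position in parallel from the surviving tokens
# on BOTH sides.
ratio = random.uniform(0.3, 0.8)
noisy = [tok if random.random() > ratio else MASK for tok in poem]
targets = [i for i, tok in enumerate(noisy) if tok == MASK]

print("AR context for position 4:", ar_context_for_pos_4)   # tokens 0..3 only
print("Diffusion model input:    ", noisy)                   # gaps on both sides
print("Predicted in parallel at: ", targets)
```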
A brand-new paradigm! LLaDA-VLA: the first VLA model built on a large language diffusion model
具身智能之心· 2025-09-12 00:05
Authors: Yuqing Wen et al. Editor: 具身智能之心.

In recent years, vision-language models (VLMs) have advanced by leaps and bounds. Autoregressive models have long dominated the field, demonstrating strong multimodal understanding and generalization, and have pushed vision-language-action models (VLAs) to become a research hotspot for intelligent robot control. However, the unidirectional, sequential generation of autoregressive models imposes an inherent bottleneck on efficiency and flexibility. To break through this limitation, masked diffusion models (MDMs) have risen rapidly: with parallel prediction and multi-round iterative refinement, they show performance and scalability comparable to autoregressive models under large-scale pre-training, with representative work including the large language diffusion model LLaDA and its multimodal extension LLaDA-V.

Yet the value of large language diffusion models for robot action generation remains largely untapped. To fill this gap, we propose LLaDA-VLA, the first VLA model built on a large language diffusion model ...
Challenging next token prediction: are Diffusion LLMs up to it?
机器之心· 2025-06-08 02:11
Group 1
- The article discusses the potential of Diffusion LLMs, particularly Gemini Diffusion, as a significant breakthrough in AI, challenging traditional autoregressive models [3][4][5]
- Gemini Diffusion demonstrates high generation efficiency, achieving an average sampling speed of 1479 TPS and up to 2000 TPS in encoding tasks, outperforming Gemini 2.0 Flash-Lite by 4-5 times [4][6]
- The parallel generation mechanism of the diffusion architecture allows for efficient processing, which could lead to reduced computational costs compared to autoregressive models [6][7] (see the sketch after this list)

Group 2
- Mary Meeker emphasizes that the speed of AI development surpasses that of the internet era, highlighting the cost disparity between AI model training and inference [1][2]
- The article suggests that the rise of open-source models in China may impact the global supply chain, indicating a shift in competitive dynamics within the industry [1][2]
- The balance between computational investment and commercial returns is crucial for enterprises as AI inference costs decline [1][2]
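The cost argument can be pictured with a back-of-the-envelope sketch. Every number below is an assumption chosen for illustration, not Gemini Diffusion's actual configuration: if a diffusion decoder commits several tokens per denoising step, the total number of forward passes for a fixed output length drops even when each parallel step is somewhat more expensive than one autoregressive step.

```python
# Back-of-the-envelope sketch; every number below is an assumption for
# illustration, not Gemini Diffusion's actual configuration.
seq_len = 1024          # tokens to generate
ar_passes = seq_len     # autoregressive decoding: one forward pass per token
diff_steps = 128        # diffusion decoding: ~8 tokens committed per step
step_cost_ratio = 1.5   # assume one parallel denoising step costs 1.5x an AR step

speedup = ar_passes / (diff_steps * step_cost_ratio)
print(f"Forward passes: AR={ar_passes}, diffusion={diff_steps}")
print(f"Approximate wall-clock speedup: {speedup:.1f}x")   # ~5.3x with these numbers
```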
Taking on autoregression: diffusion models are rewriting the next-generation general-purpose model paradigm
机器之心· 2025-06-04 01:59
Core Viewpoint
- The article discusses the advancements in diffusion language models (dLLMs), particularly focusing on Google's Gemini Diffusion and its implications for AI development, highlighting the speed and performance improvements over traditional autoregressive models [1][8][35].

Group 1: Gemini Diffusion and Its Features
- Gemini Diffusion is noted for its impressive generation speed, being five times faster than previous models, and its ability to handle programming tasks effectively [2][8].
- The underlying mechanism of diffusion models allows for rapid iteration and error correction during the generation process, distinguishing it from autoregressive models [2][3].
- Gemini Diffusion's sampling speed can reach an astonishing 1479 tokens per second, showcasing its potential in various benchmarks [8][9].

Group 2: Development of Diffusion Language Models
- Prior to Gemini Diffusion, several research teams explored the feasibility of diffusion-based LLMs, including Stanford's Diffusion-LM and Fudan University's DiffusionBERT [3][4].
- The introduction of LLaDA, the first 8 billion parameter diffusion language model, marked a significant milestone in the field, achieving performance comparable to LLaMA 3 [4][21].
- Following LLaDA, other models like d1 and LaViDa have emerged, further establishing LLaDA as a foundational model in dLLM research [20][21].

Group 3: Multimodal Diffusion Language Models
- The emergence of diffusion multimodal language models (dMLLMs) is highlighted, with LLaDA-V and MMaDA being prominent examples that integrate visual and language processing capabilities [10][31].
- LLaDA-V combines visual instruction fine-tuning with the diffusion mechanism, demonstrating strong performance in multimodal understanding tasks [26][27].
- MMaDA showcases innovations in text reasoning and multimodal understanding, solidifying its position as a leading research outcome in the dMLLM space [31][32].

Group 4: Future Directions and Implications
- The article emphasizes the shift from autoregressive models to diffusion models as a significant paradigm change in AI, suggesting broader implications for future research and applications [35][36].
- The ongoing evolution of models like LLaDA and Gemini Diffusion indicates a growing ecosystem around dLLMs and dMLLMs, with potential applications extending into quantum computing [35][36].
Multimodal diffusion models are starting to take off: this time it's LaViDa, fast, controllable, and capable of learning to reason
机器之心· 2025-05-30 04:16
Core Viewpoint
- The article introduces LaViDa, a large vision-language diffusion model that combines the advantages of diffusion models with the ability to process both visual and textual information effectively [1][5].

Group 1: Model Overview
- LaViDa is a vision-language model that inherits the high speed and controllability of diffusion language models, achieving impressive performance in experiments [1][5].
- Unlike autoregressive large language models (LLMs), diffusion models treat text generation as a diffusion process over discrete tokens, allowing for better handling of tasks requiring bidirectional context [2][3][4].

Group 2: Technical Architecture
- LaViDa consists of a visual encoder and a diffusion language model, connected through a multi-layer perceptron (MLP) projection network [10].
- The visual encoder processes multiple views of an input image, generating a total of 3645 embeddings, which are then reduced to 980 through average pooling for training efficiency [12][13].

Group 3: Training Methodology
- The training process involves a two-stage approach: pre-training to align visual embeddings with the diffusion language model's latent space, followed by end-to-end fine-tuning for instruction adherence [19].
- A third training phase using distilled samples was conducted to enhance the reasoning capabilities of LaViDa, resulting in a model named LaViDa-Reason [25].

Group 4: Experimental Performance
- LaViDa demonstrates competitive performance across various visual-language tasks, achieving the highest score of 43.3 on the MMMU benchmark and excelling in reasoning tasks [20][22].
- In scientific tasks, LaViDa scored 81.4 and 80.2 on ScienceQA, showcasing its strong capabilities in complex reasoning [23].

Group 5: Text Completion and Flexibility
- LaViDa provides strong controllability for text generation, particularly in text completion tasks, allowing for flexible token replacement based on masked inputs [28][30].
- The model can dynamically adjust the number of tokens generated, successfully completing tasks that require specific constraints, unlike autoregressive models [31][32].

Group 6: Speed and Quality Trade-offs
- LaViDa allows users to balance speed and quality by adjusting the number of diffusion steps, demonstrating flexibility in performance based on application needs [33][35] (see the sketch after this list).
- Performance evaluations indicate that LaViDa can outperform autoregressive baselines in speed and quality under certain configurations, highlighting its adaptability [35].
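The step/length trade-off in Groups 5 and 6 can be pictured with a minimal sketch: for a fixed answer length L, fewer diffusion steps K means more tokens must be committed per step, which is faster but leaves less room for iterative refinement. The even unmasking schedule and function name below are assumptions for illustration, not LaViDa's exact implementation.

```python
# Minimal sketch of the step/length trade-off; the even unmasking schedule and
# function name are assumptions, not LaViDa's exact implementation.
def unmask_schedule(answer_len, num_steps):
    """Split answer_len masked positions across num_steps denoising steps."""
    base, rem = divmod(answer_len, num_steps)
    return [base + (1 if i < rem else 0) for i in range(num_steps)]

L = 64                                    # fixed answer length in tokens
for K in (64, 16, 4):                     # many steps = careful, few steps = fast
    per_step = unmask_schedule(L, K)
    print(f"K={K:>2} steps -> up to {max(per_step)} tokens committed per step")
```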
Ditching autoregression! A Chinese team builds LLaDA-V, a purely diffusion-based multimodal large model, setting a new SOTA on understanding tasks
机器之心· 2025-05-27 03:23
Core Viewpoint
- The article discusses the development of LLaDA-V, a pure diffusion multimodal large language model (MLLM) that integrates visual instruction tuning, marking a significant breakthrough in multimodal understanding compared to traditional autoregressive methods [1][16].

Group 1: Model Development
- The research team expanded LLaDA into the multimodal domain, introducing LLaDA-V, which utilizes a visual encoder (SigLIP 2) and an MLP connector to project visual features into the language embedding space, achieving effective multimodal alignment [2].
- LLaDA-V employs a discrete diffusion mechanism during both training and sampling phases, moving away from the autoregressive paradigm [2].

Group 2: Performance Highlights
- LLaDA-V demonstrates strong data scalability and competitive performance, outperforming the autoregressive baseline LLaMA3-V in 11 multimodal tasks, despite LLaDA-8B being slightly inferior to LLaMA3-8B in pure text tasks [5].
- The model achieves state-of-the-art (SOTA) performance in multimodal understanding tasks compared to existing mixed autoregressive-diffusion models, validating the effectiveness of the MLLM architecture based on powerful language diffusion models [8].
- LLaDA-V significantly narrows the performance gap with top autoregressive MLLMs, achieving comparable results in benchmarks like MMStar [10].

Group 3: Core Methodology
- The core of LLaDA-V lies in combining visual instruction tuning with LLaDA's masking diffusion mechanism, allowing for a robust training and inference process [13][15].
- The architecture consists of a classic "visual encoder + MLP projector + language model" setup, where the visual encoder extracts image features, and the MLP projector maps them to LLaDA's embedding space [15].
- LLaDA-V's training objective supports multi-turn multimodal dialogue by masking only the model's responses during training, optimizing the model's ability to generate coherent replies [15] (see the sketch after this list).

Group 4: Future Outlook
- The successful integration of visual instruction tuning with masking diffusion models opens a new technical pathway for MLLM development, challenging the notion that multimodal intelligence must rely on autoregressive models [16].
- The ongoing advancement of language diffusion models is expected to play a more significant role in the future, further pushing the boundaries of multimodal AI [16].
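The response-only masking described in Group 3 can be sketched as follows. This is a hedged illustration of the general idea, assuming PyTorch; the tensor names, the helper function, and the MASK_ID value are placeholders rather than LLaDA-V's actual code: prompt and image tokens are never masked, only response tokens are replaced by the mask token with a sampled ratio, and the loss would be computed only on those masked positions.

```python
# Hedged sketch of the response-only masking objective; tensor names, the
# helper, and MASK_ID are assumptions rather than LLaDA-V's actual code.
import torch

MASK_ID = 126336                              # placeholder [MASK] token id

def mask_responses(input_ids: torch.Tensor, response_mask: torch.Tensor):
    """input_ids: (B, L) token ids; response_mask: (B, L) bool, True on reply tokens."""
    t = torch.rand(input_ids.size(0), 1)                    # per-sample mask ratio
    noise = torch.rand(input_ids.shape)
    is_masked = (noise < t) & response_mask                 # prompt/image tokens stay intact
    noisy = torch.where(is_masked, torch.full_like(input_ids, MASK_ID), input_ids)
    return noisy, is_masked

# Usage idea: feed `noisy` (plus projected visual features) to the bidirectional
# transformer and compute cross-entropy only where `is_masked` is True, e.g.
#   loss = F.cross_entropy(logits[is_masked], input_ids[is_masked])
```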