Diffusion Models
Challenging Autoregression: Diffusion Models Are Rewriting the Paradigm for the Next Generation of General-Purpose Models
机器之心· 2025-06-04 01:59
Core Viewpoint
- The article discusses the advancements in diffusion language models (dLLMs), particularly focusing on Google's Gemini Diffusion and its implications for AI development, highlighting the speed and performance improvements over traditional autoregressive models [1][8][35].

Group 1: Gemini Diffusion and Its Features
- Gemini Diffusion is noted for its impressive generation speed, being five times faster than previous models, and its ability to handle programming tasks effectively [2][8].
- The underlying mechanism of diffusion models allows for rapid iteration and error correction during the generation process, distinguishing it from autoregressive models [2][3].
- Gemini Diffusion's sampling speed can reach 1,479 tokens per second, showcasing its potential across various benchmarks [8][9].

Group 2: Development of Diffusion Language Models
- Prior to Gemini Diffusion, several research teams explored the feasibility of diffusion-based LLMs, including Stanford's Diffusion-LM and Fudan University's DiffusionBERT [3][4].
- The introduction of LLaDA, the first 8-billion-parameter diffusion language model, marked a significant milestone in the field, achieving performance comparable to LLaMA 3 [4][21].
- Following LLaDA, other models like d1 and LaViDa have emerged, further establishing LLaDA as a foundational model in dLLM research [20][21].

Group 3: Multimodal Diffusion Language Models
- The emergence of diffusion multimodal language models (dMLLMs) is highlighted, with LLaDA-V and MMaDA being prominent examples that integrate visual and language processing capabilities [10][31].
- LLaDA-V combines visual instruction fine-tuning with the diffusion mechanism, demonstrating strong performance in multimodal understanding tasks [26][27].
- MMaDA showcases innovations in text reasoning and multimodal understanding, solidifying its position as a leading research outcome in the dMLLM space [31][32].

Group 4: Future Directions and Implications
- The article emphasizes the shift from autoregressive models to diffusion models as a significant paradigm change in AI, suggesting broader implications for future research and applications [35][36].
- The ongoing evolution of models like LLaDA and Gemini Diffusion indicates a growing ecosystem around dLLMs and dMLLMs, with potential applications extending into areas such as quantum computing [35][36].
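To make the iterative, error-correcting generation described above concrete, here is a minimal sketch of how a masked-diffusion language model might decode: the sequence starts fully masked and is revealed over a few parallel refinement steps, with low-confidence positions left open for later revision. The `dummy_model` stand-in, vocabulary size, and confidence-based schedule are illustrative assumptions, not Gemini Diffusion's or LLaDA's actual decoding procedure.

```python
import numpy as np

MASK = -1          # sentinel id for a masked position
VOCAB = 32         # toy vocabulary size (assumption)
rng = np.random.default_rng(0)

def dummy_model(tokens):
    """Stand-in for a diffusion LM: returns logits for every position.
    A real dLLM would condition on the partially unmasked sequence."""
    return rng.normal(size=(len(tokens), VOCAB))

def diffusion_decode(length=16, steps=4):
    tokens = np.full(length, MASK)                 # start fully masked
    for step in range(steps):
        logits = dummy_model(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        pred = probs.argmax(-1)                    # best token per position
        conf = probs.max(-1)                       # its confidence
        # Reveal only the most confident positions this step; everything
        # else stays masked and can be revised later -- the kind of
        # error correction autoregressive decoding cannot do.
        k = int(length * (step + 1) / steps)
        keep = np.argsort(-conf)[:k]
        new_tokens = np.full(length, MASK)
        new_tokens[keep] = pred[keep]
        tokens = new_tokens
    return tokens

print(diffusion_decode())
```

Because every position is predicted in parallel at each step, the number of model calls is the number of refinement steps rather than the number of tokens, which is where the throughput figures quoted above come from.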
Multimodal Diffusion Models Are Taking Off: This Time It's LaViDa, Fast, Controllable, and Able to Learn Reasoning
机器之心· 2025-05-30 04:16
Core Viewpoint
- The article introduces LaViDa, a large vision-language diffusion model that combines the advantages of diffusion models with the ability to process both visual and textual information effectively [1][5].

Group 1: Model Overview
- LaViDa is a vision-language model that inherits the high speed and controllability of diffusion language models, achieving impressive performance in experiments [1][5].
- Unlike autoregressive large language models (LLMs), diffusion models treat text generation as a diffusion process over discrete tokens, allowing for better handling of tasks that require bidirectional context [2][3][4].

Group 2: Technical Architecture
- LaViDa consists of a visual encoder and a diffusion language model, connected through a multi-layer perceptron (MLP) projection network [10].
- The visual encoder processes multiple views of an input image, generating a total of 3,645 embeddings, which are then reduced to 980 through average pooling for training efficiency [12][13].

Group 3: Training Methodology
- The training process involves a two-stage approach: pre-training to align visual embeddings with the diffusion language model's latent space, followed by end-to-end fine-tuning for instruction adherence [19].
- A third training phase using distilled samples was conducted to enhance the reasoning capabilities of LaViDa, resulting in a model named LaViDa-Reason [25].

Group 4: Experimental Performance
- LaViDa demonstrates competitive performance across various visual-language tasks, achieving the highest score of 43.3 on the MMMU benchmark and excelling in reasoning tasks [20][22].
- In scientific tasks, LaViDa scored 81.4 and 80.2 on ScienceQA, showcasing its strong capabilities in complex reasoning [23].

Group 5: Text Completion and Flexibility
- LaViDa provides strong controllability for text generation, particularly in text completion tasks, allowing flexible token replacement based on masked inputs [28][30].
- The model can dynamically adjust the number of tokens generated, successfully completing tasks that require specific constraints, unlike autoregressive models [31][32].

Group 6: Speed and Quality Trade-offs
- LaViDa allows users to balance speed and quality by adjusting the number of diffusion steps, demonstrating flexibility based on application needs [33][35].
- Performance evaluations indicate that LaViDa can outperform autoregressive baselines in both speed and quality under certain configurations, highlighting its adaptability [35].
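The 3,645-to-980 reduction quoted above is easy to picture as 2D average pooling over per-view patch grids. The sketch below assumes a split of 5 views with 27×27 patch embeddings each, pooled to 14×14 per view so that the totals match; the per-view factorization, embedding width, and projector shape are assumptions for illustration, not LaViDa's released code.

```python
import torch
import torch.nn.functional as F

# Assumed split: 5 views x 27x27 patch embeddings = 3645 tokens total.
# Only the totals (3645 -> 980) come from the article; the per-view
# 27x27 -> 14x14 factorization and width 1152 are guesses.
views, grid, dim = 5, 27, 1152
vis_tokens = torch.randn(views, grid * grid, dim)        # encoder output

x = vis_tokens.transpose(1, 2).reshape(views, dim, grid, grid)
x = F.adaptive_avg_pool2d(x, output_size=14)             # 27x27 -> 14x14 per view
pooled = x.flatten(2).transpose(1, 2).reshape(-1, dim)   # (5 * 196, dim)
print(pooled.shape)                                      # torch.Size([980, 1152])

# A small MLP projector (as described in the summary) would then map
# these pooled embeddings into the diffusion LM's embedding space.
projector = torch.nn.Sequential(
    torch.nn.Linear(dim, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
)
lang_tokens = projector(pooled)                          # (980, 4096)
```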
Ditching Autoregression! A Chinese Team Builds LLaDA-V, a Pure-Diffusion Multimodal Large Model, Setting a New SOTA on Understanding Tasks
机器之心· 2025-05-27 03:23
Core Viewpoint
- The article discusses the development of LLaDA-V, a pure diffusion multimodal large language model (MLLM) that integrates visual instruction tuning, marking a significant breakthrough in multimodal understanding compared to traditional autoregressive methods [1][16].

Group 1: Model Development
- The research team expanded LLaDA into the multimodal domain, introducing LLaDA-V, which uses a visual encoder (SigLIP 2) and an MLP connector to project visual features into the language embedding space, achieving effective multimodal alignment [2].
- LLaDA-V employs a discrete diffusion mechanism during both training and sampling, moving away from the autoregressive paradigm [2].

Group 2: Performance Highlights
- LLaDA-V demonstrates strong data scalability and competitive performance, outperforming the autoregressive baseline LLaMA3-V on 11 multimodal tasks, despite LLaDA-8B being slightly inferior to LLaMA3-8B on pure text tasks [5].
- The model achieves state-of-the-art (SOTA) performance in multimodal understanding tasks compared to existing hybrid autoregressive-diffusion models, validating the effectiveness of an MLLM architecture built on a powerful language diffusion model [8].
- LLaDA-V significantly narrows the performance gap with top autoregressive MLLMs, achieving comparable results on benchmarks such as MMStar [10].

Group 3: Core Methodology
- The core of LLaDA-V lies in combining visual instruction tuning with LLaDA's masking diffusion mechanism, allowing for a robust training and inference process [13][15].
- The architecture consists of a classic "visual encoder + MLP projector + language model" setup, where the visual encoder extracts image features and the MLP projector maps them into LLaDA's embedding space [15].
- LLaDA-V's training objective supports multi-turn multimodal dialogue by masking only the model's responses during training, optimizing the model's ability to generate coherent replies [15].

Group 4: Future Outlook
- The successful integration of visual instruction tuning with masking diffusion models opens a new technical pathway for MLLM development, challenging the notion that multimodal intelligence must rely on autoregressive models [16].
- The ongoing advancement of language diffusion models is expected to play a larger role in the future, further pushing the boundaries of multimodal AI [16].
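The "mask only the model's responses" training objective described in Group 3 can be sketched as follows. The mask-token id, the 1/t reweighting, and the model signature are assumptions based on how masked-diffusion objectives are usually written, not LLaDA-V's actual implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336   # placeholder mask-token id; the real id is model-specific

def masked_diffusion_loss(model, input_ids, response_mask):
    """Sketch of a LLaDA-style objective where only response tokens are noised.
    `response_mask` is True at answer positions and False on the prompt
    (and any projected image tokens), so the prompt stays fully visible --
    this is what supports multi-turn dialogue as described in the article."""
    t = torch.rand(input_ids.size(0), 1, device=input_ids.device)   # noise level per sample
    noise = (torch.rand_like(input_ids, dtype=torch.float) < t) & response_mask
    noisy = torch.where(noise, torch.full_like(input_ids, MASK_ID), input_ids)
    logits = model(noisy)                        # assumed shape: (B, L, vocab)
    loss = F.cross_entropy(logits[noise], input_ids[noise], reduction="none")
    # 1/t reweighting follows the usual masked-diffusion formulation (assumption).
    return (loss / t.expand_as(input_ids)[noise]).mean()
```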
10,000 Tokens in 12 Seconds! Google Launches the Text "Diffusion Model" Gemini Diffusion; Researchers Say the Demo Had to Be Slowed Down to Watch
量子位· 2025-05-21 10:39
Core Viewpoint
- Google DeepMind has introduced Gemini Diffusion, a new language model that uses diffusion technology to significantly enhance text generation speed and quality compared to traditional autoregressive models [1][4][9].

Group 1: Technology and Performance
- Gemini Diffusion can generate text at 2,000 tokens per second, faster than the previous model, Gemini 2.0 Flash-Lite [7][11].
- The model learns to generate output by refining noise, allowing rapid iteration and error correction during the generation process [6][10][15].
- Unlike traditional models that generate one token at a time, Gemini Diffusion can generate entire blocks of tokens simultaneously, resulting in more coherent responses [14][9].

Group 2: Benchmarking and Comparisons
- Benchmark tests show that Gemini Diffusion performs comparably to larger models, with specific metrics indicating it outperforms Gemini 2.0 Flash-Lite in several coding tasks [8].
- For example, on the HumanEval benchmark, Gemini Diffusion achieved a score of 76.0%, slightly higher than Gemini 2.0 Flash-Lite's 75.8% [8].

Group 3: Implications and Future Directions
- The introduction of diffusion technology into language models suggests a potential shift toward more hybrid models in the future, as seen in similar research by other institutions [19][20].
- The ability to perform non-causal reasoning during text generation opens up new possibilities for complex problem-solving tasks that traditional autoregressive models struggle with [16][17].
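As a back-of-the-envelope check on how the headline and body figures relate (simple arithmetic on the numbers quoted above, not an additional measurement):

```latex
\frac{10{,}000\ \text{tokens}}{12\ \text{s}} \approx 833\ \text{tokens/s (end-to-end average)}
\qquad\text{vs.}\qquad
2{,}000\ \text{tokens/s (quoted sampling rate)}
```

The gap between the two presumably reflects overhead beyond raw sampling, such as prompt processing and refinement passes, though the article does not break this down.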
New Work from Kaiming He et al. Keeps It Simple: Swapping Instantaneous Velocity for Average Velocity Boosts One-Step Generation by 70%
量子位· 2025-05-21 06:31
Core Viewpoint
- The article discusses the introduction of a new model called MeanFlow, which uses average velocity to build a one-step generation framework, significantly improving the state of the art (SOTA) in image generation tasks [1][5][10].

Group 1: Model Development
- MeanFlow is trained from scratch without any pre-training, distillation, or curriculum learning, achieving a Fréchet Inception Distance (FID) of 3.43, a notable improvement over previous one-step diffusion/flow models [3][10][13].
- The model introduces the concept of average velocity to represent flow fields, in contrast to the instantaneous velocity used in flow matching [5][9].

Group 2: Experimental Results
- Experiments on ImageNet at 256×256 resolution show that MeanFlow achieves a 50% to 70% relative improvement in FID over previous state-of-the-art methods [13][19].
- An ablation study evaluated various configurations and their corresponding FID scores, with the best results achieved under specific parameter settings [15][19].

Group 3: Scalability and Comparison
- MeanFlow scales well with model size, with different configurations yielding FID scores competitive with other generative models [16][19].
- A comparison with other generative models indicates that MeanFlow significantly narrows the gap between one-step diffusion/flow models and their multi-step predecessors [19][20].

Group 4: Research Team and Background
- The research was conducted by a team from MIT and CMU, including notable contributors such as PhD student Zhengyang Geng and other students of Kaiming He [21][22][23].
- The team aims to bridge the gap between generative modeling and physical simulation, addressing multi-scale simulation problems [20].
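The "average instead of instantaneous velocity" idea can be written compactly. Using standard flow-matching notation (the notation here is mine, consistent with the summary but not copied from the paper), the average velocity over an interval and the resulting update are:

```latex
u(z_t, r, t) \;\triangleq\; \frac{1}{t-r}\int_{r}^{t} v(z_\tau, \tau)\,\mathrm{d}\tau ,
\qquad
z_r \;=\; z_t - (t-r)\, u(z_t, r, t).
```

Setting $r=0$ and $t=1$ gives one-step generation, $z_0 = z_1 - u(z_1, 0, 1)$, with $z_1$ drawn from noise: a network trained to approximate $u$ can therefore generate in a single evaluation, whereas a network trained on the instantaneous $v$ must be integrated over many small steps.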
TransDiffuser: The Architecture Behind Li Auto's VLA Diffusion-Based Trajectory Generation
理想TOP2· 2025-05-18 13:08
Core Viewpoint
- The article discusses advances in autonomous driving, focusing on the diffusion model and its application to generating driving trajectories, and highlights the differences between the VLM and VLA systems [1][4].

Group 1: Diffusion Model Explanation
- Diffusion is a generative model that learns a data distribution through a process of adding noise (the forward process) and removing noise (the reverse process), akin to solving a puzzle in reverse [4].
- The denoising process trains a neural network to predict and remove noise, ultimately generating the target data [4].
- Diffusion not only generates the ego vehicle's trajectory but also predicts the trajectories of other vehicles and pedestrians, improving decision-making in complex traffic environments [5].

Group 2: VLM and VLA Systems
- The VLM approach consists of two systems: System 1 imitates learning to output trajectories without semantic understanding, while System 2 has semantic understanding but only provides suggestions [2].
- VLA is a single system with both fast and slow thinking capabilities, inherently possessing semantic reasoning [2].
- VLA outputs action tokens that encode the vehicle's driving behavior and surrounding environment, which are then decoded into driving trajectories by the diffusion model [4][5].

Group 3: TransDiffuser Architecture
- TransDiffuser is an end-to-end trajectory generation model that integrates multi-modal perception information to produce high-quality, diverse trajectories [6][7].
- The architecture includes a Scene Encoder for processing multi-modal data and a Denoising Decoder that uses the DDPM framework for trajectory generation [7][9].
- The model employs a multi-head cross-attention mechanism to fuse scene and motion features during denoising [9].

Group 4: Performance and Innovations
- The model achieves a Predictive Driver Model Score (PDMS) of 94.85, outperforming existing methods [11].
- Key innovations include anchor-free trajectory generation and a multi-modal representation decorrelation mechanism that increases trajectory diversity and reduces redundancy [11][12].

Group 5: Limitations and Future Directions
- The authors note challenges in fine-tuning the model, particularly the perception encoder [13].
- Future directions involve integrating reinforcement learning and drawing on models such as OpenVLA [13].
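To visualize how a denoising decoder with multi-head cross-attention might consume fused scene features, here is a toy single-step sketch. The waypoint dimensionality, feature widths, and module layout are assumptions chosen for illustration; this is not TransDiffuser's actual architecture or code.

```python
import torch
import torch.nn as nn

class DenoisingStep(nn.Module):
    """Toy sketch of one reverse-diffusion step for trajectory generation:
    noisy waypoints attend to fused scene/motion features via multi-head
    cross-attention, then predict the noise to remove (DDPM-style).
    Dimensions and layer choices are illustrative assumptions."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.embed = nn.Linear(2, d)                    # (x, y) waypoints
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.head = nn.Linear(d, 2)                     # predicted noise

    def forward(self, noisy_traj, scene_feats):
        q = self.embed(noisy_traj)                      # (B, T, d)
        fused, _ = self.cross_attn(q, scene_feats, scene_feats)
        return self.head(fused)                         # epsilon prediction

step = DenoisingStep()
traj = torch.randn(1, 8, 2)           # 8 noisy (x, y) waypoints
scene = torch.randn(1, 64, 256)       # fused multi-modal scene tokens (assumed shape)
eps_hat = step(traj, scene)           # would drive one DDPM update of the trajectory
print(eps_hat.shape)                  # torch.Size([1, 8, 2])
```

In a full DDPM loop this step would be applied repeatedly, gradually turning Gaussian noise into a drivable trajectory conditioned on the scene.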
Lights On or Off with One Click! Google Uses Diffusion Models to Take Cinematic Lighting Control to the Extreme
机器之心· 2025-05-16 04:39
Core Viewpoint
- Google has launched LightLab, a project that allows precise control over lighting in images, enabling users to adjust light source intensity and color and to insert virtual light sources into scenes [1][2].

Group 1: Technology and Methodology
- LightLab uses a fine-tuned diffusion model trained on a specially constructed dataset to achieve precise control over lighting in images [7][11].
- The dataset combines real images with controlled lighting changes and synthetic images generated by a physically based renderer, allowing the model to learn complex lighting effects [10].
- The model can simulate indirect lighting, shadows, and reflections, providing a photorealistic prior for lighting control [10][11].

Group 2: Data Collection and Processing
- The research team captured 600 pairs of photos depicting the same scene with a single light source turned on and off, using automatic exposure settings to ensure good exposure [22][23].
- The dataset was expanded to approximately 36,000 images through post-processing to cover a range of intensities and colors [27].
- The team applied a consistent tone-mapping strategy and separated target light source changes from ambient light in the images [17][18].

Group 3: Model Training and Evaluation
- The model was trained for 45,000 steps at a resolution of 1024 × 1024, using a learning rate of 1e-5 and a batch size of 128, taking about 12 hours on 64 v4 TPUs [28].
- Evaluation metrics included Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM), with user studies conducted to validate the results [29].
- The model outperformed previous methods, achieving a PSNR of 23.2 and an SSIM of 0.818 [31][33].

Group 4: Applications and Features
- LightLab offers a range of lighting control features, allowing users to adjust light source intensity and color interactively [12][38][41].
- The technology enables the insertion of virtual point light sources into scenes, expanding creative possibilities [44].
- Separating the target light source from ambient light also gives control over natural light entering a scene, which is typically difficult to manage [45].
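For reference, the PSNR metric quoted above (23.2 for LightLab) is computed as shown below. This is the standard definition, demonstrated here on random images rather than on anything from the paper.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio between two images in [0, max_val], in dB.
    Higher is better; this is the metric reported in the article."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

a = np.random.rand(256, 256, 3)
b = np.clip(a + np.random.normal(scale=0.05, size=a.shape), 0, 1)
print(round(psnr(a, b), 2))   # roughly 26 dB for 5% Gaussian noise
```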
DiffMoE: Dynamic Token Selection Powers a Leap in Diffusion Model Performance; Kuaishou & Tsinghua Team Set a New Benchmark for Visual Generation!
机器之心· 2025-05-16 02:42
In generative AI, diffusion models have become the mainstream architecture for image generation. However, traditional diffusion models handle different noise levels and conditional inputs uniformly, failing to exploit the heterogeneous nature of the diffusion process and leading to low computational efficiency. Recently, the Kling team released DiffMoE (Dynamic Token Selection for Scalable Diffusion Transformers), which pushes the efficiency and performance frontier of diffusion models through an innovative dynamic token selection mechanism and a global token pool design. The work was completed jointly by Tsinghua University and Kuaishou's Kling team; the first author is Minglei Shi, an undergraduate in Tsinghua University's Intelligent Vision Lab.

Core breakthrough: dynamic token selection and global context awareness
- Paper title: DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
- Project page: https://shiml20.github.io/DiffMoE/
- Paper: https://arxiv.org/abs/2503.14487
- Code: https://github.com/KwaiVGI/DiffMoE

Performance gains: a parameter-efficient model that does more with less
...
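The two ideas named above, dynamic token selection and a global (batch-level) token pool, can be sketched as an expert-choice MoE layer in which experts pick tokens from the flattened batch rather than a fixed number per sample. The layer sizes, capacity factor, and routing details below are illustrative assumptions, not DiffMoE's released implementation.

```python
import torch
import torch.nn as nn

class GlobalPoolMoE(nn.Module):
    """Illustrative sketch of batch-level ("global token pool") routing:
    tokens from all samples/noise levels are flattened into one pool and
    each expert selects its highest-scoring tokens, so harder noise levels
    can receive more compute. A simplification, not DiffMoE's exact code."""
    def __init__(self, d=64, n_experts=4, capacity=0.25):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])
        self.capacity = capacity

    def forward(self, x):                        # x: (B, T, d)
        flat = x.reshape(-1, x.shape[-1])        # global pool over the batch
        scores = self.router(flat).softmax(-1)   # (N, n_experts)
        out = torch.zeros_like(flat)
        k = int(self.capacity * flat.shape[0])   # tokens taken per expert
        for e, expert in enumerate(self.experts):
            idx = scores[:, e].topk(k).indices   # expert-choice selection
            out[idx] += scores[idx, e].unsqueeze(1) * expert(flat[idx])
        return out.view_as(x)

moe = GlobalPoolMoE()
print(moe(torch.randn(8, 16, 64)).shape)         # torch.Size([8, 16, 64])
```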
Why Are Today's AI Product Managers All Losing Money?
36Kr· 2025-05-06 01:50
Core Insights
- The current landscape of AI product management is characterized by a focus on iterative improvements rather than creating products from scratch, leading to instability and financial losses for AI product managers [1][21].
- The transformer model, while popular, is not necessarily the best architecture for AI applications, as it struggles with issues such as hallucination and high training costs [2][5].
- The emergence of alternative models, such as diffusion models and the Yan model, signals a shift in the AI landscape, with potential implications for product design and functionality [3][5].

Group 1: AI Product Management Challenges
- AI product managers are primarily engaged in API integration rather than developing proprietary models, limiting their ability to innovate and compete [6][8].
- The high costs of AI model fine-tuning and infrastructure, including server and operational expenses, create significant barriers to profitability [9][10].
- User acquisition for AI products still relies on traditional internet marketing strategies, which may not be enough to differentiate AI offerings in a crowded market [10][12].

Group 2: User Perception and Market Dynamics
- The transition of AI from novelty to necessity has not yet been fully realized, as the productivity gains from AI tools remain unclear [15][20].
- Despite AI's potential to assist with various tasks, the need for human oversight and correction limits the efficiency gains users actually experience [17][21].
- Users' willingness to pay for AI services is low: many seek free alternatives or hesitate to invest in AI tools that do not demonstrate clear value [21][22].
CVPR 2025 | How Can Personalized Multi-Person Images Be Generated Stably and Efficiently? ID-Patch Offers a New Solution
机器之心· 2025-05-03 04:18
Core Viewpoint
- The article discusses the advances and challenges in personalized multi-person image generation with diffusion models, highlighting the ID-Patch mechanism, which addresses identity leakage and improves accuracy in both positioning and identity representation [1][5][21].

Group 1: Challenges in Multi-Person Image Generation
- Personalized single-person image generation has achieved impressive visual results, but generating images with multiple people introduces additional complexity [4].
- Identity leakage is a significant challenge: features of different individuals can blend, making it difficult to keep them distinct [2][4].
- Existing methods like OMG and InstantFamily have attempted to tackle identity confusion but face limitations in efficiency and accuracy, especially as the number of individuals increases [4][14].

Group 2: ID-Patch Mechanism
- ID-Patch is a novel solution designed specifically for multi-person image generation, focusing on binding identity to position [6][21].
- The mechanism separates facial information into two key modules, allowing precise placement of individuals while preserving their unique identities [9][21].
- ID-Patch integrates various spatial conditions, such as pose and depth maps, enhancing its adaptability to complex scenes [10][21].

Group 3: Performance and Efficiency
- ID-Patch achieves superior identity resemblance (0.751) and identity-position matching (0.958), demonstrating its effectiveness at maintaining facial consistency and accurate placement [12].
- In generation speed, ID-Patch is the fastest among existing methods, producing an 8-person group photo in approximately 10 seconds, compared to nearly 2 minutes for OMG [17][15].
- ID-Patch remains robust as the number of faces increases, with only a slight decline in effectiveness [14][21].

Group 4: Future Directions
- Facial feature representation could be further improved by incorporating diverse images of the same identity under varying lighting and expressions [20].
- Future work may include enhancing facial fidelity through multi-angle images and achieving joint control over position and expression using the patch technique [22].
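The identity-position binding idea described in Group 2 can be pictured as decoding each face embedding into a small spatial patch and pasting it onto the spatial condition at that person's location, so identity i can only influence position i. The patch decoder, shapes, and placement scheme below are illustrative assumptions, not ID-Patch's actual implementation.

```python
import torch
import torch.nn as nn

class IDPatchPlacer(nn.Module):
    """Sketch of position-identity binding: each face embedding is decoded
    into a small spatial "ID patch" that is pasted onto the control image
    (e.g., a pose map) at that person's location. Shapes, the decoder, and
    the placement scheme are illustrative assumptions."""
    def __init__(self, emb_dim=512, patch=32, channels=3):
        super().__init__()
        self.decode = nn.Linear(emb_dim, channels * patch * patch)
        self.patch, self.channels = patch, channels

    def forward(self, control_img, face_embs, positions):
        # control_img: (C, H, W) pose/depth condition
        # positions: [(y, x), ...] top-left corner of each face region
        out = control_img.clone()
        p = self.patch
        for emb, (y, x) in zip(face_embs, positions):
            tile = self.decode(emb).view(self.channels, p, p)
            out[:, y:y + p, x:x + p] = tile      # bind identity to position
        return out

placer = IDPatchPlacer()
cond = torch.zeros(3, 512, 512)                  # blank spatial condition
faces = torch.randn(2, 512)                      # two identity embeddings
print(placer(cond, faces, [(100, 80), (100, 300)]).shape)   # torch.Size([3, 512, 512])
```

Because each patch occupies a disjoint region of the condition image, feature blending between identities (the "identity leakage" the article describes) is structurally discouraged.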