Diffusion Models
Ditching autoregression! A Chinese team builds LLaDA-V, a pure-diffusion multimodal large model, setting a new SOTA on understanding tasks
机器之心· 2025-05-27 03:23
Core Viewpoint
- The article discusses the development of LLaDA-V, a pure diffusion multimodal large language model (MLLM) that integrates visual instruction tuning, marking a significant breakthrough in multimodal understanding compared with traditional autoregressive methods [1][16].

Group 1: Model Development
- The research team extended LLaDA into the multimodal domain, introducing LLaDA-V, which uses a visual encoder (SigLIP 2) and an MLP connector to project visual features into the language embedding space, achieving effective multimodal alignment [2].
- LLaDA-V employs a discrete diffusion mechanism during both training and sampling, moving away from the autoregressive paradigm [2].

Group 2: Performance Highlights
- LLaDA-V demonstrates strong data scalability and competitive performance, outperforming the autoregressive baseline LLaMA3-V on 11 multimodal tasks, even though LLaDA-8B is slightly weaker than LLaMA3-8B on pure text tasks [5].
- The model achieves state-of-the-art (SOTA) performance on multimodal understanding tasks compared with existing hybrid autoregressive-diffusion models, validating the effectiveness of building MLLMs on powerful language diffusion models [8].
- LLaDA-V significantly narrows the performance gap with top autoregressive MLLMs, achieving comparable results on benchmarks such as MMStar [10].

Group 3: Core Methodology
- The core of LLaDA-V lies in combining visual instruction tuning with LLaDA's masked diffusion mechanism, yielding a robust training and inference process [13][15].
- The architecture follows the classic "visual encoder + MLP projector + language model" setup: the visual encoder extracts image features, and the MLP projector maps them into LLaDA's embedding space [15].
- LLaDA-V's training objective supports multi-turn multimodal dialogue by masking only the model's responses during training, optimizing its ability to generate coherent replies [15].

Group 4: Future Outlook
- The successful integration of visual instruction tuning with masked diffusion models opens a new technical pathway for MLLM development, challenging the notion that multimodal intelligence must rely on autoregressive models [16].
- As language diffusion models continue to advance, they are expected to play a larger role and to push the boundaries of multimodal AI further [16].
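A minimal sketch of what "masking only the model's responses" can look like in a masked-diffusion training objective, assuming a LLaDA-style 1/t-weighted cross-entropy over masked positions; the function name and the model call signature (per-token logits of shape batch × length × vocab) are illustrative assumptions, not the released training code:

```python
import torch
import torch.nn.functional as F

def response_masked_diffusion_loss(model, input_ids, response_mask, mask_token_id):
    """Sketch of a LLaDA-style masked-diffusion step for visual instruction tuning:
    only response tokens may be masked; the prompt (text plus projected image
    features) is always left clean. Details may differ from the actual LLaDA-V code."""
    b, n = input_ids.shape
    t = torch.rand(b, 1, device=input_ids.device).clamp(min=1e-3)   # per-sample masking ratio
    # Mask each response token independently with probability t; prompt tokens are never masked.
    masked = (torch.rand(b, n, device=input_ids.device) < t) & response_mask.bool()
    noisy_ids = torch.where(masked, torch.full_like(input_ids, mask_token_id), input_ids)
    logits = model(noisy_ids)                      # bidirectional transformer, no causal mask
    ce = F.cross_entropy(logits.transpose(1, 2), input_ids, reduction="none")
    # Average the 1/t-weighted loss over the tokens that were actually masked.
    loss = ((ce / t) * masked).sum() / masked.sum().clamp(min=1)
    return loss
```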
10,000 tokens in 12 seconds! Google unveils the text "diffusion model" Gemini Diffusion; researchers say the demo has to be slowed down to watch
量子位· 2025-05-21 10:39
Core Viewpoint
- Google DeepMind has introduced Gemini Diffusion, a new language model that uses diffusion techniques to significantly improve text generation speed and quality compared with traditional autoregressive models [1][4][9].

Group 1: Technology and Performance
- Gemini Diffusion can generate text at 2000 tokens per second, faster than the previous model, Gemini 2.0 Flash-Lite [7][11].
- Rather than predicting text token by token, the model learns to generate output by progressively refining noise, which allows rapid iteration and error correction during generation [6][10][15].
- Unlike traditional models that generate one token at a time, Gemini Diffusion can generate entire blocks of tokens simultaneously, resulting in more coherent responses [14][9].

Group 2: Benchmarking and Comparisons
- Benchmark tests show that Gemini Diffusion performs comparably to larger models, with specific metrics indicating it outperforms Gemini 2.0 Flash-Lite in several coding tasks [8].
- For example, on the HumanEval benchmark, Gemini Diffusion scored 76.0%, slightly higher than Gemini 2.0 Flash-Lite's 75.8% [8].

Group 3: Implications and Future Directions
- The introduction of diffusion techniques into language models suggests a potential shift towards more hybrid models in the future, as seen in similar research by other institutions [19][20].
- The ability to perform non-causal reasoning during text generation opens up new possibilities for complex problem-solving tasks that traditional autoregressive models struggle with [16][17].
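Google has not published Gemini Diffusion's decoding algorithm. The sketch below shows a generic block-wise parallel refinement loop of the kind used by discrete text-diffusion models, to make the "generate a whole block at once, then iterate" idea concrete; the function name, confidence-based unmasking schedule, and model call signature are all assumptions:

```python
import torch

@torch.no_grad()
def parallel_block_decode(model, prompt_ids, block_len=64, steps=8, mask_id=0):
    """Generic block-wise parallel decoding sketch (not Gemini Diffusion's actual method).
    The block starts as all [MASK] tokens and is refined over a few steps, keeping the
    most confident predictions and re-predicting the rest. mask_id is assumed to be a
    dedicated mask token id that the model never outputs."""
    device = prompt_ids.device
    block = torch.full((1, block_len), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(torch.cat([prompt_ids, block], dim=1))[:, -block_len:]
        conf, pred = logits.softmax(-1).max(-1)          # confidence and argmax per position
        still_masked = block.eq(mask_id)
        # Unmask a growing fraction of positions each step, most confident first.
        k = int(block_len * (step + 1) / steps) - int((~still_masked).sum())
        if k > 0:
            conf = conf.masked_fill(~still_masked, -1.0)  # never re-select finished positions
            idx = conf.topk(k, dim=-1).indices
            block.scatter_(1, idx, pred.gather(1, idx))
    return block
```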
New work from Kaiming He et al. keeps it simple: replacing instantaneous velocity with average velocity lifts one-step generation performance by 70%
量子位· 2025-05-21 06:31
Core Viewpoint
- The article discusses the introduction of a new model called MeanFlow, which uses average velocity to achieve a one-step generation framework, significantly improving the state of the art (SOTA) in image generation tasks [1][5][10].

Group 1: Model Development
- The MeanFlow model is trained from scratch without any pre-training, distillation, or curriculum learning, achieving a Fréchet Inception Distance (FID) score of 3.43, a notable improvement over previous one-step diffusion/flow models [3][10][13].
- The model introduces the concept of average velocity to represent flow fields, contrasting with the instantaneous velocity used in flow matching methods [5][9].

Group 2: Experimental Results
- Experiments on ImageNet at 256×256 resolution showed that MeanFlow achieved a 50% to 70% relative advantage over previous state-of-the-art methods in terms of FID [13][19].
- The model's performance was evaluated through an ablation study showing various configurations and their corresponding FID scores, with the best results achieved under specific parameter settings [15][19].

Group 3: Scalability and Comparison
- MeanFlow exhibits good scalability in model size, with different configurations yielding competitive FID scores compared with other generative models [16][19].
- A comparison with other generative models indicates that MeanFlow significantly narrows the gap between one-step diffusion/flow models and their multi-step predecessors [19][20].

Group 4: Research Team and Background
- The research was conducted by a team from MIT and CMU, including PhD student Geng Zhengyang and other students of Kaiming He [21][22][23].
- The team aims to bridge the gap between generative modeling and simulations in physics, addressing multi-scale simulation problems [20].
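A sketch of the relation behind "average velocity", reconstructed from the definitions summarized above (flow-matching instantaneous velocity v versus an interval-averaged velocity u); the notation may differ slightly from the paper:

```latex
% Average velocity over [r, t] versus the instantaneous velocity v used in flow matching:
%   u(z_t, r, t) \;=\; \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau .
% Differentiating (t - r)\,u(z_t, r, t) with respect to t gives the identity
%   u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\, \frac{d}{dt} u(z_t, r, t),
% which yields a training target without integrating the ODE. One-step sampling is then
%   z_r \;=\; z_t \;-\; (t - r)\, u(z_t, r, t), \qquad (r, t) = (0, 1).
```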
TransDiffuser: the architecture Li Auto's VLA uses to generate trajectories with diffusion
理想TOP2· 2025-05-18 13:08
Core Viewpoint
- The article discusses advances in autonomous driving, focusing on the diffusion model and its application to generating driving trajectories, and highlighting the differences between VLM and VLA systems [1][4].

Group 1: Diffusion Model Explanation
- Diffusion is a generative model that learns a data distribution through a process of adding noise (forward process) and removing noise (reverse process), akin to solving a puzzle in reverse [4].
- The denoising process involves training a neural network to predict and remove noise, ultimately generating the target data [4].
- Diffusion generates not only the ego vehicle's trajectory but also predicts the trajectories of other vehicles and pedestrians, improving decision-making in complex traffic environments [5].

Group 2: VLM and VLA Systems
- VLM consists of two systems: System 1 uses imitation learning to output trajectories without semantic understanding, while System 2 has semantic understanding but only provides suggestions [2].
- VLA is a single system with both fast and slow thinking capabilities, inherently possessing semantic reasoning [2].
- VLA outputs action tokens that encode the vehicle's driving behavior and surrounding environment, which are then decoded into driving trajectories using the diffusion model [4][5].

Group 3: TransDiffuser Architecture
- TransDiffuser is an end-to-end trajectory generation model that integrates multimodal perception information to produce high-quality, diverse trajectories [6][7].
- The architecture includes a Scene Encoder for processing multimodal data and a Denoising Decoder that uses the DDPM framework for trajectory generation [7][9].
- The model employs a multi-head cross-attention mechanism to fuse scene and motion features during denoising [9].

Group 4: Performance and Innovations
- The model achieves a Predictive Driver Model Score (PDMS) of 94.85, outperforming existing methods [11].
- Key innovations include anchor-free trajectory generation and a multimodal representation decorrelation optimization mechanism that enhances trajectory diversity and reduces redundancy [11][12].

Group 5: Limitations and Future Directions
- The authors note challenges in fine-tuning the model, particularly the perception encoder [13].
- Future directions involve integrating reinforcement learning and referencing models such as OpenVLA for further advances [13].
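A minimal sketch of the denoising-decoder idea summarized in Group 3: noisy trajectory waypoints attend to encoded scene features through multi-head cross-attention and the network predicts the added DDPM noise. Layer sizes, module names, and the exact conditioning are assumptions rather than the paper's specification:

```python
import torch
import torch.nn as nn

class DenoisingTrajectoryDecoder(nn.Module):
    """Hedged sketch of a TransDiffuser-style denoising decoder: trajectory queries
    are fused with scene/motion features via cross-attention; output is the
    predicted noise on each waypoint (standard DDPM epsilon-prediction)."""
    def __init__(self, d_model=256, n_heads=8, traj_dim=2):
        super().__init__()
        self.traj_in = nn.Linear(traj_dim, d_model)
        self.time_emb = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.noise_out = nn.Linear(d_model, traj_dim)

    def forward(self, noisy_traj, t, scene_feats):
        # noisy_traj: (B, horizon, 2) waypoints; t: (B,) diffusion step; scene_feats: (B, N, d_model)
        h = self.traj_in(noisy_traj) + self.time_emb(t.float().unsqueeze(-1)).unsqueeze(1)
        h = h + self.cross_attn(h, scene_feats, scene_feats)[0]   # trajectory queries attend to the scene
        h = h + self.ffn(h)
        return self.noise_out(h)                                  # predicted noise per waypoint
```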
Switch the lights on and off with one click! Google pushes film-grade lighting control to the limit with a diffusion model
机器之心· 2025-05-16 04:39
机器之心 report. Editors: 刘欣, +0

Google recently released LightLab, a project that gives precise control over the lighting in an image. It lets users perform fine-grained parametric control of light sources from a single image: changing the intensity and color of visible light sources, adjusting the intensity of ambient light, and even inserting virtual light sources into the scene.

Take film as an example: in a good film, light subtly shapes a character's mood, sets the atmosphere of the story, guides the viewer's gaze, and can even reveal a character's inner world. Yet whether in traditional photographic post-processing or in adjustments after digital rendering, precisely controlling the direction, color, and intensity of light has always been a time-consuming, labor-intensive challenge that depends heavily on experience.

Existing lighting-editing techniques either require many photos to work (and so cannot handle a single image) or, if they can edit, give no precise control over the change (for example, exactly how much brighter, or what color).

Google's research team fine-tuned a diffusion model on a specially constructed dataset so that it learns to precisely control the lighting in an image.

LightLab: Controlling Light Sources in Images with Diffusion Models
Paper: https://arxiv.org/abs/2505.09608
Project page: https://nadmag.github. ...
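The article does not describe how the light parameters are fed to the fine-tuned diffusion model, so the following is only a minimal sketch, assuming the parameters (target intensity, RGB color, ambient level) are broadcast to spatial maps and concatenated with the source image as extra input channels; the function name and this conditioning scheme are illustrative assumptions, not the paper's method:

```python
import torch

def build_light_condition(source_img, target_intensity, target_rgb, ambient):
    """Hypothetical conditioning input for a parametric relighting fine-tune:
    scalars broadcast to (B, 5, H, W) maps and concatenated with the image."""
    b, _, h, w = source_img.shape
    params = torch.tensor([target_intensity, *target_rgb, ambient],
                          device=source_img.device, dtype=source_img.dtype)   # 5 scalars
    param_maps = params.view(1, -1, 1, 1).expand(b, -1, h, w)                 # spatial parameter maps
    return torch.cat([source_img, param_maps], dim=1)                         # extra input channels for the UNet
```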
CVPR 2025 | How can personalized multi-person images be generated stably and efficiently? ID-Patch offers a new solution
机器之心· 2025-05-03 04:18
Core Viewpoint
- The article discusses the advancements and challenges in personalized multi-person image generation using diffusion models, highlighting the innovative ID-Patch mechanism that addresses identity leakage and enhances accuracy in positioning and identity representation [1][5][21].

Group 1: Challenges in Multi-Person Image Generation
- Personalized single-person image generation has achieved impressive visual results, but generating images with multiple people introduces new complexities [4].
- Identity leakage is a significant challenge: features of different individuals can blend, making them difficult to distinguish [2][4].
- Existing methods such as OMG and InstantFamily have attempted to tackle identity confusion but face limitations in efficiency and accuracy, especially as the number of individuals increases [4][14].

Group 2: ID-Patch Mechanism
- ID-Patch is a novel solution designed specifically for multi-person image generation, focusing on binding identity to position [6][21].
- The mechanism separates facial information into two key modules, allowing precise placement of individuals while maintaining their unique identities [9][21].
- ID-Patch integrates various spatial conditions, such as pose and depth maps, enhancing its adaptability to complex scenes [10][21].

Group 3: Performance and Efficiency
- ID-Patch demonstrates superior performance in identity resemblance (0.751) and identity-position matching (0.958), showing its effectiveness in maintaining facial consistency and accurate placement [12].
- In terms of generation speed, ID-Patch is the fastest among existing methods, generating an 8-person group photo in roughly 10 seconds, compared with nearly 2 minutes for OMG [17][15].
- Performance remains robust as the number of faces increases, with only a slight decline in effectiveness [14][21].

Group 4: Future Directions
- Facial feature representation could be further improved by incorporating diverse images of the same identity under varying lighting and expressions [20].
- Future explorations may include enhancing facial fidelity through multi-angle images and achieving joint control over position and expression using patch technology [22].
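The summary describes two branches (a spatially placed identity patch bound to a position on the control image, plus identity features for the generator) without giving the exact interface. Below is a minimal sketch under those assumptions; the module name, dimensions, and patch-stamping scheme are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn

class IDPatchConditioner(nn.Module):
    """Hedged sketch of the two-branch idea: one projection turns a face embedding into
    a small spatial 'ID patch' stamped onto the control image (e.g. a pose map) at that
    person's position; the other produces identity tokens for cross-attention."""
    def __init__(self, face_dim=512, patch_size=16, token_dim=768, n_tokens=4):
        super().__init__()
        self.to_patch = nn.Linear(face_dim, 3 * patch_size * patch_size)
        self.to_tokens = nn.Linear(face_dim, n_tokens * token_dim)
        self.patch_size, self.token_dim, self.n_tokens = patch_size, token_dim, n_tokens

    def forward(self, control_img, face_embeds, positions):
        # control_img: (3, H, W) pose/depth map; face_embeds: (P, face_dim);
        # positions: (P, 2) top-left pixel coordinates, assumed to fit inside the image.
        ps = self.patch_size
        for emb, (y, x) in zip(face_embeds, positions):
            patch = self.to_patch(emb).view(3, ps, ps)
            control_img[:, y:y + ps, x:x + ps] = patch        # bind each identity to its location
        id_tokens = self.to_tokens(face_embeds).view(-1, self.n_tokens, self.token_dim)
        return control_img, id_tokens                          # spatial condition + cross-attention tokens
```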
阶跃星辰 (StepFun) open-sources the image-editing model Step1X-Edit; Alibaba's flagship AI app 夸克 (Quark) launches a new "AI Camera" | AIGC Daily
创业邦· 2025-04-27 23:48
3. [Meta's Token-Shuffle debuts: autoregressive models break through a bottleneck, enabling AI generation of 2048×2048 images] Reports say Meta AI has introduced Token-Shuffle, aimed at solving the scaling problem that autoregressive (AR) models face when generating high-resolution images. AR models have excelled at language generation and have been widely explored for image synthesis in recent years, but they hit a bottleneck with high-resolution images. Unlike text generation, which needs only a modest number of tokens, high-resolution image synthesis often requires thousands of tokens, and the compute cost explodes accordingly. As a result, many AR-based multimodal models can only handle low- to mid-resolution images, limiting their use in fine-grained image generation. Diffusion models, by contrast, perform strongly at high resolution, but their complex sampling process and slower inference also have limitations. (Sohu)

4. [Adobe releases Firefly Image Model 4: AI image generation upgraded again] In a blog post, Adobe introduced two text-to-image AI models, Firefly Image Model 4 and Firefly Image Model 4 Ultra, and previewed, for Photoshop and Illustrator, the Crea ...
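A quick back-of-the-envelope calculation behind the token bottleneck described in item 3, assuming one token per 16×16-pixel patch (the patch size is an illustrative assumption, not Meta's actual tokenizer setting):

```python
def image_tokens(resolution: int, patch: int = 16) -> int:
    """Number of tokens for a square image at a given resolution and patch size."""
    return (resolution // patch) ** 2

print(image_tokens(256))    # 256 tokens   - comfortable for an AR decoder
print(image_tokens(2048))   # 16384 tokens - quadratic attention cost explodes
```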
ICLR 2025 | 20× speedup with no extra training: Tsinghua's Zhu Jun group proposes DBIM, an inference algorithm for diffusion bridge models in image translation
机器之心· 2025-04-27 10:40
The paper has two co-first authors: 郑凯文 (Zheng Kaiwen), a third-year PhD student in Tsinghua University's Department of Computer Science, and 何冠德 (He Guande), a first-year PhD student at UT Austin.

Diffusion models have made breakthrough progress on generative tasks in recent years, achieving outstanding results in image generation, video synthesis, and speech synthesis, and driving technical advances in text-to-image and video generation. However, standard diffusion models are designed for generating data from random noise, and are ill-suited to tasks such as image translation or image restoration, where the mapping between a given input and output is explicit.

To address this, a variant called Denoising Diffusion Bridge Models (DDBMs) emerged. DDBMs model the bridging process between two given distributions and therefore apply well to image translation, image restoration, and similar tasks. Mathematically, however, these models rely on complex ordinary/stochastic differential equations and typically need hundreds of iterative steps to generate high-resolution images; this low computational efficiency severely limits their practical adoption.

Compared with standard diffusion models, the inference process of diffusion bridge models additionally involves a linear combination dependent on the initial condition and a singularity at the starting point, so standard diffusion-model inference algorithms cannot be applied directly. To solve this, Zhu Jun's team at Tsinghua University proposed a method called diffusion bridge implicit mo ...
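For intuition on the "linear combination dependent on the initial condition" and the endpoint singularity mentioned above, the simplest Brownian-bridge case can be written down explicitly (a hedged illustration of the general bridge structure, not DBIM's actual derivation):

```latex
% Brownian bridge on [0, T] between the data sample x_0 and the given condition x_T:
%   q(x_t \mid x_0, x_T) \;=\; \mathcal{N}\!\Big(x_t;\; \tfrac{T - t}{T}\, x_0 + \tfrac{t}{T}\, x_T,\; \tfrac{t\,(T - t)}{T}\, I\Big).
% Every intermediate state mixes the data endpoint x_0 with the condition x_T (the
% initial-condition-dependent linear combination), and the variance vanishes as t \to T,
% i.e. at the starting point of reverse sampling, which is the singularity that prevents
% standard diffusion inference algorithms from being reused directly.
```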
"Computer vision has been finished off by GPT-4o" (tongue in cheek)
量子位· 2025-03-29 07:46
Core Viewpoint
- The article discusses the advances in computer vision (CV) and image generation brought by the new GPT-4o model, highlighting its potential to disrupt existing tools and methodologies in the field [1][2].

Group 1: Technological Advancements
- GPT-4o introduces native multimodal image generation, expanding the functionality of AI tools beyond traditional applications [2][12].
- GPT-4o's image generation is based on an autoregressive model, unlike the diffusion model used in DALL·E, which allows better instruction following and stronger image editing capabilities [15][19].
- Observations suggest the image generation may involve a multi-scale autoregressive scheme, in which a rough image is generated first and details are filled in while the rough shape evolves [17][19].

Group 2: Industry Impact
- The advances in GPT-4o's capabilities have raised concerns among designers and computer vision researchers, indicating a significant shift in the competitive landscape of AI tools [6][10].
- OpenAI's approach of scaling foundation models to achieve these capabilities has surprised many in the industry, suggesting a new trend in AI development [12][19].
- The potential for GPT-4o to enhance applications in autonomous driving has been noted, with implications for future developments in this sector [10].

Group 3: Community Engagement
- The article encourages community members to share their experiences and innovative uses of GPT-4o, fostering a collaborative environment for exploring AI applications [26].
Event registration: we've gathered the authors of LCM, InstantID, and AnimateDiff for a sharing session
42章经· 2024-05-26 14:35
42章经 AI private advisory-board event — Text-to-Image and Text-to-Video: From Research to Application

LCM, InstantID, and AnimateDiff are three research works of enormous global significance and influence; over the past year they have arguably brought major breakthroughs, or real deployability, to text-to-image and text-to-video, and a great many founders are already using their results in practice.

This time we have, for the first time, brought the authors of all three works together, and have also invited the well-known AI product manager Hidecloud to moderate the panel. We look forward to discussing the latest research and real-world applications in text-to-image and text-to-video with dozens of AI founders.

Speakers
· 骆思勉 — Master's student at Tsinghua's Institute for Interdisciplinary Information Sciences; research focuses on multimodal generation, diffusion models, and consistency models; representative works include LCM, LCM-LoRA, and Diff-Foley
· 王浩帆 — Master's degree from CMU; member of the InstantX team; research focuses on consistent generation; representative works include InstantStyle, InstantID, and Score-CAM
· 杨策元 — PhD from the Chinese University of Hong Kong; research focuses on video generation

Time: June 1, 13:00-14:00 Beijing time (Saturday) / May 31, 22:00-23:00 US Pacific time (Friday)
Format: Online (the meeting link will be sent individually) ...