Diffusion Models
10,000 tokens in 12 seconds! Google unveils the text "diffusion model" Gemini Diffusion; researchers: even the demo has to be slowed down to watch
量子位· 2025-05-21 10:39
Core Viewpoint - Google DeepMind has introduced Gemini Diffusion, a new language model that utilizes diffusion technology to significantly enhance text generation speed and quality compared to traditional autoregressive models [1][4][9].
Group 1: Technology and Performance
- Gemini Diffusion can generate text at a speed of 2,000 tokens per second, faster than the previous model, Gemini 2.0 Flash-Lite [7][11].
- The model learns to produce output by iteratively refining noise, allowing rapid iteration and error correction during the generation process [6][10][15].
- Unlike traditional models that generate one token at a time, Gemini Diffusion can generate entire blocks of tokens simultaneously, resulting in more coherent responses [14][9].
Group 2: Benchmarking and Comparisons
- Benchmark tests show that Gemini Diffusion performs comparably to larger models, with specific metrics indicating it outperforms Gemini 2.0 Flash-Lite in several coding tasks [8].
- For example, on the HumanEval benchmark, Gemini Diffusion achieved a score of 76.0%, slightly higher than Gemini 2.0 Flash-Lite's 75.8% [8].
Group 3: Implications and Future Directions
- The introduction of diffusion technology in language models suggests a potential shift toward more hybrid models in the future, as seen in similar research by other institutions [19][20].
- The ability to perform non-causal reasoning during text generation opens up new possibilities for complex problem-solving tasks that traditional autoregressive models struggle with [16][17].
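Gemini Diffusion's internals are not public, but the parallel-refinement idea described above (start from a fully masked sequence and sharpen every position over a handful of passes, rather than emitting tokens one by one) can be sketched with a toy denoiser. Everything here is illustrative: the "model" is stood in for by the target string itself, and positions to reveal are chosen at random rather than by predicted confidence.

```python
import random

MASK = "_"

def toy_denoise_step(current, target, k):
    """Reveal up to k masked positions (stand-in for one learned denoising pass)."""
    masked = [i for i, c in enumerate(current) if c == MASK]
    # A real model would pick positions by predicted confidence; we sample randomly.
    for i in random.sample(masked, min(k, len(masked))):
        current[i] = target[i]
    return current

def generate(target, steps=4):
    """Iteratively refine a fully masked sequence, many tokens per step."""
    current = [MASK] * len(target)
    k = -(-len(target) // steps)  # ceil division: positions revealed per step
    for _ in range(steps):
        current = toy_denoise_step(current, target, k)
    return "".join(current)

random.seed(0)
print(generate("diffusion decodes blocks in parallel"))
```

The point of the sketch is the loop structure: a fixed, small number of whole-sequence passes, each of which can revisit and commit many positions at once, which is where the speed claim comes from.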
New work from Kaiming He and colleagues keeps it simple: swapping instantaneous velocity for average velocity lifts one-step generation performance by 70%
量子位· 2025-05-21 06:31
Core Viewpoint - The article discusses the introduction of a new model called MeanFlow, which utilizes average velocity to achieve a one-step generation framework, significantly improving the state of the art (SOTA) in image generation tasks [1][5][10].
Group 1: Model Development
- The MeanFlow model is trained from scratch without any pre-training, distillation, or curriculum learning, achieving a Fréchet Inception Distance (FID) score of 3.43, a notable improvement over previous one-step diffusion/flow models [3][10][13].
- The model introduces the concept of average velocity to represent flow fields, in contrast to the instantaneous velocity used in flow matching methods [5][9].
Group 2: Experimental Results
- Experiments on ImageNet at a resolution of 256×256 demonstrated that MeanFlow achieves a 50% to 70% relative advantage in FID over previous state-of-the-art methods [13][19].
- The model's performance was evaluated through an ablation study covering various configurations and their corresponding FID scores, with the best results achieved under specific parameter settings [15][19].
Group 3: Scalability and Comparison
- MeanFlow exhibits good scalability with model size, with different configurations yielding FID scores competitive with other generative models [16][19].
- A comparison with other generative models indicates that MeanFlow significantly narrows the gap between one-step diffusion/flow models and their multi-step predecessors [19][20].
Group 4: Research Team and Background
- The research was conducted by a team from MIT and CMU, including notable contributors such as PhD student Geng Zhengyang and other students of He Kaiming [21][22][23].
- The team aims to bridge the gap between generative modeling and simulation in physics, addressing multi-scale simulation problems [20].
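The average-versus-instantaneous velocity distinction can be illustrated with a toy 1-D flow whose velocity field is known in closed form. This is a sketch of the general idea only, not the paper's training objective: here the instantaneous velocity is v(t) = 2t, so its average over [r, t] is (t² − r²)/(t − r) = t + r, and a network that has learned the average velocity can jump from t=1 to t=0 exactly in one step, where Euler integration of the instantaneous velocity needs many steps.

```python
# Toy 1-D flow: position z(t) = z(0) + t**2, so instantaneous velocity is 2t
# and the average velocity over [r, t] is (t**2 - r**2)/(t - r) = t + r.

def v_inst(t):           # instantaneous velocity (what flow matching learns)
    return 2.0 * t

def u_avg(r, t):         # average velocity (what MeanFlow learns)
    return t + r

def euler_many_steps(z1, steps):
    """Integrate from t=1 back to t=0 using the instantaneous velocity."""
    z, t, dt = z1, 1.0, 1.0 / steps
    for _ in range(steps):
        z -= v_inst(t) * dt
        t -= dt
    return z

def one_step(z1):
    """MeanFlow-style single jump: z0 = z1 - (1 - 0) * u(0, 1)."""
    return z1 - u_avg(0.0, 1.0)

z1 = 5.0                 # sample at t = 1
exact_z0 = z1 - 1.0      # z(0) = z(1) - 1, since z(1) = z(0) + 1
print(one_step(z1), exact_z0, euler_many_steps(z1, 10))
```

In this toy, the one-step jump is exact while 10-step Euler still carries discretization error; that is the structural advantage average velocity offers a one-step sampler.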
TransDiffuser: Li Auto's VLA architecture for generating trajectories with diffusion
理想TOP2· 2025-05-18 13:08
Core Viewpoint - The article discusses advancements in autonomous driving, particularly the Diffusion model and its application in generating driving trajectories, highlighting the differences between VLM and VLA systems [1][4].
Group 1: Diffusion Model Explanation
- Diffusion is a generative model that learns the data distribution through a process of adding noise (Forward Process) and removing noise (Reverse Process), akin to solving a puzzle in reverse [4].
- The model's denoising process involves training a neural network to predict and remove noise, ultimately generating the target data [4].
- Diffusion not only generates the ego vehicle's trajectory but also predicts the trajectories of other vehicles and pedestrians, enhancing decision-making in complex traffic environments [5].
Group 2: VLM and VLA Systems
- VLM consists of two systems: System 1 mimics learned behavior to output trajectories without semantic understanding, while System 2 has semantic understanding but only provides suggestions [2].
- VLA is a single system with both fast and slow thinking capabilities, inherently possessing semantic reasoning [2].
- VLA outputs action tokens that encode the vehicle's driving behavior and surrounding environment, which the Diffusion model then decodes into driving trajectories [4][5].
Group 3: TransDiffuser Architecture
- TransDiffuser is an end-to-end trajectory generation model that integrates multi-modal perception information to produce high-quality, diverse trajectories [6][7].
- The architecture includes a Scene Encoder for processing multi-modal data and a Denoising Decoder that utilizes the DDPM framework for trajectory generation [7][9].
- The model employs a multi-head cross-attention mechanism to fuse scene and motion features during the denoising process [9].
Group 4: Performance and Innovations
- The model achieves a Predictive Driver Model Score (PDMS) of 94.85, outperforming existing methods [11].
- Key innovations include anchor-free trajectory generation and a multi-modal representation decorrelation optimization mechanism to enhance trajectory diversity and reduce redundancy [11][12].
Group 5: Limitations and Future Directions
- The authors note challenges in fine-tuning the model, particularly the perception encoder [13].
- Future directions involve integrating reinforcement learning and referencing models like OpenVLA for further advancements [13].
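As background for the Denoising Decoder's DDPM framework, the standard DDPM forward (noising) process can be sketched in a few lines. This is a minimal, generic sketch on toy 1-D waypoints standing in for a trajectory; the linear beta schedule and all constants are illustrative, not the paper's settings.

```python
import math, random

# Minimal DDPM forward process on toy waypoints (a stand-in for the
# trajectories TransDiffuser denoises; schedule values are illustrative).
T = 100
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)  # cumulative product: how much signal survives at step t

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0): keep sqrt(alpha_bar) of the signal, add noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1 - a) * rng.gauss(0, 1) for v in x0]

rng = random.Random(0)
trajectory = [0.0, 0.5, 1.0, 1.5, 2.0]   # toy waypoints
print(q_sample(trajectory, 0, rng))       # nearly clean
print(q_sample(trajectory, T - 1, rng))   # heavily noised
```

The decoder's job at inference is the reverse: starting from pure noise, repeatedly predict and subtract the noise until a clean trajectory remains.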
Switch the lights with one click! Google takes cinematic lighting control to the extreme with a diffusion model
机器之心· 2025-05-16 04:39
Core Viewpoint - Google has launched LightLab, a project that allows precise control over lighting in images, enabling users to adjust light source intensity and color and to insert virtual light sources into scenes [1][2].
Group 1: Technology and Methodology
- LightLab utilizes a fine-tuned diffusion model trained on a specially constructed dataset to achieve precise control over lighting in images [7][11].
- The dataset combines real images with controlled lighting changes and synthetic images generated by a physical renderer, allowing the model to learn complex lighting effects [10].
- The model can simulate indirect lighting, shadows, and reflections, providing a photorealistic prior for lighting control [10][11].
Group 2: Data Collection and Processing
- The research team captured 600 pairs of original photos depicting the same scene with a single light source turned on and off, ensuring good exposure through automatic settings [22][23].
- The dataset was expanded to approximately 36,000 images through post-processing to cover a range of intensities and colors [27].
- The team employed a consistent tone-mapping strategy and separated target light source changes from ambient light in the images [17][18].
Group 3: Model Training and Evaluation
- The model was trained for 45,000 steps at a resolution of 1024×1024, using a learning rate of 10^-5 and a batch size of 128, taking about 12 hours on 64 v4 TPUs [28].
- Evaluation metrics included Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), with user studies conducted to validate the results [29].
- The model demonstrated superior performance compared to previous methods, achieving a PSNR of 23.2 and an SSIM of 0.818 [31][33].
Group 4: Applications and Features
- LightLab offers a range of lighting control features, allowing users to adjust light source intensity and color interactively [12][38][41].
- The technology enables the insertion of virtual point light sources into scenes, enhancing creative possibilities [44].
- The separation of target light sources from ambient light allows for control over natural light entering a scene, which is typically challenging to manage [45].
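For reference, the PSNR metric used in the evaluation above has a standard closed-form definition that is easy to compute directly. A self-contained sketch (the example pixel values are made up; images are flattened to lists for simplicity):

```python
import math

def psnr(img_a, img_b, max_val=1.0):
    """Peak Signal-to-Noise Ratio between two images given as flat pixel lists.
    Higher is better; identical images give infinity."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)

clean = [0.2, 0.4, 0.6, 0.8]     # hypothetical reference pixels
noisy = [0.21, 0.39, 0.62, 0.78] # hypothetical reconstruction
print(round(psnr(clean, noisy), 2))
```

A PSNR of 23.2 on a relighting task, as reported above, indicates moderate per-pixel agreement; SSIM complements it by scoring structural similarity rather than raw pixel error.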
DiffMoE: Dynamic token selection powers a leap in diffusion model performance; Kuaishou & Tsinghua team sets a new benchmark for visual generation!
机器之心· 2025-05-16 02:42
In generative AI, diffusion models have become the mainstream architecture for image generation tasks. However, traditional diffusion models treat different noise levels and conditional inputs uniformly, failing to exploit the heterogeneous nature of the diffusion process and leading to low computational efficiency. Recently, the Kling team released DiffMoE (Dynamic Token Selection for Scalable Diffusion Transformers), which extends the efficiency and performance frontier of diffusion models through an innovative dynamic token selection mechanism and a global token pool design.
This work was jointly completed by Tsinghua University and Kuaishou's Kling team. The first author is Shi Minglei, an undergraduate in Tsinghua University's Intelligent Vision Lab.
Core breakthrough: dynamic token selection and global context awareness
Paper title: DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
Project page: https://shiml20.github.io/DiffMoE/
Paper: https://arxiv.org/abs/2503.14487
Code: https://github.com/KwaiVGI/DiffMoE
Performance gains: a parameter-efficient model that achieves more with less
...
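DiffMoE's exact routing is specified in the paper; purely as a hedged illustration of the two named ingredients, dynamic token selection over a global (batch-flattened) token pool can be sketched as expert-choice routing, where each expert picks its own top-k tokens from the shared pool by router score, so compute concentrates on the tokens that need it instead of being spent uniformly per sample. The scores and capacity below are invented for the example.

```python
def route(scores, capacity):
    """scores[e][i]: router score of expert e for pooled token i.
    Returns, per expert, the sorted indices of its top-`capacity` tokens."""
    assignments = []
    for expert_scores in scores:
        ranked = sorted(range(len(expert_scores)),
                        key=lambda i: expert_scores[i], reverse=True)
        assignments.append(sorted(ranked[:capacity]))
    return assignments

# 6 pooled tokens (flattened across the whole batch), 2 experts,
# each expert processes only 3 tokens instead of all 6.
scores = [
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.3],   # expert 0's affinity per token
    [0.1, 0.9, 0.2, 0.8, 0.3, 0.7],   # expert 1's affinity per token
]
print(route(scores, capacity=3))
```

Because the pool spans the batch, a hard (high-noise) sample can receive more expert capacity than an easy one, which is the heterogeneity the flattened-pool design is meant to exploit.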
Why are today's AI product managers all losing money?
36Kr· 2025-05-06 01:50
Core Insights
- The current landscape of AI product management is characterized by a focus on iterative improvements rather than creating products from scratch, leading to instability and financial losses for AI product managers [1][21]
- The transformer model, while popular, is not necessarily the best architecture for AI applications, as it struggles with issues like hallucination and high training costs [2][5]
- The emergence of alternative models, such as diffusion models and the yan model, indicates a shift in the AI landscape, with potential implications for product design and functionality [3][5]
Group 1: AI Product Management Challenges
- AI product managers are primarily engaged in API integration rather than developing proprietary models, limiting their ability to innovate and compete [6][8]
- The high costs associated with AI model fine-tuning and infrastructure, including server costs and operational expenses, create significant barriers to profitability [9][10]
- The user acquisition process for AI products still relies on traditional internet marketing strategies, which may not be sufficient to differentiate AI offerings in a crowded market [10][12]
Group 2: User Perception and Market Dynamics
- The transition of AI from a novelty to a necessity has not yet been fully realized, as the productivity gains from AI tools remain unclear [15][20]
- Despite the potential of AI to assist in various tasks, the need for human oversight and correction limits the efficiency gains that users experience [17][21]
- The willingness of users to pay for AI services is low, as many seek free alternatives or are hesitant to invest in AI tools that do not demonstrate clear value [21][22]
CVPR 2025 | How can personalized multi-person images be generated stably and efficiently? ID-Patch offers a new solution
机器之心· 2025-05-03 04:18
Core Viewpoint - The article discusses the advancements and challenges in personalized multi-person image generation using diffusion models, highlighting the innovative ID-Patch mechanism that addresses identity leakage and enhances accuracy in positioning and identity representation [1][5][21].
Group 1: Challenges in Multi-Person Image Generation
- Personalized single-person image generation has achieved impressive visual effects, but generating images with multiple people introduces complexities [4].
- Identity leakage is a significant challenge: features of different individuals can blend, making it difficult to distinguish between them [2][4].
- Existing methods like OMG and InstantFamily have attempted to tackle identity confusion but face limitations in efficiency and accuracy, especially as the number of individuals increases [4][14].
Group 2: ID-Patch Mechanism
- ID-Patch is a novel solution designed specifically for multi-person image generation, focusing on binding identity and position [6][21].
- The mechanism separates facial information into two key modules, allowing for precise placement of individuals while maintaining their unique identities [9][21].
- ID-Patch integrates various spatial conditions, such as pose and depth maps, enhancing its adaptability to complex scenes [10][21].
Group 3: Performance and Efficiency
- ID-Patch demonstrates superior performance in identity resemblance (0.751) and identity-position matching (0.958), showcasing its effectiveness in maintaining facial consistency and accurate placement [12].
- In terms of generation speed, ID-Patch is the fastest among existing methods, generating an 8-person group photo in approximately 10 seconds, compared to nearly 2 minutes for OMG [17][15].
- Performance remains robust even as the number of faces increases, with only a slight decline in effectiveness [14][21].
Group 4: Future Directions
- There is potential for further improvement in facial feature representation by incorporating diverse images of the same identity under varying lighting and expressions [20].
- Future explorations may include enhancing facial fidelity through multi-angle images and achieving dual control over position and expression using patch technology [22].
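The core idea of binding identity to position can be mimicked in a highly simplified way: stamp a small identity-derived patch onto a spatial conditioning map at each face's location, so each identity is anchored to exactly one place. This is an illustrative sketch only, not the paper's implementation; the "identity patches" here are constant arrays standing in for learned embeddings.

```python
def blank(h, w):
    """An empty spatial conditioning map of height h and width w."""
    return [[0.0] * w for _ in range(h)]

def stamp(cond, patch, top, left):
    """Write `patch` into conditioning map `cond` at (top, left), in place.
    Each identity occupies its own region, so identities cannot blend."""
    for dy, row in enumerate(patch):
        for dx, v in enumerate(row):
            cond[top + dy][left + dx] = v
    return cond

cond = blank(6, 8)
id_a = [[1.0, 1.0], [1.0, 1.0]]   # stand-in for identity A's embedding patch
id_b = [[2.0, 2.0], [2.0, 2.0]]   # stand-in for identity B's embedding patch
stamp(cond, id_a, top=1, left=1)  # person A on the left
stamp(cond, id_b, top=1, left=5)  # person B on the right
print(cond[1])
```

Spatially disjoint stamping is the intuition behind why position binding suppresses identity leakage: each face region is conditioned on exactly one identity signal.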
StepFun open-sources the image editing model Step1X-Edit; Alibaba's flagship AI app Quark launches a brand-new "AI Camera" | AIGC Daily
创业邦· 2025-04-27 23:48
3. [Meta's Token-Shuffle arrives: autoregressive models break through a bottleneck, enabling AI generation of 2048×2048 images] Reports say Meta AI has introduced Token-Shuffle, aiming to solve the scaling difficulty that autoregressive (AR) models face in generating high-resolution images. AR models have excelled at language generation and in recent years have been widely explored for image synthesis, but they hit a bottleneck at high resolutions. Unlike text generation, which needs only a modest number of tokens, a high-resolution image in image synthesis often requires thousands of tokens, and the computational cost explodes accordingly. This has confined many AR-based multimodal models to low and medium resolutions, limiting their use in fine-grained image generation. Diffusion models perform strongly at high resolution, but their complex sampling procedures and slower inference have limitations of their own. (Sohu)
4. [Adobe releases the Firefly Image Model 4: AI image generation upgraded again] Adobe published a blog post introducing two text-to-image AI models, Firefly Image Model 4 and Firefly Image Model 4 Ultra, and previewed, for Photoshop and Illustrator, Crea ...
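Meta has not released Token-Shuffle's implementation here; the core idea as reported, merging spatially adjacent visual tokens so the expensive Transformer layers see far fewer of them, might be sketched as follows. The block size and channel-concatenation scheme are my assumptions for illustration (the operation resembles pixel shuffle applied to tokens); an inverse "unshuffle" would restore the full grid afterward.

```python
def token_shuffle(grid, s):
    """grid[y][x] is a token (list of channels). Returns a grid s-times smaller
    per side, where each token concatenates the channels of an s x s block,
    shrinking the sequence length by a factor of s**2."""
    h, w = len(grid), len(grid[0])
    out = []
    for y in range(0, h, s):
        row = []
        for x in range(0, w, s):
            fused = []
            for dy in range(s):
                for dx in range(s):
                    fused.extend(grid[y + dy][x + dx])
            row.append(fused)
        out.append(row)
    return out

# A 4x4 grid of 1-channel tokens, numbered 0..15 in raster order.
tokens = [[[y * 4 + x] for x in range(4)] for y in range(4)]
small = token_shuffle(tokens, 2)
print(len(small) * len(small[0]), len(tokens) * len(tokens[0]))  # 4 vs 16 tokens
```

Since self-attention cost grows quadratically with sequence length, a 2×2 merge cuts attention compute by roughly 16×, which is what makes thousands-of-token high-resolution AR generation tractable.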
ICLR 2025 | 20× speedup with no training needed: Tsinghua's Zhu Jun group proposes DBIM, a diffusion bridge model inference algorithm for image translation
机器之心· 2025-04-27 10:40
The paper has two co-first authors. Zheng Kaiwen is a third-year PhD student in the Department of Computer Science at Tsinghua University; He Guande is a first-year PhD student at the University of Texas at Austin (UT Austin).
Diffusion models have made breakthrough progress on generative tasks in recent years, achieving excellent results in image generation, video synthesis, and speech synthesis, and driving technical innovation in text-to-image and video generation. However, standard diffusion models are designed for generating data from random noise, which makes them a poor fit for tasks such as image translation or image restoration, where an explicit mapping between a given input and output is specified.
To address this, a variant called Denoising Diffusion Bridge Models (DDBMs) emerged. DDBMs can model a bridge process between two given distributions and thus apply well to image translation, image restoration, and similar tasks. However, these models mathematically rely on complex ordinary/stochastic differential equations, and generating high-resolution images typically requires hundreds of iterations, making them computationally inefficient and severely limiting their practical adoption.
Compared with standard diffusion models, the inference process of diffusion bridge models additionally involves linear combinations dependent on the initial condition and a singularity at the starting point, so standard diffusion-model inference algorithms cannot be applied directly. To this end, Jun Zhu's team at Tsinghua University proposed Diffusion Bridge Implicit Models (DBIM) ...
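The bridge process described above can be made concrete with a minimal sketch. Unlike a standard diffusion process, a bridge is pinned at both ends: x at t=0 is the source (e.g., a degraded image) and x at t=1 is the target, with noise that vanishes at both endpoints, as in a Brownian bridge. This is a generic 1-D illustration of the forward marginal such models use, not DBIM's sampler; the sigma value is an arbitrary choice for the example.

```python
import math, random

def bridge_sample(x0, x1, t, sigma, rng):
    """x_t = (1-t)*x0 + t*x1 + sigma*sqrt(t*(1-t))*eps, pinned at t=0 and t=1."""
    mean = (1 - t) * x0 + t * x1
    std = sigma * math.sqrt(t * (1 - t))
    return mean + std * rng.gauss(0, 1)

rng = random.Random(0)
x0, x1 = -1.0, 3.0            # e.g. a source pixel value and its translated target
print(bridge_sample(x0, x1, 0.0, 0.5, rng))  # exactly x0: the bridge is pinned
print(bridge_sample(x0, x1, 1.0, 0.5, rng))  # exactly x1: pinned at the far end
print(bridge_sample(x0, x1, 0.5, 0.5, rng))  # midpoint plus maximal noise
```

The initial-condition-dependent mean term (1−t)·x0 and the vanishing noise at t=0 are precisely the features, noted above, that standard diffusion samplers do not handle and that DBIM is designed around.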
"Computer vision has been killed off by GPT-4o" (doge)
量子位· 2025-03-29 07:46
Core Viewpoint - The article discusses the advancements in computer vision (CV) and image generation capabilities brought by the new GPT-4o model, highlighting its potential to disrupt existing tools and methodologies in the field [1][2].
Group 1: Technological Advancements
- GPT-4o introduces native multimodal image generation, expanding the functionalities of AI tools beyond traditional applications [2][12].
- The image generation process in GPT-4o is based on an autoregressive model, differing from the diffusion model used in DALL·E, which allows for better adherence to instructions and enhanced image editing capabilities [15][19].
- Observations suggest that the image generation may involve a multi-scale autoregressive scheme, where a rough image is generated first, followed by detail filling while the rough shape evolves [17][19].
Group 2: Industry Impact
- The advancements in GPT-4o's capabilities have raised concerns among designers and computer vision researchers, indicating a significant shift in the competitive landscape of AI tools [6][10].
- OpenAI's approach of scaling foundation models to achieve these capabilities has surprised many in the industry, suggesting a new trend in AI development [12][19].
- The potential for GPT-4o to enhance applications in autonomous driving has been noted, with implications for future developments in this sector [10].
Group 3: Community Engagement
- The article encourages community members to share their experiences and innovative uses of GPT-4o, fostering a collaborative environment for exploring AI applications [26].