Diffusion Models
New Work from Fei-Fei Li's Team: A Simple Reordering of the Generation Sequence Sharply Improves Pixel-Level Image Generation Quality
量子位· 2026-02-14 10:09
Wen Le, from Aofei Si. QbitAI | WeChat official account QbitAI

For a long time, AI image generation has been stuck on a classic dilemma: latent-space models are efficient but lose detail, while pixel-space models are high-fidelity but prone to structural chaos and slow sampling. Fast or faithful, pick one — the field had largely accepted this as an architectural trade-off with no clean fix. But is the generation order of diffusion models actually right? The latest paper from Fei-Fei Li's team proposes Latent Forcing, which breaks this consensus head-on: they find that the quality bottleneck lies not in the architecture but in the ordering. Put simply, just as a painter sketches an outline before filling in color, AI also needs an enforced "structure first, details later" logic. Merely by reordering the generation trajectory, Latent Forcing lets pixel diffusion models recover their efficiency while setting new SOTA on multiple metrics.

Bottlenecks of traditional methods
Before diving into Latent Forcing, consider the bottlenecks of the two mainstream approaches. Traditional pixel-level diffusion models draw distorted images because, during denoising, high-frequency texture detail tends to interfere with low-frequency semantic structure. The model is often forced to predict local pixel colors before it has worked out an object's overall outline, which at its root violates the natural logic of visual generation. So Fei-Fei Li's team asked: can we keep the lossless precision of pixel space while also gaining the structural guidance of latent space?

Sketch first
Latent Forcing's answer is ...
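The paper's actual mechanism reorders the denoising trajectory of a single pixel-diffusion model; as a minimal sketch of the "structure first, details later" idea only, the toy below makes the ordering explicit with two stages: resolve a coarse layout first, then denoise pixels conditioned on it. All function names, the 2× downsampling factor, and the dummy denoisers are illustrative assumptions, not the method from the paper.

```python
import numpy as np

def toy_two_stage_sample(denoise_coarse, denoise_fine, shape, steps=10, rng=None):
    """Toy 'structure first, details later' schedule (illustrative only):
    first denoise a low-resolution layout, then denoise pixels
    conditioned on that layout."""
    rng = np.random.default_rng(rng)
    h, w = shape
    # Stage 1: resolve coarse structure at reduced resolution.
    z = rng.standard_normal((h // 2, w // 2))
    for t in reversed(range(1, steps + 1)):
        z = denoise_coarse(z, t / steps)
    # Stage 2: denoise pixels, conditioned on the upsampled structure.
    cond = np.kron(z, np.ones((2, 2)))  # nearest-neighbour upsample
    x = rng.standard_normal((h, w))
    for t in reversed(range(1, steps + 1)):
        x = denoise_fine(x, cond, t / steps)
    return x
```

With trivial stand-in denoisers (`lambda z, t: 0.9 * z` for structure, `lambda x, c, t: 0.5 * (x + c)` for detail), the loop produces an image whose global layout is fixed before any pixel-level refinement happens — the ordering constraint the article describes.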
Deep Thoughts on Diffusion Language Models
机器之心· 2026-02-08 10:37
This article originally appeared on 精博士小酒馆; author: Wang Yunhe.

When I sat down to write this, the first thing that came to mind was a question a leader asked me years ago: what is the transformer's next step? My answer at the time was that the transformer is a paradigm reached through a long accumulation of quantitative change turning qualitative; very early vision work had similar ideas, such as non-local blocks, and convolution continues to complement attention. Diffusion itself is not exactly the transformer's successor either, but in terms of modeling approach it may have the potential to seriously disrupt autoregression. I have followed diffusion language models (dLLMs) for a long time, but limited time and compute kept me from thinking them through deeply. Exploring diffusion architectures from the text side is a relatively approachable entry point, and many of the problems there must be solved before multimodal versions become practical, so we will first focus on the algorithmic foundations of dLLMs. Starting in the second half of last year we gradually began exploring several directions, and, inspired by an internal expert, I wrote something of an insight piece just before New Year's Day. A few days ago, in a talk at AAAI, I highlighted several of the team's works, including next-block diffusion training, the hierarchical diffusion-in-diffusion structure, and diffusion agents. Related P ...
Kaiming He and a Sophomore Undergraduate Upend Diffusion Image Generation: Ditch Multi-Step Sampling and Latent Space, Output Pixels in One Step
量子位· 2026-02-02 05:58
Core Viewpoint
- The article discusses a new method called Pixel Mean Flow (pMF), which simplifies the architecture of diffusion models by eliminating traditional components like multi-step sampling and latent space, allowing direct image generation in pixel space [2][3][5]

Group 1: Methodology and Innovations
- pMF achieves significant performance improvements, with an FID of 2.22 at 256×256 resolution and 2.48 at 512×512, marking it as one of the best single-step, non-latent-space diffusion models [4][27]
- Eliminating multi-step sampling and latent space reduces the complexity of the generation process, allowing for a more efficient architecture [6][36]
- The core design of pMF has the network directly output pixel-level denoised images while using a velocity field to compute the loss during training [13][25]

Group 2: Experimental Results
- In experiments, the pMF model outperformed the previous method EPG, which had an FID of 8.82, demonstrating a substantial improvement in image generation quality [27]
- Adding a perceptual loss during training reduced FID from 9.56 to 3.53, showcasing the effectiveness of this approach [26]
- pMF is computationally efficient: GAN methods like StyleGAN-XL demand 1574 Gflops per forward pass, while pMF-H/16 requires only 271 Gflops [27]

Group 3: Challenges and Future Directions
- Combining single-step generation with pixel-space modeling makes architecture design harder, requiring more advanced solutions to handle the added complexity [10][12]
- The article emphasizes that as model capabilities improve, the historical compromises of multi-step sampling and latent-space encoding become less necessary, encouraging further exploration of direct, end-to-end generative modeling [36]
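The design the summary describes — a network that outputs denoised pixels while the training loss is computed through a velocity field — can be sketched under simple linear flow-matching assumptions. Note this is not the paper's actual MeanFlow objective (which regresses an average velocity); the interpolation path, the `predict_x0` signature, and all names here are illustrative assumptions.

```python
import numpy as np

def pmf_style_loss(x0, eps, t, predict_x0):
    """Simplified sketch of an x-prediction network trained with a
    velocity-space loss. Under the linear path
        x_t = (1 - t) * x0 + t * eps,
    the true velocity is eps - x0. The network predicts pixels (x0_hat);
    we convert that prediction to an implied velocity and regress it."""
    x_t = (1.0 - t) * x0 + t * eps
    x0_hat = predict_x0(x_t, t)            # network outputs a denoised image
    v_hat = (x_t - x0_hat) / max(t, 1e-6)  # implied velocity from x-prediction
    v_true = eps - x0                      # ground-truth velocity on this path
    return float(np.mean((v_hat - v_true) ** 2))
```

A perfect pixel predictor makes the implied velocity match the true one exactly, so the loss vanishes; that is the sense in which "output pixels, supervise velocity" is self-consistent.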
No to Reward Hacking! HKUST and Kuaishou Kling Propose an Efficient New Paradigm for RL Post-Training of Diffusion Models
机器之心· 2026-01-25 02:35
Core Insights
- The article discusses the challenges of using Reinforcement Learning (RL) to fine-tune diffusion models like Stable Diffusion, particularly the issue of "Reward Hacking", which can degrade image quality [2][5]
- A new framework called GARDO (Gated and Adaptive Regularization with Diversity-aware Optimization) is introduced, which aims to prevent Reward Hacking while enhancing sample exploration and diversity [2][12]

Background and Motivation
- RL has shown promising results in visual tasks, but defining an ideal reward function is challenging, often leading to the use of proxy rewards that can result in Reward Hacking [5][4]
- The article highlights the pitfalls of RL post-training, including low sample efficiency and exploration hindered by static reference models [9][10]

GARDO Framework
- GARDO addresses Reward Hacking through three core mechanisms:
  1. Gated KL Mechanism, which applies KL regularization only when the model generates samples in unreliable reward regions [14][15]
  2. Adaptive Regularization Target, which updates the reference model to prevent optimization stagnation [17]
  3. Diversity-Aware Advantage Shaping, which encourages diversity in generated samples to avoid mode collapse [18][19]

Experimental Results
- GARDO has been tested on various base models (SD3.5-Medium, Flux.1-dev) and demonstrated significant advantages over baseline methods like Flow-GRPO [20][21]
- The framework effectively prevents Reward Hacking while maintaining high image quality and sample efficiency, achieving better performance with fewer training steps [22][23]

Emergent Behavior
- GARDO can generate a higher number of objects in challenging tasks, indicating its potential to unlock new capabilities in visual generation [24][25]

Conclusion
- The work emphasizes that precise control matters more than strict constraints when applying RL to visual generation, making GARDO a valuable framework for researchers and developers looking to leverage RL in diffusion models [27]
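The first of the three mechanisms — switching KL regularization on only for samples in unreliable reward regions — can be sketched in a few lines. The gating rule below (penalize only samples whose proxy reward exceeds a threshold) is an illustrative assumption; the paper's actual reliability criterion may differ.

```python
import numpy as np

def gated_kl_penalty(rewards, kl_terms, gate_threshold):
    """Toy sketch of a gated-KL mechanism in the spirit of GARDO:
    the per-sample KL penalty is applied only where the proxy reward
    looks unreliable (here: above gate_threshold, suggesting possible
    reward hacking). Elsewhere the policy is free to explore."""
    rewards = np.asarray(rewards, dtype=float)
    kl_terms = np.asarray(kl_terms, dtype=float)
    gate = (rewards > gate_threshold).astype(float)  # 1 = unreliable region
    return gate * kl_terms                           # KL charged only when gated
```

The appeal of gating over a global KL coefficient is that well-behaved samples pay no regularization tax, which is one way to reconcile "prevent hacking" with "keep exploring".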
The AI Chip Landscape
傅里叶的猫· 2026-01-24 15:52
Core Insights
- The article discusses the evolving landscape of AI chips, particularly the rise of the TPU and its implications for major tech companies like Google, OpenAI, and Apple [3][5][7]

TPU's Rise
- The TPU is gaining traction as a significant player in the AI training and inference market, challenging NVIDIA's long-standing GPU dominance [3]
- Major companies like OpenAI and Apple are increasingly adopting TPUs for core operations, indicating a shift in the competitive landscape [3][4]
- The transition from GPU to TPU involves complex technical adaptations, which can mean high costs and extended timelines [4][6]

Supply and Demand Challenges
- There is currently a 50% supply gap in the global AI computing-power market, driven by surging demand for TPUs [5]
- The shortage is delaying projects and raising costs for companies that rely on TPUs, particularly affecting TSMC, the main foundry for the TPU [5]
- The immature software ecosystem around the TPU, particularly its incompatibility with the widely used CUDA framework, poses additional challenges to broad adoption [5][6]

TPU vs. AWS Trainium
- Google's TPU has hardware-level optimization for matrix and tensor operations, giving it significant efficiency advantages over AWS's Trainium, which lacks such integration [7]
- Trainium's reliance on external libraries increases resource consumption and limits efficiency, particularly in large-scale deployments [7]
- The two companies differ in network scaling strategy, with Google focusing on vertical scaling and AWS on horizontal scaling, leading to a differentiated competitive landscape [8]

Oracle's Unexpected Rise
- Oracle has emerged as a key player in the chip market by leveraging government policies and strategic partnerships to secure high-end chip supplies [9][10]
- The company has formed partnerships with government entities and other service providers to monopolize certain chip markets, creating a dual resource barrier [10]
- Oracle's $300 billion computing-resource deal with OpenAI highlights its strategy of profiting from reselling computing power [10]

OpenAI's Financial and Operational Challenges
- OpenAI faces a significant funding gap, with annual revenues of roughly $12 billion against a projected $300 billion investment need for expansion [14]
- Reliance on venture capital and rising computing costs exacerbate its financial pressures [14]
- OpenAI's core LLM inference business has low profitability, forcing a delicate balance between pricing and user retention [15]

Future of Large Models
- The industry is seeing diminishing returns on performance as model sizes grow, while computing costs rise exponentially [17]
- Resource constraints, particularly power supply and dependence on NVIDIA, are becoming critical bottlenecks for large-model development [17][18]
- Future large-model development is expected to pursue more efficient and diverse technical paths rather than pure parameter competition [18][19]

Conclusion
- The competition in AI chips and computing power is a battle for industry dominance, with Google, Oracle, and OpenAI navigating complex challenges and opportunities [19][20]
- The market should stabilize as supply chains improve, but monetizing the technology and integrating it into practical applications will be crucial for long-term success [20]
Midstream Intelligent-Driving Companies Are Racing to Snap Up End-to-End Talent...
自动驾驶之心· 2026-01-16 02:58
Core Viewpoint
- The article discusses the technological anxiety in the intelligent-driving sector, particularly among midstream manufacturers, highlighting a slowdown in cutting-edge development and a trend toward standardized mass-production solutions [1][2]

Group 1: Industry Trends
- Mass production of cutting-edge technologies is expected to begin in 2026, with current advances in intelligent-driving technology stagnating [2]
- The overall market for passenger vehicles priced above 200,000 yuan is around 7 million units, but the leading new players have not captured even one-third of that volume [2]
- The maturity of end-to-end technology is seen as a prerequisite for larger-scale mass production, especially as L3 regulations advance this year [2]

Group 2: Educational Initiatives
- A course titled "Practical Class for End-to-End Mass Production" has been launched, focusing on the technical capabilities needed for mass production in intelligent driving [2]
- The course emphasizes practical applications and is limited to a small number of participants, with only 8 spots remaining [2]

Group 3: Course Content Overview
- The course covers various aspects of end-to-end algorithms, including:
  - An overview of end-to-end tasks, merging perception tasks, and designing learning-based control algorithms [7]
  - Two-stage end-to-end algorithm frameworks, including modeling and information transfer between perception and planning [8]
  - One-stage end-to-end algorithms that allow lossless information transfer, enhancing performance [9]
  - The use of navigation information in autonomous driving, including map formats and encoding methods [10]
  - An introduction to reinforcement-learning algorithms to complement imitation learning for driving behavior [11]
  - Optimization of trajectory outputs through practical projects involving imitation and reinforcement learning [12]
  - Post-processing logic for trajectory smoothing to ensure stability and reliability in mass production [13]
  - Sharing of mass-production experience from multiple perspectives, including data, models, and rules [14]

Group 4: Target Audience
- The course is aimed at advanced learners with a foundational understanding of autonomous-driving algorithms, reinforcement learning, and programming [15]
- Participants should have access to a GPU (a 4090 or better is recommended) and familiarity with common algorithm frameworks [18]
Westlake University Proposes the RDPO Reinforcement-Learning Framework for Parallel Inference Acceleration of Diffusion Models
量子位· 2026-01-13 07:21
Fei Yang, compiled from Aofei Si. QbitAI | WeChat official account QbitAI

The era of diffusion models (such as Stable Diffusion) "squeezing out" high-resolution images one at a time is being washed over by the wave of world models generating high-definition video in real time. But whether for images or video, the sequential denoising at the heart of diffusion models is like a relay race that cannot be parallelized — the ultimate bottleneck on speed. How do you bolt on an acceleration engine without harming the model's "painting skill"? The RDPO (Residual Dirichlet Policy Optimization) framework from Westlake University's AGI Lab offers a clever answer: leave the model itself untouched and instead optimize its "sampling navigation system". Crucially, because the extra gradient computations are independent, they can be fully parallelized, preserving low-latency sampling. The team introduces a two-stage optimization framework: first, EPD-Solver optimizes a small set of learnable parameters via a distillation-based method; the team then proposes RDPO, a parameter-efficient reinforcement-learning fine-tuning framework that reframes the solver as a stochastic Dirichlet policy. Unlike conventional approaches that fine-tune a massive backbone, this RL method operates strictly in the low-dimensional solver space, improving performance on complex text-to-image (T2I) generation while effectively mitigating reward hacking ...
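The key structural idea — a stochastic policy over a low-dimensional solver space rather than over backbone weights — can be illustrated with a Dirichlet distribution, whose samples are non-negative weights summing to 1. How RDPO actually parameterizes and combines solver coefficients is not specified in this excerpt; the convex-combination reading below is an illustrative assumption.

```python
import numpy as np

def sample_solver_weights(alpha, rng=None):
    """Illustrative sketch of a Dirichlet policy over solver coefficients:
    sample a point on the probability simplex (non-negative, sums to 1)
    from concentration parameters alpha. Only these few parameters are
    optimized by RL; the diffusion backbone is untouched."""
    rng = np.random.default_rng(rng)
    return rng.dirichlet(alpha)

def combine_directions(weights, directions):
    """Convex combination of candidate update directions at a solver step
    (an assumed use of the sampled weights, for illustration)."""
    return sum(w * d for w, d in zip(weights, directions))
```

Because each sampled weight vector is cheap to draw and evaluate, rollouts over the solver space can run independently — consistent with the article's point that the extra gradient computations parallelize without hurting sampling latency.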
A Batch of End-to-End & VLA Job Openings Will Be Released Soon
自动驾驶之心· 2026-01-12 03:15
Core Insights
- Industry consensus indicates that 2026 will be a pivotal year for end-to-end (E2E) and VLA (Vision-Language-Action) technologies in autonomous driving, with a focus on optimizing production processes rather than major algorithmic changes [1]
- The industry is actively recruiting experienced algorithm engineers and developing talent to tackle the complex challenges ahead, particularly in BEV perception, large models, diffusion models, and reinforcement learning [1]

Course Overview
- The E2E and VLA autonomous-driving course is designed to provide a comprehensive learning path from principles to practical applications, developed in collaboration with industry leaders [3]
- The course covers various aspects of E2E algorithms, including their historical development, the advantages and disadvantages of different paradigms, and current trends in academia and industry [6][7]
- It emphasizes key technical keywords expected to come up frequently in job interviews over the next two years [7]

Course Structure
- Chapter 1 introduces E2E algorithms, tracing their evolution from modular approaches to current paradigms like VLA [6]
- Chapter 2 covers the background knowledge needed to understand E2E technologies, including VLA, large language models, diffusion models, and reinforcement learning [11]
- Chapter 3 delves into two-stage E2E algorithms, exploring their emergence and comparing them with one-stage approaches [7]
- Chapter 4 presents one-stage E2E algorithms and VLA, highlighting various subfields and their contributions to the ultimate goals of E2E systems [8]
- Chapter 5 is a practical assignment on RLHF (Reinforcement Learning from Human Feedback) fine-tuning, demonstrating how to build and experiment with pre-training and reinforcement-learning modules [9]

Learning Outcomes
- The course aims to bring participants to the level of an E2E autonomous-driving algorithm engineer with roughly one year of experience, covering methodologies including one-stage, two-stage, world models, and diffusion models [15]
- Participants will gain a deeper understanding of key technologies such as BEV perception, multimodal large models, reinforcement learning, and diffusion models, enabling them to apply this knowledge in real-world projects [15]
Small-Model Layer Counts Are Weirdly Superstitious: 12/32/64 Layers Work Well, 16/24/48 Work Badly
量子位· 2026-01-11 04:02
Core Insights
- The article reports significant findings on 70M-parameter small models, arguing that the choice of architecture matters less than previously thought, while the model's "shape" (depth-to-width ratio) is more critical [1][2]

Group 1: Model Architecture and Performance
- The optimal layer count for small models is 32, with 12 and 64 layers also performing well, while 16, 24, and 48 layers yield poor results [2][15]
- The gap between "good" and "bad" layer configurations exceeds 6 percentage points, with "good" configurations averaging around 38% accuracy and "bad" ones around 32% [15][16]
- The hidden dimension must be at least 512 for optimal performance, with the 32-layer configuration achieving the highest score of 38.50% [18][23]

Group 2: Comparative Analysis of Architectures
- A comparison of 12 architectures, including LLaMA3 and Qwen3, shows that modern architectures perform similarly at the 70M-parameter scale, with average differences under 2% [25][26]
- Improvements in modern architectures are primarily designed for models with over 700 million parameters and provide no measurable advantage for 70M models [27]

Group 3: Diffusion Models vs. Autoregressive Models
- Diffusion models, while slightly lower in average accuracy (31-32%), deliver 3.8× faster inference and lower hallucination rates than autoregressive models [28][30]
- Adding a "Canon layer" improves factual accuracy by 1% for autoregressive models and over 2% for diffusion models, at minimal parameter cost [35][36]

Group 4: New Model Development
- The Dhara-70M model is introduced, combining the strengths of autoregressive and diffusion models; it is built on the LLaMA3-Canon architecture and converted using the WSD method [41][42]
- Dhara-70M has 71.34M parameters, 32 layers, and a hidden size of 384, designed for high throughput and factual accuracy [44]

Group 5: Recommendations for Model Builders
- The article advises small-language-model builders to focus on the fundamental depth-to-width ratio rather than chasing the latest architectural trends, especially for applications requiring high-speed processing and factual accuracy [45]
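Scanning depth-width trade-offs at a fixed ~70M budget, as these experiments do, only needs a back-of-envelope parameter count. The sketch below uses the standard rough estimate of ~12·L·h² parameters for the transformer blocks plus a vocab·h embedding term; the constant 12, the untied embedding, and the 32K vocabulary are assumptions, not figures from the article.

```python
def transformer_param_estimate(layers, hidden, vocab=32000):
    """Back-of-envelope parameter count for a decoder-only transformer:
    ~12 * L * h^2 for attention + MLP blocks, plus vocab * h for the
    embedding table. Useful for picking (depth, width) pairs that land
    near a target budget such as ~70M."""
    return 12 * layers * hidden ** 2 + vocab * hidden
```

Under these assumptions, the article's Dhara-70M shape (32 layers, hidden 384) lands near 69M parameters, close to the reported 71.34M, while a 16-layer model at the same budget would need a much wider hidden dimension — which is exactly the kind of shape trade the article says dominates architecture choice at this scale.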
Learn Anytime! Small-Group Course on End-to-End & VLA Autonomous Driving (Videos + Q&A)
自动驾驶之心· 2026-01-08 05:58
Core Viewpoint
- The article presents an advanced course on end-to-end (E2E) autonomous driving, focusing on the latest technologies such as BEV perception, visual language models (VLM), diffusion models, and reinforcement learning, aimed at equipping participants with cutting-edge skills in the field [1][4][8]

Group 1: Course Structure
- The course is divided into several chapters, starting with an introduction to end-to-end algorithms, covering their historical development and the advantages of E2E methods over modular approaches [4]
- The second chapter covers background knowledge essential for understanding E2E technologies, including VLA, diffusion models, and reinforcement learning, which are crucial for job interviews in the next two years [5][9]
- The third chapter delves into two-stage E2E methods, discussing their emergence, advantages, and notable algorithms such as PLUTO and CarPlanner [5][6]
- The fourth chapter highlights one-stage E2E methods and VLA, exploring various subfields and their contributions to the ultimate goals of E2E systems [6][10]

Group 2: Practical Application
- The course includes a major project on RLHF fine-tuning, letting participants apply their knowledge in practical scenarios, including building pre-training and reinforcement-learning modules [7]
- It aims to bring participants to a level equivalent to one year of experience as an E2E autonomous-driving algorithm engineer, covering various methodologies and key technologies [13]

Group 3: Target Audience and Requirements
- The course is designed for people with a foundational understanding of autonomous driving who are familiar with the basic modules and with concepts like transformer models, reinforcement learning, and BEV perception [11]
- Participants should have a background in probability theory and linear algebra, plus proficiency in Python and PyTorch [11]