Diffusion Models
CVPR 2026 | Fitting Diffusion Models with a "Physics Engine": Peng Yuxin's Team at Peking University Proposes NS-Diff, Teaching Diffusion Models Fluid and Rigid-Body Mechanics
机器之心· 2026-03-19 01:25
Core Viewpoint
- The article discusses the latest research by Professor Peng Yuxin's team from Peking University: a new video generation framework called NS-Diff, which integrates physical constraints into video diffusion models to enhance the physical realism of generated videos [2][7][30].

Background and Motivation
- Current AI video generation models, while visually impressive, often fail to adhere to the physical laws of the real world, leading to unrealistic "blooper" moments in generated content [5][7].
- The challenge lies in bridging the gap between "visual realism" and "physical realism" in AI-generated content [7].

Research Contributions
- NS-Diff combines physical constraints with reinforcement learning to guide video generation, ensuring that generated frames not only look good but also follow physical laws [7][30].
- Key components of NS-Diff:
  1. A noise-robust physical dynamics detector that accurately analyzes motion information in noisy environments [8].
  2. A physical-condition latent injection module that encodes key physical information and integrates it into the denoising process [13].
  3. A reinforcement learning optimization module that applies simplified Navier-Stokes constraints and minimizes jerk to ensure physical plausibility in dynamic processes [15][17].

Experimental Results
- NS-Diff demonstrated superior performance across metrics on the PhysVideoBench and UCF-101 datasets, achieving a 43% reduction in jerk error and a 33% reduction in fluid divergence [23][24].
- The model's Fréchet Video Distance (FVD) improved by 22.7%, indicating gains in both physical realism and visual quality [23].
- On the UCF-101 benchmark, NS-Diff achieved an FVD of 106 and frame consistency of 0.94, outperforming existing methods [24].

Conclusion
- The research indicates that deeply integrating classical physical constraints into generative models is an effective approach to addressing physical distortion in video generation [30].
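The two physical penalties the summary mentions, fluid divergence (the incompressibility residual of a simplified Navier-Stokes constraint) and jerk minimization, can be illustrated with a minimal finite-difference sketch. This is a generic illustration, not the paper's implementation; the function names `divergence_penalty` and `jerk_penalty` are hypothetical.

```python
import numpy as np

def divergence_penalty(vx, vy):
    """Mean squared divergence of a 2-D velocity field, the
    incompressibility residual (div v = 0) a simplified Navier-Stokes
    constraint would penalize. vx, vy: (H, W) arrays."""
    dvx_dx = np.gradient(vx, axis=1)
    dvy_dy = np.gradient(vy, axis=0)
    return float(np.mean((dvx_dx + dvy_dy) ** 2))

def jerk_penalty(positions):
    """Mean squared jerk (third temporal difference) of a trajectory.
    positions: (T, D) array of object positions over T frames."""
    jerk = np.diff(positions, n=3, axis=0)
    return float(np.mean(jerk ** 2))

# Uniform motion has zero jerk; jittery motion does not.
t = np.arange(8, dtype=float)
straight = np.stack([t, 2 * t], axis=1)
erratic = np.stack([t, np.sin(3 * t)], axis=1)

# A pure rotation field (vx=-y, vy=x) is divergence-free.
y, x = np.mgrid[0:5, 0:5].astype(float)
```

A physically plausible video should drive both penalties toward zero for rigid bodies and fluids respectively.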
Unifying Discrete and Continuous Diffusion: Renmin University & Ant Group Propose LLaDA-o, Efficiently Achieving Multi-Modal Understanding and Generation
机器之心· 2026-03-14 04:03
Core Insights
- The article discusses the development of LLaDA-o, an efficient, length-adaptive omni diffusion model that integrates discrete text diffusion and continuous image diffusion into a unified framework [3][19].

Group 1: Model Performance
- LLaDA-o achieves state-of-the-art (SOTA) performance in both multi-modal understanding and text-to-image generation, a significant advance for multi-modal diffusion models [3][19].
- On multi-modal understanding benchmarks, LLaDA-o outperforms existing diffusion models, scoring 66.1 on MathVista and 87.9 on ChartQA, the leading model in this category [7][9].
- The model also excels at fine-grained generation, scoring 87.04 on DPG-Bench and surpassing strong prior models such as SD3-Medium and Lumina-DiMOO [9][11].

Group 2: Technical Innovations
- LLaDA-o employs a Mixture of Diffusion (MoD) framework with two specialized diffusion experts: an Understanding Expert for discrete masked diffusion and a Generation Expert for continuous diffusion, allowing effective optimization across modalities [12][14].
- Intra-modality bidirectional attention reduces redundant calculation during inference, improving efficiency and overall performance [15].
- An adaptive length augmentation strategy lets the model dynamically adjust output lengths based on context, handling variable-length text generation without altering the underlying architecture [17].

Group 3: Future Implications
- The successful integration of discrete language understanding and continuous visual generation within the MoD framework positions LLaDA-o as a strong contender against autoregressive models, paving the way for non-autoregressive architectures [19][20].
- The ongoing evolution of large language diffusion models suggests that unified diffusion-based models will play a crucial role in the landscape of general artificial intelligence [20].
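The MoD routing idea described above, one expert per modality, can be sketched in a few lines. This is a toy illustration under stated assumptions: the real experts are transformer feed-forward blocks, not the fixed linear maps used here, and `mod_forward` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for LLaDA-o's two experts: an Understanding Expert
# (discrete masked text diffusion) and a Generation Expert (continuous
# image diffusion). Here each is just a fixed linear map.
D = 4
W_understand = rng.normal(size=(D, D))
W_generate = rng.normal(size=(D, D))

def mod_forward(tokens, modality):
    """Route each token embedding to the expert for its modality.
    tokens: (N, D) embeddings; modality: (N,) array of 'text'/'image'."""
    out = np.empty_like(tokens)
    is_text = modality == "text"
    out[is_text] = tokens[is_text] @ W_understand   # discrete masked-diffusion path
    out[~is_text] = tokens[~is_text] @ W_generate   # continuous-diffusion path
    return out

tokens = rng.normal(size=(6, D))
modality = np.array(["text", "image", "text", "image", "image", "text"])
out = mod_forward(tokens, modality)
```

The point of the design is that each modality's loss only updates its own expert's parameters, so discrete and continuous objectives do not interfere.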
Diffusion Models Finally Learn to "Size Up the Problem Before Cooking": Dynamically Allocating Compute by Prompt Difficulty, Saving Time on Simple Prompts While Preserving Quality on Complex Ones
量子位· 2026-03-09 10:05
Core Viewpoint
- The article introduces "CoTj", a framework from China Unicom's Data Science and AI Research Institute that enhances diffusion models' ability to dynamically allocate computational resources based on prompt complexity, significantly improving image generation quality [4][35].

Group 1: Framework and Mechanism
- CoTj gives diffusion models "System 2" planning capabilities, allocating compute dynamically according to the complexity of the prompt [4][14].
- It employs a "Predict-Plan-Execute" reasoning paradigm, featuring a lightweight predictor that rapidly estimates the current Diffusion DNA from condition embeddings [14][15].
- The framework transforms the sampling process into a directed acyclic graph (DAG) optimization problem, enabling efficient trajectory planning [11][13].

Group 2: Performance and Results
- In experiments, CoTj delivered superior image quality even with a basic first-order solver, outperforming traditional methods that used high-order solvers under the same conditions [22][24].
- It achieved significant improvements in accuracy and speed across models, including a 60% reduction in mean squared error (MSE) and more than a 6 dB increase in peak signal-to-noise ratio (PSNR) [25][28].
- Trajectory planning preserves high fidelity even with drastically reduced sampling steps, retaining details that traditional methods often lose [27][29].

Group 3: Future Directions
- The research team plans to extend CoTj's theoretical foundation to more complex video dynamics and to explore unsupervised Diffusion DNA discovery across modalities [36][37].
- The framework represents a significant advance in computational efficiency and resource-aware planning in generative AI [35][36].
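The summary says CoTj recasts sampling as a DAG optimization problem. One natural reading is a dynamic program over timesteps, where each edge is a candidate solver jump with a predicted error cost; the sketch below implements that generic idea (the article gives no implementation details, and `plan_trajectory` is a hypothetical name, not CoTj's API).

```python
def plan_trajectory(costs, start, end):
    """Cheapest path through a timestep DAG by dynamic programming.

    costs[(i, j)]: predicted error of jumping from step i to step j (j > i).
    Returns (total_cost, path). Since edges only go forward in time,
    visiting nodes in ascending order is a valid topological order.
    """
    best = {start: (0.0, [start])}
    nodes = sorted({i for i, _ in costs} | {j for _, j in costs})
    for node in nodes:
        if node not in best:
            continue  # unreachable from start
        base_cost, path = best[node]
        for (i, j), c in costs.items():
            if i == node:
                cand = base_cost + c
                if j not in best or cand < best[j][0]:
                    best[j] = (cand, path + [j])
    return best[end]

# Two hops via step 1 beat the direct jump from 0 to 2.
total, path = plan_trajectory({(0, 1): 1.0, (1, 2): 1.0, (0, 2): 3.0}, 0, 2)
```

A harder prompt would get a cost table favoring many small jumps (more compute); an easy one would make large jumps cheap, which is exactly the adaptive allocation the article describes.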
ICLR 2026 Oral | Peng Yijie's Team at Peking University Proposes a New Paradigm for Efficient Optimization: A Recursive Likelihood Ratio Gradient Optimizer for Diffusion Model Post-Training
机器之心· 2026-03-09 03:58
Core Viewpoint
- The article introduces the Recursive Likelihood Ratio (RLR) optimizer from Professor Peng Yijie's team at Peking University, which offers a new semi-gradient fine-tuning solution for diffusion models, addressing the challenges of efficiency and performance in downstream applications [2][10].

Group 1: Background and Challenges
- Diffusion models (DM) have become a core framework for image synthesis and video generation due to their high-fidelity data generation capabilities [2].
- The main industry challenge is efficiently adapting pre-trained diffusion models to specific application requirements [2].
- Mainstream fine-tuning methods fall into two categories, reinforcement learning (RL) methods and truncated backpropagation (BP) methods, both with significant drawbacks [7].
- Truncated BP introduces structural bias into gradient estimation, potentially causing model collapse and content degradation [7].
- RL methods reduce memory requirements but suffer from high-variance gradient estimates and slow convergence [7].

Group 2: RLR Optimizer Design
- The RLR optimizer introduces a semi-gradient estimation paradigm that exploits the inherent noise characteristics of diffusion models to achieve unbiased, low-variance gradient estimation [10].
- Its core design includes the following estimation modules:
  1. A first-order estimation module that backpropagates directly through the reward model at the first time step [11].
  2. A zero-order estimation module that employs parameter-perturbation strategies for the remaining time steps, ensuring unbiased gradient estimation without caching intermediate latent variables [12].
- A controllable parameter, the local sub-chain length h, directly governs the trade-off between memory usage and gradient variance [14].

Group 3: Performance Validation
- Large-scale experiments on Text2Image and Text2Video tasks validated the RLR optimizer, showing superior performance over existing RL and truncated BP methods [18].
- On Text2Image, RLR raised the ImageReward score of Stable Diffusion 1.4 from 32.90 to 76.55, outperforming DDPO by approximately 47% and AlignProp by about 14% [18].
- On Text2Video, RLR achieved a weighted average score of 84.63, surpassing models such as VideoCrafter and Gen-2, and excelling in particular on the dynamic degree metric [18][20].
- RLR also incorporates a diffusion chain-of-thought prompting technique that targets generation defects at specific scales, improving fine-grained tasks such as hand generation [22].
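The zero-order module above estimates gradients by perturbing parameters instead of backpropagating. A generic antithetic Gaussian-perturbation estimator conveys the flavor; note this is a standard zeroth-order sketch, not the paper's recursive likelihood ratio estimator, and `zeroth_order_grad` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(42)

def zeroth_order_grad(f, theta, sigma=0.05, n_samples=8000):
    """Antithetic Gaussian-perturbation gradient estimate of f at theta.

    Needs only forward evaluations of f (no backprop), which is why such
    estimators avoid caching intermediate latents. Averaging over many
    perturbations trades compute for variance."""
    grads = []
    for _ in range(n_samples):
        eps = rng.normal(size=theta.shape)
        # central-difference form: (f(θ+σε) − f(θ−σε)) / (2σ) · ε
        grads.append(
            (f(theta + sigma * eps) - f(theta - sigma * eps)) / (2 * sigma) * eps
        )
    return np.mean(grads, axis=0)

# Sanity check against the true gradient of f(θ) = ‖θ‖², i.e. ∇f = 2θ.
theta = np.array([1.0, -2.0, 0.5])
g = zeroth_order_grad(lambda t: float(t @ t), theta)
```

The sample count plays the same role as RLR's sub-chain length h: more evaluations mean lower variance at higher cost.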
New Work from Li Fei-Fei's Team: A Simple Reordering of the Generation Process Substantially Improves Pixel-Level Image Generation Quality
量子位· 2026-02-14 10:09
Core Viewpoint
- The article discusses the Latent Forcing method from Li Fei-Fei's team, which challenges the traditional understanding of AI image generation by emphasizing that the ordering of the generation process matters more than the architecture itself [4][6].

Group 1: Traditional Methods and Their Limitations
- Traditional pixel-level diffusion models struggle to generate accurate images because high-frequency texture details interfere with low-frequency semantic structure during denoising [8][12].
- The industry has largely shifted to latent-space models, which compress images into lower-dimensional spaces for faster generation, but this introduces reconstruction errors and gives up end-to-end modeling of the raw data [10][12].

Group 2: Latent Forcing Method
- Latent Forcing reorders the diffusion trajectory to retain pixel-level lossless precision while gaining structural guidance from latent space [14][26].
- It introduces a dual time variable mechanism: the model processes pixel and latent variables simultaneously, each with its own customized denoising rhythm [16][19].
- In the initial generation phase, latent variables establish the semantic structure before pixel details are refined, yielding a final output that is 100% lossless without any decoder [20][21].

Group 3: Performance Metrics
- Latent Forcing demonstrated superior performance on the ImageNet leaderboard, achieving a conditional generation FID of 9.76, significantly improved from the previous best of 18.60 [22].
- In a 200-epoch training scenario, it achieved a conditional FID of 2.48 and an unconditional FID of 7.2, setting a new state of the art for pixel-space diffusion Transformers [23][24].

Group 4: Research Team
- The project is led by Li Fei-Fei, with contributions from Stanford co-authors Eric Ryan Chan, Kyle Sargent, Changan Chen, and Ehsan Adeli, and collaboration from University of Michigan professor Justin Johnson [27][28][29].
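The dual-time-variable mechanism can be pictured as two coupled noise schedules in which the latent clock runs ahead of the pixel clock. The exact rhythm is not given in the article, so the schedule below is an assumed illustrative form (`dual_time_schedule` and the `lead` parameter are hypothetical).

```python
import numpy as np

def dual_time_schedule(num_steps, lead=0.3):
    """Illustrative dual-time schedule: latent time t_latent runs ahead
    of pixel time t_pixel by `lead`, so semantic structure (latent) is
    resolved before pixel detail. Times go from 1 (pure noise) to 0
    (clean); this is an assumed form, not the paper's schedule."""
    t_pixel = np.linspace(1.0, 0.0, num_steps)
    t_latent = np.clip(t_pixel - lead, 0.0, 1.0)  # latent is always "cleaner"
    return t_pixel, t_latent

t_pix, t_lat = dual_time_schedule(6, lead=0.3)
```

Because the latent variable is always at a lower noise level, its partially denoised semantics can guide each pixel denoising step.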
Deep Reflections on Diffusion Language Models
机器之心· 2026-02-08 10:37
Core Viewpoint
- The article discusses the potential of diffusion language models (DLLM) and their implications for artificial intelligence, emphasizing the need for improvements in architecture, tokenizer, optimization, and data engineering to enhance their efficiency and effectiveness [4][5][6].

Group 1: Architectural Improvements
- Current diffusion models inherit autoregressive (AR) frameworks, and the random masking of tokens disrupts reuse of key-value (KV) caches, limiting efficiency [6][9].
- A more suitable attention structure or a structured masking approach is needed to improve inference efficiency while keeping diffusion's decoding advantages [6][9].

Group 2: Tokenization Strategies
- The ideal diffusion model should not strictly follow AR paradigms; a more structured tokenization approach could process text at different granularities [9][10].
- A hierarchical tokenizer could improve the model's ability to generate coherent outlines and detailed content, enhancing overall performance [9][10].

Group 3: Optimization Techniques
- Diffusion models face inefficient gradient computation, particularly on long sequences where only a few tokens are masked [10][11].
- More structured masking and dynamic adjustments during training could improve performance and reduce computational overhead [10][11].

Group 4: Output Length Adaptation
- Current diffusion models require predefined output lengths, which leads to inefficiency in generating responses [10].
- Dynamically inferring optimal output lengths during inference could enhance adaptability and efficiency [10].

Group 5: Data Engineering
- Most diffusion models currently rely on data prepared for AR models, which may not fully leverage the potential of diffusion techniques [10][11].
- Enhancing training data with structured masking and positional information could improve learning efficiency [10][11].

Group 6: Model Efficiency
- Overall inference efficiency needs improvement, especially as batch sizes increase [10].
- Techniques such as multi-step distillation and low-bit quantization could reduce inference costs while maintaining performance [10][11].

Group 7: Reasoning and Latent Thinking
- Deeper reasoning and implicit thinking in diffusion models remain underexplored, particularly structured thinking chains [10][11].
- Remasking during the denoising process could let the model refine outputs based on confidence levels [10][11].

Group 8: Prompt Engineering
- Adapting prompt formats to diffusion models could lead to more efficient decoding and reasoning [10][11].
- Transitioning from traditional question-answer prompts to fill-in-the-blank styles may improve response generation [10][11].

Group 9: Future Unified Architectures
- The future of AI may benefit from a unified architecture that integrates various modalities, leveraging the strengths of both AR and diffusion models [10][11].
- Integrating discrete diffusion models with existing frameworks could unlock new capabilities in multi-modal tasks [10][11].
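The confidence-based remasking mentioned in Group 7 is easy to sketch: after a denoising step, commit only the most confident token predictions and re-mask the rest for later revision. This is a minimal generic sketch of the idea, not any specific model's decoder; `remask_low_confidence` is a hypothetical name.

```python
import numpy as np

MASK = -1  # sentinel mask token id (illustrative)

def remask_low_confidence(tokens, confidences, keep_ratio):
    """Keep the top keep_ratio fraction of predictions by confidence;
    re-mask everything else so subsequent denoising steps can revise it.
    tokens, confidences: (N,) arrays for one denoising step."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(confidences)[::-1]          # most confident first
    out = np.full_like(tokens, MASK)
    out[order[:n_keep]] = tokens[order[:n_keep]]   # commit confident tokens
    return out

tokens = np.array([11, 7, 42, 3, 99])
conf = np.array([0.9, 0.2, 0.8, 0.1, 0.95])
out = remask_low_confidence(tokens, conf, keep_ratio=0.4)
```

Iterating this loop is what lets a diffusion LM "think again" about low-confidence spans, in contrast to an AR decoder's irrevocable left-to-right commitments.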
Kaiming He and a Second-Year Undergraduate Upend Diffusion Image Generation: Dropping Multi-Step Sampling and Latent Space for One-Step Direct Pixel Output
量子位· 2026-02-02 05:58
Core Viewpoint
- The article introduces a new method called Pixel Mean Flow (pMF), which simplifies diffusion model architecture by eliminating traditional components such as multi-step sampling and the latent space, allowing direct image generation in pixel space [2][3][5].

Group 1: Methodology and Innovations
- pMF achieves significant performance improvements, with an FID of 2.22 at 256×256 resolution and 2.48 at 512×512, among the best single-step, non-latent-space diffusion models [4][27].
- Eliminating multi-step sampling and the latent space reduces the complexity of the generation process and yields a more efficient architecture [6][36].
- The core design has the network directly output pixel-level denoised images, while a velocity field is used to compute the loss during training [13][25].

Group 2: Experimental Results
- pMF outperformed the previous method EPG, which had an FID of 8.82, a substantial improvement in image generation quality [27].
- Adding perceptual loss during training reduced FID from 9.56 to 3.53 [26].
- pMF is computationally efficient: StyleGAN-XL demands 1574 GFLOPs per forward pass, while pMF-H/16 requires only 271 GFLOPs [27].

Group 3: Challenges and Future Directions
- Combining single-step generation with pixel space increases the difficulty of architecture design, necessitating more advanced solutions [10][12].
- The article emphasizes that as model capabilities improve, the historical compromises of multi-step sampling and latent-space encoding become less necessary, encouraging further exploration of direct, end-to-end generative modeling [36].
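The summary's key design point, a network that outputs the denoised image directly while the loss is computed on a velocity field, can be sketched in flow-matching terms. The coupling below is an assumed simplification, not the paper's exact objective: the predicted image implies an average velocity u = (z_t − x̂)/t, which is regressed to the path velocity v = ε − x of the linear path z_t = (1 − t)x + tε. `pmf_style_loss` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def pmf_style_loss(net, x, eps, t):
    """Sketch of an x-prediction network trained through a velocity loss.

    net(z_t, t) returns a denoised image x_hat; the implied average
    velocity (z_t - x_hat) / t is matched to the flow target eps - x.
    Assumes t in (0, 1]."""
    z_t = (1 - t) * x + t * eps          # linear noising path
    x_hat = net(z_t, t)                  # direct pixel-space x-prediction
    u = (z_t - x_hat) / t                # velocity implied by the prediction
    v_target = eps - x                   # instantaneous path velocity
    return float(np.mean((u - v_target) ** 2))

# An oracle network that returns the clean image drives the loss to ~0.
x = rng.normal(size=(2, 8))
eps = rng.normal(size=(2, 8))
loss = pmf_style_loss(lambda z, t: x, x, eps, t=0.5)
```

With this parameterization, single-step sampling is just one evaluation of `net` at t = 1, which is the "one-step direct pixel output" the headline refers to.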
Rejecting Reward Hacking: HKUST and Kuaishou Kling Propose a New Paradigm for Efficient RL Post-Training of Diffusion Models
机器之心· 2026-01-25 02:35
Core Insights
- The article discusses the challenges of using reinforcement learning (RL) to fine-tune diffusion models like Stable Diffusion, particularly the issue of "Reward Hacking", which can degrade image quality [2][5].
- It introduces GARDO (Gated and Adaptive Regularization with Diversity-aware Optimization), a framework that aims to prevent Reward Hacking while enhancing sample exploration and generation diversity [2][12].

Background and Motivation
- RL has shown promising results on visual tasks, but defining an ideal reward function is difficult, so proxy rewards are used, which can lead to Reward Hacking [4][5].
- RL post-training also suffers from low sample efficiency, and static reference models hinder exploration [9][10].

GARDO Framework
- GARDO addresses Reward Hacking through three core mechanisms:
  1. A Gated KL Mechanism that applies KL regularization only when the model generates samples in unreliable reward regions [14][15].
  2. An Adaptive Regularization Target that updates the reference model to prevent optimization stagnation [17].
  3. Diversity-Aware Advantage Shaping that encourages diversity in generated samples to avoid mode collapse [18][19].

Experimental Results
- GARDO has been tested on various base models (SD3.5-Medium, Flux.1-dev) and demonstrated significant advantages over baseline methods like Flow-GRPO [20][21].
- The framework prevents Reward Hacking while maintaining high image quality and sample efficiency, achieving better performance with fewer training steps [22][23].

Emergent Behavior
- GARDO can generate a higher number of objects in challenging tasks, indicating its potential to unlock new capabilities in visual generation [24][25].

Conclusion
- The work emphasizes that precise control matters more than strict constraints in RL-based visual generation, making GARDO a valuable framework for researchers and developers applying RL to diffusion models [27].
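The Gated KL Mechanism can be sketched as a per-sample switch on the KL term. The gating criterion below (treating unusually high proxy rewards as the unreliable, likely-hacked region) is my assumption about what "unreliable reward regions" means; the article does not specify the gate, and `gated_kl_penalty` is a hypothetical name.

```python
import numpy as np

def gated_kl_penalty(reward, kl, reward_threshold, beta=0.1):
    """Apply the KL regularizer only where the proxy reward looks
    unreliable (here: above a threshold, where hacking is suspected);
    elsewhere the penalty is zero, leaving the policy free to explore.

    reward, kl: per-sample arrays; returns per-sample penalties."""
    gate = (reward > reward_threshold).astype(float)  # 1 in the gated region
    return beta * gate * kl

reward = np.array([0.2, 0.9, 0.95, 0.4])
kl = np.array([1.0, 2.0, 3.0, 4.0])
pen = gated_kl_penalty(reward, kl, reward_threshold=0.8, beta=0.1)
```

Compared with an always-on KL term, the gate avoids penalizing exploration in regions where the proxy reward is still trustworthy, which is the "precise control over strict constraints" point in the conclusion.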
The AI Chip Landscape
傅里叶的猫· 2026-01-24 15:52
Core Insights
- The article discusses the evolving landscape of AI chips, particularly the rise of the TPU and its implications for major tech companies such as Google, OpenAI, and Apple [3][5][7].

TPU's Rise
- The TPU is gaining traction as a significant player in the AI training and inference market, challenging NVIDIA's long-standing GPU dominance [3].
- Major companies like OpenAI and Apple are increasingly adopting TPUs for core operations, signaling a shift in the competitive landscape [3][4].
- The transition from GPU to TPU involves complex technical adaptations, which can mean high costs and extended timelines [4][6].

Supply and Demand Challenges
- There is currently a 50% supply gap in the global AI computing power market, driven by surging TPU demand [5].
- The shortage is delaying projects and raising costs for companies relying on TPUs, particularly affecting TSMC, the main TPU foundry [5].
- The immature software ecosystem around the TPU, notably its incompatibility with the widely used CUDA framework, poses further obstacles to adoption [5][6].

TPU vs. AWS Trainium
- Google's TPU has hardware-level optimization for matrix and tensor operations, giving it significant efficiency advantages over AWS's Trainium, which lacks such integration [7].
- Trainium's reliance on external libraries increases resource consumption and limits efficiency, particularly in large-scale deployments [7].
- The two companies have different strengths in network adaptation, with Google focusing on vertical scaling and AWS on horizontal scaling, producing a differentiated competitive landscape [8].

Oracle's Unexpected Rise
- Oracle has emerged as a key player in the chip market by leveraging government policies and strategic partnerships to secure high-end chip supplies [9][10].
- It has formed partnerships with government entities and other service providers to dominate certain chip markets, creating a dual resource barrier [10].
- Oracle's collaboration with OpenAI on a $300 billion computing-resource deal highlights its strategy of profiting from reselling computing power [10].

OpenAI's Financial and Operational Challenges
- OpenAI faces a significant funding gap, with annual revenues of approximately $12 billion against a projected expansion investment of $300 billion [14].
- Reliance on venture capital and rising compute costs exacerbate its financial pressure [14].
- Its business model struggles with low profitability in the core LLM inference business, requiring a delicate balance between pricing and user retention [15].

Future of Large Models
- The industry is seeing diminishing returns on performance as model sizes increase, while compute costs rise exponentially [17].
- Resource constraints, particularly power supply and dependence on NVIDIA, are becoming critical bottlenecks for large model development [17][18].
- Future development is expected to shift toward more efficient and diverse technical paths rather than pure parameter competition [18][19].

Conclusion
- The competition in AI chips and computing power is a battle for industry dominance, with Google, Oracle, and OpenAI navigating complex challenges and opportunities [19][20].
- The market should stabilize as supply chains improve, but the ability to monetize technology and integrate it into practical applications will be crucial for long-term success [20].
Midstream Intelligent-Driving Vendors Are Racing to Snap Up End-to-End Talent...
自动驾驶之心· 2026-01-16 02:58
Core Viewpoint
- The article discusses technological anxiety in the intelligent driving sector, particularly among midstream manufacturers, highlighting a slowdown in cutting-edge technology development and a trend toward standardized mass-production solutions [1][2].

Group 1: Industry Trends
- Mass production of cutting-edge technologies is expected to begin in 2026, while current intelligent-driving technology is stagnating [2].
- The overall market for passenger vehicles priced above 200,000 yuan is around 7 million units, but the leading new players have not captured even one-third of that volume [2].
- Mature end-to-end technology is seen as a prerequisite for larger-scale mass production, especially as L3 regulations advance this year [2].

Group 2: Educational Initiatives
- A course titled "Practical Class for End-to-End Mass Production" has been launched, focusing on the technical capabilities needed for mass production in intelligent driving [2].
- The course emphasizes practical applications and is limited to a small cohort, with only 8 spots remaining [2].

Group 3: Course Content Overview
- The course covers the following aspects of end-to-end algorithms:
  - An overview of end-to-end tasks, merging perception tasks, and designing learning-based control algorithms [7].
  - Two-stage end-to-end algorithm frameworks, including modeling and information transfer between perception and planning [8].
  - One-stage end-to-end algorithms that allow lossless information transfer, enhancing performance [9].
  - Applying navigation information in autonomous driving, including map formats and encoding methods [10].
  - Reinforcement learning algorithms that complement imitation learning for driving behavior [11].
  - Optimizing trajectory outputs through practical projects involving imitation and reinforcement learning [12].
  - Post-processing logic for trajectory smoothing, ensuring stability and reliability in mass production [13].
  - Mass-production experience shared from multiple perspectives, including data, models, and rules [14].

Group 4: Target Audience
- The course targets advanced learners with a foundational understanding of autonomous driving algorithms, reinforcement learning, and programming [15].
- Participants should have access to a GPU (an RTX 4090 or better is recommended) and familiarity with common algorithm frameworks [18].
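The trajectory-smoothing post-processing mentioned in the course outline can be as simple as a moving average over planned waypoints. The sketch below is a generic illustration, not the course's actual implementation; `smooth_trajectory` is a hypothetical name.

```python
import numpy as np

def smooth_trajectory(points, window=5):
    """Moving-average smoothing of a planned trajectory.

    points: (T, 2) array of (x, y) waypoints. Edge padding keeps the
    output the same length as the input, so downstream control code
    sees the same number of waypoints."""
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(points, ((pad, pad), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid")
         for d in range(points.shape[1])],
        axis=1,
    )

# A straight path with high-frequency lateral jitter, then smoothed.
t = np.linspace(0.0, 1.0, 20)
noisy = np.stack([t, t + 0.05 * np.sin(40 * t)], axis=1)
smoothed = smooth_trajectory(noisy, window=5)
```

In production stacks this step typically sits between the planner's raw output and the controller, trading a little path fidelity for comfort and control stability.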