Diffusion Models

Autoregressive Models Strike Back in Image Generation! Pixel-Level Precise Control, More Efficient and Controllable Than Diffusion
量子位 · 2025-07-29 05:05
Core Viewpoint
- The article discusses the limitations of Diffusion models in AI image generation, particularly in precise control, and introduces MENTOR, a new framework that uses Autoregressive (AR) models for more efficient and controllable multimodal image generation [1][2][3]
Group 1: Challenges in Current Models
- Diffusion models struggle with precise visual control, with balancing multimodal inputs, and with high training costs [2][6]
- The inherent randomness of Diffusion models makes precise control difficult in high-fidelity tasks such as image reconstruction [6]
- Existing methods often exhibit modality imbalance, over-relying on either the reference image or the text instruction [6]
Group 2: Introduction of MENTOR
- MENTOR is a novel AR framework that outperforms Diffusion-based methods such as Emu2 and DreamEngine despite using only one-tenth of the training data and suboptimal model components [2][3]
- The framework employs a two-stage training method to enable efficient multimodal image generation with pixel-level precision [3][8]
Group 3: MENTOR's Design and Training
- MENTOR features a unified AR architecture consisting of a multimodal encoder and an autoregressive generator, allowing token-level alignment between inputs and outputs (a minimal sketch of this setup follows the summary) [9]
- The two-stage training strategy comprises: 1) Multimodal Alignment Pretraining, which focuses on understanding different input types and establishing pixel-level and semantic alignment [10]; 2) Multimodal Instruction Tuning, which strengthens the model's ability to follow instructions and reason across modalities [12]
Group 4: Performance and Efficiency
- MENTOR achieved competitive performance on DreamBench++, surpassing much larger models such as Emu2 (37 billion parameters) and DreamEngine (10.5 billion parameters) while maintaining a lower CP/PF ratio, indicating a better balance between visual-feature preservation and prompt following [15][17]
- Training used approximately 3 million image-text pairs over 1.5 days, a significant efficiency gain over the baseline methods [18]
Group 5: Applications and Future Potential
- MENTOR's framework is highly versatile, handling a range of complex multimodal generation tasks with minimal adjustments [24]
- The article concludes that MENTOR opens a new path for controllable image generation and showcases the potential of AR models in visual generation, while acknowledging that it still lags behind top-tier Diffusion models in some areas [26]
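The article does not reproduce MENTOR's code. As a rough illustration of the token-level AR setup it describes (a multimodal encoder conditioning an autoregressive generator), here is a minimal PyTorch sketch; the module names, the VQ-tokenizer assumption, and all sizes are illustrative, not from the paper.

```python
# Minimal sketch of a unified autoregressive multimodal generator in the
# spirit of MENTOR's description: a multimodal encoder produces condition
# tokens from the text and reference image; a causal AR decoder then
# predicts discrete image tokens. All names and sizes are assumptions.
import torch
import torch.nn as nn

VOCAB = 8192   # assumed codebook size of a VQ image tokenizer
DIM = 512

class MultimodalEncoder(nn.Module):
    """Encodes text tokens and reference-image tokens into one condition sequence."""
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB, DIM)
        self.img_emb = nn.Embedding(VOCAB, DIM)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True),
            num_layers=2)

    def forward(self, text_ids, ref_img_ids):
        x = torch.cat([self.text_emb(text_ids), self.img_emb(ref_img_ids)], dim=1)
        return self.encoder(x)

class ARGenerator(nn.Module):
    """Causal decoder that predicts the next image token given the condition."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(DIM, nhead=8, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tgt_ids, cond):
        T = tgt_ids.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tgt_ids.device)
        h = self.decoder(self.emb(tgt_ids), cond, tgt_mask=mask)
        return self.head(h)

# Training step: plain next-token cross-entropy, which is what gives the
# AR formulation its token-level alignment between inputs and outputs.
enc, gen = MultimodalEncoder(), ARGenerator()
text = torch.randint(0, VOCAB, (2, 16))      # dummy text token ids
ref = torch.randint(0, VOCAB, (2, 64))       # dummy reference-image token ids
target = torch.randint(0, VOCAB, (2, 256))   # dummy target-image token ids
cond = enc(text, ref)
logits = gen(target[:, :-1], cond)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), target[:, 1:].reshape(-1))
loss.backward()
```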
TransDiffuser: Li Auto's VLA Architecture for Generating Trajectories with Diffusion
理想TOP2 · 2025-05-18 13:08
Core Viewpoint
- The article discusses advancements in autonomous driving, focusing on the Diffusion model and its application to generating driving trajectories, and highlights the differences between VLM and VLA systems [1][4]
Group 1: Diffusion Model Explanation
- Diffusion is a generative model that learns the data distribution through a noise-adding Forward Process and a noise-removing Reverse Process, akin to solving a puzzle in reverse [4]
- In the denoising process, a neural network is trained to predict and remove the noise, ultimately generating the target data (a minimal sketch of this training step follows the summary) [4]
- Diffusion generates not only the ego vehicle's trajectory but also predicted trajectories for other vehicles and pedestrians, improving decision-making in complex traffic environments [5]
Group 2: VLM and VLA Systems
- VLM consists of two systems: System 1 outputs trajectories via imitation learning without semantic understanding, while System 2 has semantic understanding but only provides suggestions [2]
- VLA is a single system with both fast and slow thinking capabilities, inherently possessing semantic reasoning [2]
- VLA outputs action tokens that encode the vehicle's driving behavior and surrounding environment, which the Diffusion model then decodes into driving trajectories [4][5]
Group 3: TransDiffuser Architecture
- TransDiffuser is an end-to-end trajectory generation model that integrates multi-modal perception information to produce high-quality, diverse trajectories [6][7]
- The architecture comprises a Scene Encoder that processes multi-modal data and a Denoising Decoder built on the DDPM framework for trajectory generation [7][9]
- During denoising, a multi-head cross-attention mechanism fuses scene and motion features [9]
Group 4: Performance and Innovations
- The model achieves a Predictive Driver Model Score (PDMS) of 94.85, outperforming existing methods [11]
- Key innovations include anchor-free trajectory generation and a multi-modal representation decorrelation optimization mechanism that increases trajectory diversity and reduces redundancy (a generic sketch of such a penalty appears after the summary) [11][12]
Group 5: Limitations and Future Directions
- The authors note challenges in fine-tuning the model, particularly the perception encoder [13]
- Future directions include integrating reinforcement learning and drawing on models such as OpenVLA [13]
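The summary names the pattern (a DDPM-style reverse process whose denoising decoder attends to encoded scene features via multi-head cross-attention) but not the exact layers. Below is a minimal PyTorch sketch of that pattern; the module layout, waypoint format, step count, and noise schedule are all assumptions, not TransDiffuser's actual implementation.

```python
# Sketch of DDPM-style trajectory denoising as described for TransDiffuser:
# noisy trajectory features query encoded scene features through multi-head
# cross-attention, and the network predicts the noise to remove at each
# diffusion step. Shapes, schedule, and names are illustrative.
import torch
import torch.nn as nn

T_STEPS, HORIZON, DIM = 50, 8, 128  # assumed: 50 diffusion steps, 8 (x, y) waypoints

class DenoisingDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.traj_in = nn.Linear(2, DIM)            # (x, y) waypoint -> feature
        self.t_emb = nn.Embedding(T_STEPS, DIM)     # diffusion-step embedding
        self.cross_attn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
        self.noise_out = nn.Linear(DIM, 2)          # predicted noise per waypoint

    def forward(self, noisy_traj, t, scene_feats):
        h = self.traj_in(noisy_traj) + self.t_emb(t)[:, None, :]
        # motion features attend to multi-modal scene features (cross-attention)
        attn, _ = self.cross_attn(h, scene_feats, scene_feats)
        h = h + attn
        return self.noise_out(h + self.ff(h))

# Linear beta schedule, as in vanilla DDPM (an assumption here).
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

model = DenoisingDecoder()

def training_step(clean_traj, scene_feats):
    """Forward process: noise a clean trajectory, then learn to predict the noise."""
    b = clean_traj.size(0)
    t = torch.randint(0, T_STEPS, (b,))
    eps = torch.randn_like(clean_traj)
    ab = alphas_bar[t].view(b, 1, 1)
    noisy = ab.sqrt() * clean_traj + (1 - ab).sqrt() * eps
    return nn.functional.mse_loss(model(noisy, t, scene_feats), eps)

loss = training_step(torch.randn(2, HORIZON, 2),   # dummy clean trajectories
                     torch.randn(2, 32, DIM))      # dummy encoded scene tokens
loss.backward()
```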
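The article names a "multi-modal representation decorrelation optimization mechanism" without giving its form. A common way to implement decorrelation is to penalize off-diagonal covariance between feature dimensions, sketched below as a generic formulation, not necessarily the paper's actual loss.

```python
# Hypothetical decorrelation penalty: push the covariance matrix of the
# fused representation toward diagonal so feature dimensions carry less
# redundant information. A generic formulation, not TransDiffuser's exact one.
import torch

def decorrelation_loss(feats: torch.Tensor) -> torch.Tensor:
    """feats: (batch, dim) pooled multi-modal features."""
    z = feats - feats.mean(dim=0, keepdim=True)          # center each dimension
    cov = (z.T @ z) / (z.size(0) - 1)                    # (dim, dim) covariance
    off_diag = cov - torch.diag(torch.diagonal(cov))     # zero out the diagonal
    return (off_diag ** 2).sum() / feats.size(1)         # penalize cross-terms

print(decorrelation_loss(torch.randn(16, 64)))
```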