Autoregressive Models Strike Back in Image Generation: Pixel-Level Precision Control, More Efficient and Controllable than Diffusion
量子位· 2025-07-29 05:05
Core Viewpoint
- The article discusses the limitations of Diffusion models in AI image generation, particularly in precise control, and introduces MENTOR, a new framework that uses autoregressive (AR) models for more efficient and controllable multimodal image generation [1][2][3].

Group 1: Challenges in Current Models
- Diffusion models struggle with precise visual control, with balancing multimodal inputs, and with high training costs [2][6].
- The inherent randomness of the Diffusion process makes precise control difficult in high-fidelity tasks such as image reconstruction [6].
- Existing methods often exhibit modality imbalance, over-relying on either reference images or text instructions [6].

Group 2: Introduction of MENTOR
- MENTOR is a novel AR framework that outperforms Diffusion-based methods such as Emu2 and DreamEngine while using only one-tenth of the training data, even with suboptimal model components [2][3].
- The framework employs a two-stage training method to enable efficient multimodal image generation with pixel-level precision [3][8].

Group 3: MENTOR's Design and Training
- MENTOR pairs a multimodal encoder with an autoregressive generator in a unified AR architecture, enabling token-level alignment between inputs and outputs [9] (a conceptual sketch appears after this summary).
- The two-stage training strategy consists of:
  1. Multimodal Alignment Pretraining: teaches the model to understand different input types and establishes pixel-level and semantic alignment [10].
  2. Multimodal Instruction Tuning: strengthens the model's ability to follow instructions and reason across modalities [12].

Group 4: Performance and Efficiency
- MENTOR achieved competitive performance on DreamBench++, surpassing much larger models such as Emu2 (37 billion parameters) and DreamEngine (10.5 billion parameters), while maintaining a lower CP/PF ratio, indicating a better balance between visual feature preservation (concept preservation, CP) and prompt following (PF) [15][17]; a toy CP/PF computation follows the architecture sketch below.
- Training used approximately 3 million image-text pairs over about 1.5 days, a significant efficiency gain over other baseline methods [18].

Group 5: Applications and Future Potential
- The MENTOR framework is highly versatile, capable of handling a variety of complex multimodal generation tasks with minimal adjustments [24].
- The article concludes that MENTOR opens a new path for controllable image generation tasks and showcases the potential of AR models in visual generation, while acknowledging that it still lags behind top-tier Diffusion models in some areas [26].
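To make the unified AR architecture concrete, below is a minimal, hypothetical PyTorch sketch of the pipeline the summary describes: a multimodal encoder merges reference-image tokens and text tokens into one prefix sequence, and an autoregressive generator decodes discrete image tokens from it. Every size, vocabulary, and module name here is an illustrative assumption; the article does not specify MENTOR's actual implementation.

```python
# Hypothetical sketch of MENTOR-style autoregressive multimodal generation.
# All dimensions and vocabularies below are assumed toy values.
import torch
import torch.nn as nn

VOCAB_TEXT, VOCAB_IMAGE, DIM = 1000, 8192, 256  # assumed toy sizes

class MultimodalEncoder(nn.Module):
    """Embeds reference-image tokens and text tokens into one prefix sequence."""
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB_TEXT, DIM)
        self.image_emb = nn.Embedding(VOCAB_IMAGE, DIM)

    def forward(self, text_ids, image_ids):
        # Token-level alignment: both modalities live in the same sequence space.
        return torch.cat([self.image_emb(image_ids), self.text_emb(text_ids)], dim=1)

class ARGenerator(nn.Module):
    """Causal transformer that predicts the next image token given the prefix."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.out_emb = nn.Embedding(VOCAB_IMAGE, DIM)
        self.head = nn.Linear(DIM, VOCAB_IMAGE)

    def forward(self, prefix, generated_ids):
        seq = torch.cat([prefix, self.out_emb(generated_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.backbone(seq, mask=mask)
        return self.head(hidden[:, -1])  # logits for the next image token

encoder, generator = MultimodalEncoder(), ARGenerator()
text = torch.randint(0, VOCAB_TEXT, (1, 8))        # toy text prompt
ref_image = torch.randint(0, VOCAB_IMAGE, (1, 16))  # toy reference-image tokens
prefix = encoder(text, ref_image)

tokens = torch.zeros(1, 1, dtype=torch.long)  # assumed BOS image token
for _ in range(16):  # greedily decode a tiny 16-token "image"
    next_logits = generator(prefix, tokens)
    next_id = next_logits.argmax(dim=-1, keepdim=True)
    tokens = torch.cat([tokens, next_id], dim=1)
print(tokens.shape)  # (1, 17): BOS + 16 generated image tokens
```

Because the image is produced token by token conditioned on both modalities in one sequence, control is deterministic at the token level rather than mediated through an iterative denoising process, which is the efficiency and controllability argument the article makes for AR models.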
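The CP/PF ratio cited above can likewise be made concrete with made-up numbers. CP (concept preservation) measures how well the reference subject's visual features survive, and PF (prompt following) measures adherence to the text instruction; DreamBench++'s exact scoring protocol is not given in the article, so only the bookkeeping below is real.

```python
# Toy illustration of the CP/PF trade-off; the scores are invented.
def cp_pf_summary(cp: float, pf: float) -> dict:
    return {
        "CPxPF": cp * pf,  # a combined score: high only if both are high
        "CP/PF": cp / pf,  # the ratio the article cites; values far above 1.0
    }                      # suggest over-reliance on the reference image

# A hypothetical comparison: a model that copies the reference too
# literally (high CP, low PF) versus a more balanced one.
print(cp_pf_summary(cp=0.90, pf=0.45))  # ratio 2.0: modality imbalance
print(cp_pf_summary(cp=0.80, pf=0.75))  # ratio ~1.07: better balanced
```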
Evaluating Image Editing Models' Reasoning from a Knowledge-Type Perspective: All Models Underperform on "Procedural Reasoning"
量子位· 2025-06-13 05:07
Core Viewpoint
- The article discusses the development of KRIS-Bench, a benchmark for evaluating the reasoning capabilities of image editing models, built around a structured knowledge-acquisition process similar to human learning [2][3][16].

Group 1: KRIS-Bench Overview
- KRIS-Bench is a collaborative effort involving multiple prestigious institutions aimed at assessing AI's reasoning abilities in image editing [2].
- The benchmark categorizes knowledge into three types: Factual Knowledge, Conceptual Knowledge, and Procedural Knowledge, confronting AI with progressively harder editing challenges [4][8].
- It spans 7 reasoning dimensions and 22 typical editing tasks, ranging from basic to advanced difficulty levels [6].

Group 2: Evaluation Metrics
- KRIS-Bench introduces a four-dimensional automated evaluation system that scores editing outputs on Visual Consistency, Visual Quality, Instruction Following, and Knowledge Plausibility [10][11][13]; a toy aggregation sketch follows this summary.
- The evaluation set comprises 1,267 image-instruction pairs, meticulously curated by an expert team to ensure diverse data sources and to prevent models from exploiting shortcuts [12].

Group 3: Model Performance Insights
- The benchmark tests 10 models (3 closed-source and 7 open-source), revealing performance gaps, particularly in procedural reasoning and natural science tasks [14][16].
- Closed-source models such as GPT-Image-1 lead in performance, while open-source models such as BAGEL-Think improve knowledge plausibility through enhanced reasoning processes [17]; even so, and despite gains on factual knowledge tasks, many models still struggle with procedural reasoning and complex scientific tasks, indicating a significant gap in deep reasoning capabilities [18].
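To make the four-dimensional scoring concrete, here is a small, hypothetical aggregation sketch. The dimension names come from the article; the 0-5 scale, equal weighting, and averaging rule are assumptions, not KRIS-Bench's published protocol.

```python
# Hypothetical per-sample aggregation of KRIS-Bench's four automated scores.
from statistics import mean

DIMENSIONS = (
    "visual_consistency",     # untouched regions stay intact
    "visual_quality",         # overall image fidelity
    "instruction_following",  # did the edit match the instruction?
    "knowledge_plausibility", # is the result consistent with world knowledge?
)

def aggregate(scores: dict[str, float]) -> float:
    """Average the four dimension scores; raises KeyError if one is missing."""
    return mean(scores[d] for d in DIMENSIONS)

# One hypothetical image-instruction pair scored on an assumed 0-5 scale.
sample = {
    "visual_consistency": 4.5,
    "visual_quality": 4.0,
    "instruction_following": 3.5,
    "knowledge_plausibility": 2.0,  # plausible-looking but factually off
}
print(round(aggregate(sample), 2))  # 3.5
```

Separating Knowledge Plausibility from Instruction Following is what lets the benchmark expose models that execute an edit faithfully yet violate factual, conceptual, or procedural knowledge, which is precisely the gap the article reports.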