BAAI Open-Sources EditScore: Unlocking the Full Potential of Online Reinforcement Learning for Image Editing
机器之心 · 2025-10-22 03:30
With the continuous evolution of multimodal large models, instruction-guided image editing has made significant progress. However, existing models still struggle to follow complex, fine-grained text instructions, often forcing users into repeated attempts and manual filtering rather than delivering stable, high-quality edits in one shot. Reinforcement learning (RL) offers a promising path for models to self-improve and strengthen instruction following, but its application to image editing has long been held back by a core bottleneck: the lack of a reward model that can precisely evaluate editing quality and provide high-fidelity feedback. Without a reliable reward signal, a model cannot effectively judge the quality of its own outputs and thus cannot optimize itself efficiently. To tackle this problem, the VectorSpace Lab team at the Beijing Academy of Artificial Intelligence (BAAI, 智源) recently released EditScore, a new series of high-fidelity reward models. The work directly addresses this challenge, aiming to provide precise, reliable reward signals for instruction-guided image editing and to pave the way for deeper applications of reinforcement learning in AIGC, truly unlocking its potential. EditScore is BAAI's latest step toward more general and more controllable generative AI, following its unified image generation model series OmniGen. To promote ...
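To make the role of a reward signal concrete, here is a minimal sketch of how a reward model such as EditScore could drive a best-of-N selection loop around an image editor. The `edit_model` and `reward_model` objects and their methods are hypothetical placeholders for illustration, not the actual EditScore API.

```python
# Hypothetical sketch: using a reward model to pick the best of N candidate edits.
# `edit_model` and `reward_model` are placeholder interfaces, not the real EditScore API.
from dataclasses import dataclass


@dataclass
class EditCandidate:
    image: object   # edited image (e.g., a PIL.Image in a real pipeline)
    score: float    # scalar reward assigned by the reward model


def best_of_n_edit(edit_model, reward_model, source_image, instruction, n=8):
    """Sample n candidate edits and return the one the reward model rates highest."""
    candidates = []
    for _ in range(n):
        edited = edit_model.edit(source_image, instruction)            # stochastic sampling
        score = reward_model.score(source_image, edited, instruction)  # higher = better edit
        candidates.append(EditCandidate(edited, score))
    return max(candidates, key=lambda c: c.score)
```

The same scalar score can also serve as the reward in an online RL loop (for example, a PPO- or GRPO-style policy update), which is the use case a high-fidelity reward model like EditScore is meant to enable.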
New Domestic SOTA Model Precisely Nails "Draw an Animal with (3+6) Lives" | Open Source
量子位 · 2025-06-20 03:28
Core Viewpoint
- The article discusses advancements in AI, focusing on the new model MindOmni, which enhances reasoning and generative capabilities in image generation, allowing for more coherent and logical outputs from complex instructions [7][9][44].

Group 1: MindOmni Model Overview
- MindOmni is a collaborative effort by Tsinghua University, Tencent ARC Lab, and other institutions, designed to improve AI's reasoning-driven generation capabilities [7].
- The model integrates visual understanding and generative abilities, building on the visual language model Qwen2.5-VL [14][18].
- The core module for image generation is a diffusion decoder, which transforms noise into realistic images through an iterative denoising process [15][16].

Group 2: Training Phases
- MindOmni is trained in three phases: basic pre-training, supervised fine-tuning, and reasoning generation policy optimization (RGPO) [19][32].
- In the pre-training phase, the model learns basic text-to-image generation from open-source image-text pairs [20].
- The RGPO phase employs reinforcement learning to enhance the model's ability to generate logical reasoning chains [26][29] (see the policy-update sketch after this summary).

Group 3: Performance Metrics
- MindOmni shows superior performance across multimodal understanding and generation benchmarks, outperforming previous models [36][38].
- On image understanding tasks, MindOmni improved by 10.6% on MMMU and 9.8% on MMBench compared to earlier models [38][39].
- The model achieved an overall score of 83% on the GenEval benchmark, demonstrating strong generative capability [40].

Group 4: Reasoning Generation Capabilities
- MindOmni excels at reasoning generation, scoring 0.71 on the WISE benchmark across multiple subcategories [45].
- The model interprets complex prompts effectively, such as generating images from mathematical expressions, showcasing its reasoning ability [46][47].
- Its performance on multimodal inputs further highlights its advanced understanding and generation capabilities [48].

Group 5: Ablation Studies
- Extensive ablation studies confirm that each training phase contributes significantly to the model's performance [50].
- Pre-training establishes basic generation ability, while supervised fine-tuning and RGPO further refine reasoning generation [50][51].
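As an illustration of the kind of update RGPO performs, here is a minimal sketch of a group-relative, clipped policy-gradient loss in the style of GRPO, the family of methods RGPO belongs to. All tensor names and the clipping constant are illustrative assumptions; MindOmni's actual objective, reward design, and regularization follow its paper.

```python
# Illustrative sketch of a group-relative policy-gradient loss (GRPO-style).
# Not MindOmni's actual RGPO code; names and hyperparameters are assumptions.
import torch


def group_relative_policy_loss(log_probs, old_log_probs, rewards, clip_eps=0.2):
    """
    log_probs:     (G,) new-policy log-probabilities of G sampled responses for one prompt
    old_log_probs: (G,) log-probabilities under the policy that generated the samples
    rewards:       (G,) scalar rewards for each sampled response
    """
    # Group-relative advantage: normalize rewards within the group of G samples.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate objective on the importance ratio.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # minimize the negative surrogate
```

Normalizing rewards within each sampled group removes the need for a separate value network, which is what makes this family of updates attractive for fine-tuning large generative models.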
A Knowledge-Type Perspective on Comprehensively Evaluating the Reasoning Ability of Image Editing Models: All Models Perform Poorly at "Procedural Reasoning"
量子位 · 2025-06-13 05:07
Core Viewpoint
- The article discusses the development of KRIS-Bench, a benchmark for evaluating the reasoning capabilities of image editing models, grounded in a structured view of knowledge acquisition similar to human learning [2][3][16].

Group 1: KRIS-Bench Overview
- KRIS-Bench is a collaborative effort involving multiple prestigious institutions, aimed at assessing AI's reasoning abilities in image editing [2].
- The benchmark categorizes knowledge into three types: Factual Knowledge, Conceptual Knowledge, and Procedural Knowledge, so that models face progressively more complex editing challenges [4][8].
- It spans 7 reasoning dimensions and 22 typical editing tasks, ranging from basic to advanced difficulty [6].

Group 2: Evaluation Metrics
- KRIS-Bench introduces a four-dimensional automated evaluation system to score editing outputs: Visual Consistency, Visual Quality, Instruction Following, and Knowledge Plausibility [10][11][13] (see the scoring sketch after this summary).
- The evaluation set comprises 1,267 image-instruction pairs, meticulously curated by experts to ensure diverse data sources [12].

Group 3: Model Performance Insights
- The benchmark tests 10 models, including 3 closed-source and 7 open-source models, revealing performance gaps, particularly in procedural reasoning and natural-science tasks [14][16].
- Closed-source models like GPT-Image-1 lead in performance, while open-source models like BAGEL-Think show improvements in knowledge plausibility through enhanced reasoning processes [17].
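Here is a minimal sketch of how a four-dimensional automated scoring scheme of this kind could be aggregated per edit. The dimension names come from the article; the equal weighting, the 0-10 scale, and the function interface are assumptions, not the published KRIS-Bench protocol.

```python
# Hypothetical sketch of four-dimensional scoring for one edited image.
# Dimension names follow KRIS-Bench; weights and the 0-10 scale are assumptions.
DIMENSIONS = ("visual_consistency", "visual_quality",
              "instruction_following", "knowledge_plausibility")


def aggregate_scores(per_dim_scores: dict[str, float],
                     weights: dict[str, float] | None = None) -> float:
    """Combine per-dimension scores (each in [0, 10]) into one scalar."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}  # equal weights by default
    total_w = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * per_dim_scores[d] for d in DIMENSIONS) / total_w


# Example: an edit that follows the instruction but violates physical plausibility
# still scores low overall, which is the point of a knowledge-plausibility axis.
print(aggregate_scores({"visual_consistency": 8.0, "visual_quality": 9.0,
                        "instruction_following": 9.0, "knowledge_plausibility": 3.0}))
```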
BAAI Releases the "Wujie" Series of Large Models, Including Emu3, the World's First Native Multimodal World Model
Feng Huang Wang · 2025-06-06 14:32
Core Insights
- The Zhiyuan Research Institute (BAAI) launched the "Wujie" series of large models, including Emu3, Brainμ, RoboOS 2.0, RoboBrain 2.0, and OpenComplex2, at the 2025 Beijing Zhiyuan Conference [1]

Group 1: Emu3 and Brainμ Models
- Emu3 is a native multimodal world model that uses a next-token prediction paradigm for unified multimodal learning, encoding images and videos into discrete token sequences [2] (see the sketch after this summary)
- Brainμ, built on the Emu3 architecture, integrates brain signals as a new modality, enabling a single model to perform various neuroscience tasks and potentially becoming the "AlphaFold" of brain science [2][3]

Group 2: RoboOS 2.0 and RoboBrain 2.0
- RoboOS 2.0 is described as the world's first open-source SaaS framework for embodied intelligence, significantly lowering development barriers and improving performance by 30% over its predecessor [4]
- RoboBrain 2.0 enhances multi-agent task planning, achieving a 74% improvement in task-planning accuracy over RoboBrain 1.0 [5]

Group 3: OpenComplex2 Model
- OpenComplex2 represents a breakthrough in modeling biological molecules, capturing molecular interactions at atomic resolution and offering insight into the relationship between microscopic fluctuations and macroscopic biological function [6][7]

Group 4: Open Source Initiatives
- Zhiyuan has open-sourced roughly 200 models and 160 datasets, and the FlagOS software stack has been upgraded to support various AI hardware, improving performance by up to 23% [8]

Group 5: Applications and Collaborations
- The Brainμ model has shown potential in consumer-grade brain-computer interface applications, with collaborations underway with leading neuroscience laboratories and companies to expand its industrial applications [3][11]
- A digital-twin heart and a drug-safety evaluation platform demonstrate the application of advanced modeling techniques in pharmacology and personalized medicine [12]
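To illustrate the next-token-prediction paradigm the article attributes to Emu3, here is a minimal sketch of training a single autoregressive transformer on one shared vocabulary of text tokens and discrete image tokens. The vocabulary sizes, tokenizer layout, and tiny model below are generic placeholders, not Emu3's actual architecture.

```python
# Illustrative sketch: unified next-token prediction over text + discrete image tokens.
# Vocabulary layout and model are placeholders, not Emu3's actual design.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000     # assumed text vocabulary size
IMAGE_VOCAB = 8_192     # assumed visual-codebook size (e.g., from a VQ tokenizer)
VOCAB = TEXT_VOCAB + IMAGE_VOCAB  # one shared vocabulary for both modalities


class TinyARModel(nn.Module):
    """A minimal decoder-only stand-in for a multimodal transformer."""
    def __init__(self, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq)
        seq = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq)  # causal mask
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)  # next-token logits over the shared vocabulary


# Training step: text and image tokens share one sequence and one loss.
model = TinyARModel()
text = torch.randint(0, TEXT_VOCAB, (2, 16))       # caption tokens
image = torch.randint(TEXT_VOCAB, VOCAB, (2, 64))  # quantized image tokens
seq = torch.cat([text, image], dim=1)              # text first, then image
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
```

Because both modalities live in one token stream with one cross-entropy objective, the same model can, in principle, continue a text prefix with image tokens (generation) or an image prefix with text tokens (understanding), which is what makes the "unified" framing natural for a next-token world model.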