Image Editing Agent
A godsend for photo-editing novices: a smart retouching agent that precisely invokes 200+ professional tools from a single sentence, from Tencent Hunyuan & Xiamen University
36Kr · 2025-12-26 07:11
One sentence turns a photo into a showpiece: simpler than professional software and more controllable than typical AI retouching! Tencent Hunyuan, together with Xiamen University, has released JarvisEvo, a unified image-editing agent that imitates a human expert designer, retouching images through iterative editing, visual perception, self-evaluation, and self-reflection.

Existing approaches face two problems:

1. Instruction hallucination: text-only chain-of-thought (Text-only CoT) reasoning has an information bottleneck. During reasoning the model cannot "see" the intermediate editing results and has to imagine, from text alone, what the next operation will produce visually. This easily leads to factual errors and makes it impossible to guarantee that every step matches the user's intent.

2. Reward hacking: during reinforcement-learning preference alignment, the policy model is updated dynamically while the reward model is usually static. The policy therefore tends to find loopholes, gaming the reward function for high scores rather than genuinely improving editing quality and self-evaluation ability (a toy sketch of this failure mode follows this excerpt).

To address these problems, the team built JarvisEvo: "think like an expert, polish like a craftsman." JarvisEvo can not only retouch images with Lightroom, but also "see" how the image changes after each edit and judge the result for itself, achieving self-evolution without an external reward.

The details follow below.

Self-evaluation and correction

Research background and motivation

In recent years, instruction-based image-editing models have made notable progress, but in pursuing "professional-grade ...
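To make the reward-hacking failure mode above concrete, here is a minimal, self-contained toy sketch. It is not from the paper: true_quality, static_reward, and the hill-climbing "policy" are all illustrative placeholders. The point it shows is that a frozen reward model which imperfectly prefers stronger edits lets a policy drive its score up even as the hypothetical true quality degrades past the optimum.

```python
# Toy illustration of reward hacking: a static (frozen) reward model vs. a policy
# that keeps updating. All functions here are made-up stand-ins, not JarvisEvo code.
import random

def true_quality(edit_strength: float) -> float:
    """Hypothetical 'real' edit quality: best at a moderate strength, worse at extremes."""
    return 1.0 - abs(edit_strength - 0.5)

def static_reward(edit_strength: float) -> float:
    """A frozen reward proxy that simply prefers stronger edits.
    Because it never updates, it becomes exploitable once the policy drifts past 0.5."""
    return edit_strength

# A trivial hill-climbing "policy" that only ever sees the static reward.
strength = 0.4
for _ in range(50):
    candidate = min(1.0, strength + random.uniform(0.0, 0.05))
    if static_reward(candidate) > static_reward(strength):
        strength = candidate  # reward keeps going up ...

print(f"final strength={strength:.2f}, "
      f"static reward={static_reward(strength):.2f}, "
      f"true quality={true_quality(strength):.2f}")
# The static reward rises toward 1.0 while true quality falls toward 0.5 —
# the "gaming the reward function" behavior the article calls reward hacking.
```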
A godsend for photo-editing novices! A smart retouching agent that precisely invokes 200+ professional tools from a single sentence, from Tencent Hunyuan & Xiamen University
量子位 (QbitAI) · 2025-12-26 04:24
Core Viewpoint
- JarvisEvo, developed by Tencent and Xiamen University, is an advanced image-editing AI that simulates human expert designers through iterative editing, visual perception, self-evaluation, and self-reflection, aiming to provide a more controllable and professional editing experience than traditional software and AI tools [1][3].

Group 1: Challenges in Image Editing
- The article identifies two main challenges in achieving a professional-level editing experience: Instruction Hallucination, where existing models struggle to visualize intermediate results and often make factual errors, and Reward Hacking, where models exploit static reward systems to gain high scores without genuinely improving editing quality [4][5].

Group 2: JarvisEvo's Mechanisms
- JarvisEvo introduces the iMCoT (Interleaved Multimodal Chain-of-Thought) mechanism, allowing the model to generate a new image after each editing step and use that visual feedback for subsequent reasoning, breaking the limitations of traditional "blind" editing (a minimal loop sketch follows this summary) [8][9].
- The SEPO (Synergistic Editor-Evaluator Policy Optimization) framework enables JarvisEvo to learn from mistakes by comparing low- and high-scoring trajectories, thus developing a strong self-correction ability (a generic preference-loss sketch follows) [11][12].

Group 3: System Architecture
- The system operates in a four-step process: visual perception and planning, step-by-step execution, self-evaluation, and self-reflection, ensuring precise execution of each operation [18][16].
- The model utilizes two optimization loops: the Editor Policy Optimization loop focuses on improving tool usage for better image quality, while the Evaluator Policy Optimization loop ensures the model's scoring aligns with human aesthetic standards [17][25].

Group 4: Training Framework
- JarvisEvo's training consists of three stages: Cold-Start Supervised Fine-Tuning with 150K labeled samples to teach basic skills, SEPO Reinforcement Learning with 20K standard instruction data for autonomous exploration, and Reflection Fine-Tuning with 5K reflection samples to enhance self-correction capabilities (restated as a short config sketch below) [20][22][31].

Group 5: Experimental Results
- In evaluations, JarvisEvo achieved a Spearman Rank Correlation Coefficient (SRCC) of 0.7243 and a Pearson Linear Correlation Coefficient (PLCC) of 0.7116, outperforming other models and demonstrating superior alignment with human preferences (the correlation sketch below shows how these metrics are computed) [36][38].
- The model showed a 44.96% improvement on the L1 and L2 metrics compared with commercial models, preserving original image details while excelling in style and detail presentation [34][40].

Group 6: Future Prospects
- The collaborative evolution paradigm of JarvisEvo is expected to extend beyond image editing to areas such as mathematical reasoning, code generation, and long-term planning, with ongoing efforts to enhance its capabilities for complex tasks [44][45].
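As a reading aid for Groups 2 and 3, here is a minimal sketch of the kind of interleaved perceive-edit-evaluate-reflect loop the article attributes to iMCoT. It is an illustration under stated assumptions, not JarvisEvo's actual code: model, plan_next_edit, render_with_tool, and evaluate are hypothetical placeholders for whatever multimodal model and tool runtime the real system uses.

```python
# Sketch of an interleaved multimodal editing loop: after every tool call the agent
# re-perceives the rendered intermediate image instead of reasoning about it in text only.
from dataclasses import dataclass

@dataclass
class Step:
    tool: str      # e.g. an exposure, white-balance, or curve adjustment
    params: dict   # tool arguments proposed by the model
    score: float   # the agent's own evaluation of the intermediate result

def imcot_edit(model, image, instruction, max_steps=5, accept_threshold=0.8):
    """Iteratively edit `image`, feeding every intermediate result back to the model."""
    history: list[Step] = []
    current = image
    for _ in range(max_steps):
        # 1. Visual perception + planning: the model looks at the *current* image rather
        #    than guessing in text what it might look like (the text-only CoT bottleneck).
        tool, params = model.plan_next_edit(current, instruction, history)
        # 2. Step-by-step execution: apply one professional tool at a time.
        candidate = model.render_with_tool(current, tool, params)
        # 3. Self-evaluation: score the intermediate result against the instruction.
        score = model.evaluate(candidate, instruction)
        # 4. Self-reflection: keep the edit only if it beats the best result so far.
        best_so_far = max((s.score for s in history), default=0.0)
        history.append(Step(tool, params, score))
        if score >= best_so_far:
            current = candidate
        if score >= accept_threshold:
            break
    return current, history
```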
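The SEPO description in Group 2 (learning by comparing low- and high-scoring trajectories) resembles a pairwise preference objective. The sketch below uses a generic pairwise logistic loss as a stand-in; the paper's actual SEPO objective may differ, and the log-probability tensors are invented for illustration only.

```python
# Generic pairwise-preference sketch: nudge the editor policy toward the trajectory
# that its evaluator scored higher. Not the paper's actual SEPO loss.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(logp_high: torch.Tensor, logp_low: torch.Tensor) -> torch.Tensor:
    """Encourage higher policy log-probability for the preferred (higher-scoring) trajectory."""
    return -F.logsigmoid(logp_high - logp_low).mean()

# Hypothetical per-trajectory log-probabilities under the current editor policy.
logp_high = torch.tensor([-12.3, -10.1], requires_grad=True)  # higher-scoring trajectories
logp_low = torch.tensor([-11.0, -13.5], requires_grad=True)   # lower-scoring trajectories

loss = pairwise_preference_loss(logp_high, logp_low)
loss.backward()  # gradients push probability mass toward the preferred trajectories
print(float(loss))
```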
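The three-stage recipe in Group 4 can be restated as a small configuration table. Only the stage names and sample counts come from the summary above; the field names and ordering of details are illustrative, not the paper's actual training settings.

```python
# Compact restatement of the three-stage training pipeline described in Group 4.
TRAINING_STAGES = [
    {"stage": "cold_start_sft",        "data": "150K labeled editing samples",
     "goal": "teach basic tool usage and editing skills"},
    {"stage": "sepo_rl",               "data": "20K standard instruction samples",
     "goal": "autonomous exploration with editor/evaluator co-optimization"},
    {"stage": "reflection_finetuning", "data": "5K reflection samples",
     "goal": "strengthen self-correction after low-scoring edits"},
]

for cfg in TRAINING_STAGES:
    print(f"{cfg['stage']:>22}: {cfg['data']} -> {cfg['goal']}")
```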
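The SRCC and PLCC figures in Group 5 measure how well the evaluator's scores track human ratings. The snippet below shows the standard way such correlations are computed with scipy; the score lists are made-up placeholders, not the paper's data.

```python
# Standard computation of SRCC (rank agreement) and PLCC (linear agreement)
# between human ratings and model-predicted scores.
from scipy.stats import spearmanr, pearsonr

human_scores = [4.5, 3.0, 2.0, 5.0, 3.5]   # hypothetical human aesthetic ratings
model_scores = [4.2, 3.1, 2.4, 4.8, 3.2]   # hypothetical evaluator scores for the same edits

srcc, _ = spearmanr(human_scores, model_scores)   # rank correlation (order agreement)
plcc, _ = pearsonr(human_scores, model_scores)    # linear correlation (value agreement)
print(f"SRCC={srcc:.4f}, PLCC={plcc:.4f}")
```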