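Good news for photo-editing beginners! An intelligent retouching Agent precisely invokes 200+ professional tools from a single sentence, from Tencent Hunyuan & Xiamen University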
QbitAI (量子位) · 2025-12-26 04:24
Core Viewpoint
- JarvisEvo, developed by Tencent and Xiamen University, is an image editing AI that simulates human expert designers through iterative editing, visual perception, self-evaluation, and self-reflection. It aims to provide a more controllable and professional editing experience than traditional software and existing AI tools [1][3].

Group 1: Challenges in Image Editing
- The article identifies two main obstacles to a professional-level editing experience [4][5].
- Instruction Hallucination: existing models struggle to visualize intermediate results and often make factual errors.
- Reward Hacking: models exploit static reward systems to gain high scores without genuinely improving editing quality.

Group 2: JarvisEvo's Mechanisms
- JarvisEvo introduces the iMCoT (Interleaved Multimodal Chain-of-Thought) mechanism: the model generates a new image after each editing step and uses that visual feedback in its subsequent reasoning, which removes the limitation of traditional blind editing [8][9].
- The SEPO (Synergistic Editor-Evaluator Policy Optimization) framework lets JarvisEvo learn from its mistakes by comparing low-scoring and high-scoring trajectories, which develops a strong self-correction ability [11][12].

Group 3: System Architecture
- The system operates in a four-step process: visual perception and planning, step-by-step execution, self-evaluation, and self-reflection. This ensures precise execution of each operation [18][16].
- The model uses two optimization loops: the Editor Policy Optimization loop improves tool usage for better image quality, while the Evaluator Policy Optimization loop keeps the model's scoring aligned with human aesthetic standards [17][25].
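The four-step process and the edit-then-look loop described above can be illustrated with a toy sketch. This is not JarvisEvo's implementation: the "image" is just a dictionary of two scalar properties, and every name here (`TOOLS`, `next_step`, `run_pass`, `evaluate`, `edit`, the half-step planning rule, and the 0.9 threshold) is a hypothetical stand-in chosen only to show the control flow.

```python
from typing import Callable, Dict, List, Tuple

# Toy "image": scalar properties instead of pixels (an assumption for illustration).
Image = Dict[str, float]
Tool = Callable[[Image, float], Image]


def adjust(key: str) -> Tool:
    """Build a hypothetical tool that shifts one image property by `amount`."""
    def tool(img: Image, amount: float) -> Image:
        out = dict(img)          # never mutate the previous intermediate image
        out[key] = out[key] + amount
        return out
    return tool


# Hypothetical tool registry; a real agent would expose many more tools.
TOOLS: Dict[str, Tool] = {
    "brightness": adjust("brightness"),
    "contrast": adjust("contrast"),
}


def next_step(img: Image, target: Image) -> Tuple[str, float]:
    """Perception and planning: look at the current image, pick the property
    with the largest remaining error, and propose a half-step toward the target."""
    key = max(target, key=lambda k: abs(target[k] - img[k]))
    return key, (target[key] - img[key]) / 2.0


def run_pass(img: Image, target: Image, steps: int) -> List[Image]:
    """Step-by-step execution: every tool call yields a new intermediate image,
    and the next step is planned from that image rather than from the original."""
    trajectory = [img]
    for _ in range(steps):
        name, amount = next_step(trajectory[-1], target)
        trajectory.append(TOOLS[name](trajectory[-1], amount))
    return trajectory


def evaluate(img: Image, target: Image) -> float:
    """Self-evaluation: a score in (0, 1] that grows as the L1 distance shrinks."""
    dist = sum(abs(img[k] - target[k]) for k in target)
    return 1.0 / (1.0 + dist)


def edit(img: Image, target: Image, threshold: float = 0.9,
         steps: int = 4, max_passes: int = 5) -> Tuple[Image, float, int]:
    """Self-reflection: if the self-assigned score is below the threshold,
    run another editing pass starting from the latest result."""
    current, passes = img, 0
    score = evaluate(current, target)
    while score < threshold and passes < max_passes:
        current = run_pass(current, target, steps)[-1]
        score = evaluate(current, target)
        passes += 1
    return current, score, passes


if __name__ == "__main__":
    start = {"brightness": 0.0, "contrast": 0.0}
    goal = {"brightness": 0.8, "contrast": 0.4}
    result, score, passes = edit(start, goal)
    print(result, round(score, 3), passes)
```

In this toy the target is known and the evaluator is a fixed formula. In the system the article describes, both the plan and the score come from the model itself, which is why the evaluator needs its own optimization loop.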
Group 4: Training Framework
- JarvisEvo's training consists of three stages [20][22][31].
- Cold-Start Supervised Fine-Tuning: 150K labeled samples teach basic skills.
- SEPO Reinforcement Learning: 20K standard instruction samples drive autonomous exploration.
- Reflection Fine-Tuning: 5K reflection samples strengthen self-correction.

Group 5: Experimental Results
- In evaluations, JarvisEvo achieved a Spearman Rank Correlation Coefficient (SRCC) of 0.7243 and a Pearson Linear Correlation Coefficient (PLCC) of 0.7116, outperforming other models and showing closer alignment with human preferences [36][38].
- The model showed a 44.96% improvement in L1 and L2 metrics compared to commercial models, preserving original image details while excelling in style and detail presentation [34][40].

Group 6: Future Prospects
- The collaborative evolution paradigm of JarvisEvo is expected to extend beyond image editing to areas such as mathematical reasoning, code generation, and long-term planning, with ongoing efforts to enhance its capabilities for complex tasks [44][45].
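For readers unfamiliar with the two correlation metrics quoted in the experimental results, the sketch below shows how SRCC and PLCC are conventionally computed between model scores and human ratings. It is a plain-Python illustration of the standard definitions, not the evaluation code used for JarvisEvo. The function names and the sample scores are made up for the example.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson Linear Correlation Coefficient (PLCC)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman Rank Correlation Coefficient (SRCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up scores: the model orders the images exactly as humans do,
    # but on a non-linear scale.
    human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
    model_scores = [1.0, 4.0, 9.0, 16.0, 25.0]
    print("SRCC:", spearman(human_scores, model_scores))
    print("PLCC:", pearson(human_scores, model_scores))
```

SRCC depends only on ordering, so it rewards a model that ranks images the way humans do even if its score scale differs. PLCC additionally requires the scores to be linearly related to the human ratings.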