Image Editing
Meituan open-sources the LongCat-Image model, approaching much larger flagship models in core text-to-image and image editing capabilities
Xin Lang Cai Jing· 2025-12-08 07:24
On December 8, Zhitong Finance learned that Meituan's LongCat team recently announced the open-source release of its newly developed LongCat-Image model. Through high-performance architecture design, systematic training strategies, and data engineering, the 6B-parameter model approaches much larger flagship models in core text-to-image and image editing capabilities, offering developers and industry a "high-performance, low-barrier, fully open" option. According to the team, LongCat-Image's core strengths lie in its architecture design and training strategy. Specifically, the model uses a shared architecture for text-to-image generation and image editing, combined with a progressive learning strategy; in objective benchmarks its image editing score and Chinese text rendering lead all evaluated models, while its GenEval and DPG-Bench results on text-to-image tasks show it remains strongly competitive with leading open-source and closed-source models. (Zhitong Finance reporter Fan Jialai) ...
Short on training data for image editing? Mine it directly from video: near-SOTA results with only 1% of the training data
量子位· 2025-12-06 03:21
Core Viewpoint
- The article presents a novel approach to image editing by redefining it as a degenerate temporal process, leveraging video data to enhance the efficiency and effectiveness of image editing tasks [1][4].
Group 1: Current Challenges in Image Editing
- Existing image editing methods based on diffusion models require large-scale, high-quality triplet data (instruction, source image, edited image), which is costly to build and fails to cover diverse user editing intents [3].
- There is a fundamental trade-off between structure preservation and texture modification; emphasizing one can limit flexibility, while pursuing significant semantic changes may lead to geometric distortions [3].
Group 2: Video4Edit Approach
- The Video4Edit team redefines image editing as a special, degenerate form of video generation, allowing knowledge to be extracted from video data [4].
- By modeling image editing as two-frame video generation, the source image is treated as frame 0 and the edited image as frame 1, enabling the model to learn from video data and improve data efficiency (see the sketch after this summary) [6][9].
Group 3: Knowledge Transfer and Efficiency
- The single-frame evolution prior from video pre-trained models naturally provides a mechanism for balancing structural preservation against semantic change [7].
- The model learns to align editing intents rather than starting from scratch, leading to efficient parameter reuse and improved data efficiency [12].
Group 4: Data Efficiency Analysis
- Introducing video priors significantly reduces the entropy of the hypothesis space, enhancing effective generalization [15].
- Fine-tuning based on temporal evolution offers higher sample efficiency, explaining why only about 1% of the supervised data is needed to reach convergence [16].
Group 5: Performance Evaluation
- Video4Edit has been systematically evaluated across image editing tasks including style transfer, object replacement, and attribute modification, demonstrating its ability to accurately capture target style features while preserving source image structure [17].
- The model achieves comparable or superior performance to baseline methods using only 1% of the supervised data, indicating a significant reduction in dependency on labeled data while maintaining high-quality editing results [21][23].
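A minimal sketch of the two-frame formulation described above, assuming generic tensor shapes and a placeholder `video_model` interface; none of the names below come from the Video4Edit release.

```python
# Sketch of the two-frame formulation: the (source, edited) pair is packed as
# a tiny "video" so a video generation model can be reused for editing.
# The tensor layout and the `video_model` call are illustrative assumptions.
import torch

def pack_edit_as_video(source: torch.Tensor, edited: torch.Tensor) -> torch.Tensor:
    """Stack source (frame 0) and edited (frame 1) images into a 2-frame clip.

    source, edited: (C, H, W) image tensors -> returns (T=2, C, H, W).
    """
    return torch.stack([source, edited], dim=0)

# Training: the video model learns to "evolve" frame 0 into frame 1, with the
# editing instruction supplied as the condition.
# clip = pack_edit_as_video(source_img, edited_img)        # (2, C, H, W)
# loss = video_model(clip, condition=instruction_tokens)   # assumed interface

# Inference: condition on the source as frame 0 plus the instruction, and
# decode frame 1 as the edited result.
```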
The ultimate form of RAE? Peking University & Alibaba propose UniLIP: extending CLIP to reconstruction, generation, and editing
机器之心· 2025-11-02 08:01
Core Insights
- The article discusses the UniLIP model, which addresses the trade-off between semantic understanding and pixel detail retention in unified multimodal models [2][4][32].
- UniLIP achieves state-of-the-art (SOTA) performance on various benchmarks while maintaining or slightly improving understanding capabilities compared to larger models [5][26].
Methodology
- UniLIP employs a two-stage training framework with a self-distillation loss to enhance image reconstruction without sacrificing the original understanding performance [4][11].
- The first stage aligns the decoder while freezing the CLIP model, focusing on learning to reconstruct images from fixed CLIP features [9][11].
- The second stage jointly trains CLIP and applies self-distillation to keep features consistent while injecting pixel details (a sketch of this loss follows the summary) [11][12].
Performance Metrics
- The UniLIP models (1B and 3B parameters) achieved SOTA results on benchmarks such as GenEval (0.90), WISE (0.63), and ImgEdit (3.94) [5][26][27].
- In image reconstruction, UniLIP outperformed previous quantization-based methods and demonstrated clear advantages in generation efficiency [22][24].
Architectural Design
- The architecture integrates InternVL3 and SANA, using InternViT as the CLIP encoder and a pixel decoder from DC-AE [20].
- The model uses a connector structure that maintains consistency with large language models (LLMs) [20].
Training Data
- UniLIP's training data includes 38 million pre-training samples and 60,000 instruction fine-tuning samples for generation, along with 1.5 million editing samples [21].
Image Generation and Editing
- UniLIP excels in both image generation and editing tasks, achieving high benchmark scores thanks to its rich feature representation and precise semantic alignment [26][27][30].
- The dual-condition architecture effectively connects the MLLM with diffusion models, ensuring high fidelity and consistency in generated and edited images [18][32].
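The two-stage recipe can be pictured with a short PyTorch-style sketch. It is only an illustration of the idea described above, assuming generic encoder/decoder modules; the module names, MSE losses, and frozen-teacher construction are assumptions, not the released UniLIP code.

```python
# Stage 1: freeze CLIP, train only the decoder to reconstruct pixels.
# Stage 2: unfreeze CLIP, self-distill against a frozen teacher copy so the
# semantic features stay aligned while pixel detail is injected.
import copy
import torch
import torch.nn.functional as F

def stage1_step(clip_encoder, pixel_decoder, images, optimizer):
    """Stage 1: decoder learns to reconstruct images from fixed CLIP features."""
    clip_encoder.requires_grad_(False)
    with torch.no_grad():
        feats = clip_encoder(images)          # fixed semantic features
    recon = pixel_decoder(feats)
    loss = F.mse_loss(recon, images)          # pixel reconstruction loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def stage2_step(clip_encoder, teacher_encoder, pixel_decoder,
                images, optimizer, distill_weight=1.0):
    """Stage 2: joint training with a self-distillation term for consistency."""
    clip_encoder.requires_grad_(True)
    feats = clip_encoder(images)
    with torch.no_grad():
        teacher_feats = teacher_encoder(images)  # frozen pre-stage-2 copy
    recon = pixel_decoder(feats)
    loss = F.mse_loss(recon, images) + distill_weight * F.mse_loss(feats, teacher_feats)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# The teacher is typically a frozen snapshot taken before stage 2 begins:
# teacher_encoder = copy.deepcopy(clip_encoder).eval()
```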
After surviving in the cracks for 12 years, he finally built the Chinese AI product with the most active users | WAVES
36Kr· 2025-10-30 17:47
Core Insights
- Fotor, an AI product founded by Duan Jiang, has over 10 million monthly active users and is a leading AI application in China, despite being based in Chengdu rather than a major tech hub [1][2].
- The company transitioned from a simple image editing tool into a profitable AI-driven platform, achieving a sevenfold increase in user scale and reaching profitability after launching its text-to-image tool [1][4].
- Fotor's journey reflects a non-typical entrepreneurial path, emphasizing the importance of perseverance and seizing opportunities when they arise [2][3].
Company Development
- Fotor initially focused on the mobile internet market but shifted its strategy to overseas markets due to intense competition and funding challenges [2][5].
- The company faced significant hurdles, including a lack of funding and the need to pivot to a paid model after exhausting its initial financing [5][6].
- The decision to focus on the PC market and SEO for customer acquisition proved beneficial, leading to a substantial increase in user engagement and revenue [5][6].
Product Evolution
- The launch of Fotor's text-to-image tool was a strategic response to the success of competitors like Midjourney, allowing the company to capitalize on the growing trend of AI image generation [3][4].
- Fotor has expanded into video generation, although initial attempts met with mixed results, leading to a focus on workflow improvements instead [8][9].
- The company aims to combine traditional image tools with AI capabilities, positioning itself as a versatile product company in the AI landscape [9].
Market Position
- Fotor has established a strong presence in English-speaking markets, with the U.S., U.K., Canada, Australia, and New Zealand contributing significantly to its revenue [6].
- The company has declined investment offers, citing its current profitability and the need to find a clear direction before deploying large-scale investment [7][8].
- Fotor's user base spans both professional and casual users, which has been a key factor in its sustained growth [9].
Google just released six authentic prompt recipes for Nano Banana — even the artistically challenged should hurry over
机器之心· 2025-09-03 08:33
Core Viewpoint
- Google's Nano Banana has gained popularity among users for its creative applications in generating images from text prompts, showcasing the model's versatility and potential across creative fields [2][8].
Group 1: Image Generation Techniques
- Users can create photorealistic images by providing detailed prompts that include camera angles, lighting, and environmental descriptions, which guide the model toward realistic results (an example prompt template follows this summary) [12][13].
- The model supports text-to-image generation, image editing through text prompts, multi-image composition, iterative refinement, and text rendering for clear visual communication [16].
- Specific templates for different styles, such as stickers, logos, product photography, minimalist designs, and comic panels, are provided to help users get the most out of the model [18][21][25][30][34].
Group 2: User Experience and Challenges
- Despite its capabilities, users have reported issues such as the model returning identical images during editing and inconsistencies compared to other models like Qwen and Kontext Pro [39].
- Users are encouraged to share their own insights and techniques for using Nano Banana in the comments section, fostering a community of knowledge sharing [40].
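As a concrete illustration of the photorealistic recipe above, here is a small Python helper that assembles such a prompt from subject, camera, lens, lighting, and environment fields; the field names and wording are illustrative guesses, not Google's official template.

```python
# Builds a photorealistic-style prompt in the spirit of the article's guidance.
# Everything here is an assumed example, not an official Nano Banana template.
def photorealistic_prompt(subject, camera, lens, lighting, environment):
    return (
        f"A photorealistic photo of {subject}, shot from a {camera} angle "
        f"with a {lens} lens, {lighting} lighting, set in {environment}. "
        "Emphasize natural textures, accurate shadows, and shallow depth of field."
    )

print(photorealistic_prompt(
    subject="an elderly ceramicist inspecting a freshly glazed bowl",
    camera="low, eye-level",
    lens="85mm portrait",
    lighting="soft golden-hour window",
    environment="a cluttered pottery studio",
))
```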
The late-night release champion Qwen strikes again! Its new model lets image editing "fix exactly what's wrong, right where it's wrong"
量子位· 2025-08-19 07:21
Core Viewpoint
- Qwen-Image-Edit is a powerful image editing tool that lets users perform precise edits, including adding, removing, and modifying elements in images, while maintaining visual semantics and supporting a range of creative functionality [2][67].
Group 1: Features and Capabilities
- Qwen-Image-Edit offers functionality such as original-IP editing, perspective switching, and virtual character generation, showcasing its versatility in image manipulation [2][20][67].
- The tool supports semantic editing, allowing images to be modified while preserving their original visual semantics, which is crucial for maintaining character integrity in IP creation [7][10].
- Users can perform perspective transformations, including 90-degree and 180-degree rotations, demonstrating the tool's ability to handle complex visual adjustments [14][19].
Group 2: Performance and Testing
- Initial tests indicate that Qwen-Image-Edit produces impressive results, with accurate representation of elements and details, such as maintaining the correct number of fingers in character designs [13][19].
- The tool effectively adds elements to images, such as signs, while also handling reflections and preserving detail, although high-resolution inputs may lose some quality [29][34].
- The model's ability to remove and recolor elements within images has been validated through practical examples, showcasing its precision in editing tasks [39][42][45].
Group 3: Advanced Editing Techniques
- Qwen-Image-Edit introduces a chained-editing feature, allowing users to make incremental corrections without regenerating the entire picture, improving editing efficiency (a sketch of such a loop follows this summary) [56][62].
- Its dual editing capabilities cover both low-level visual appearance edits and high-level semantic edits, catering to a wide range of image editing needs [67].
Group 4: Market Position and Performance Metrics
- Qwen-Image-Edit has demonstrated state-of-the-art (SOTA) performance on various public benchmarks, establishing itself as a strong foundation model for image editing tasks [67].
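A minimal sketch of the chained-editing workflow described above; `edit_model` is a placeholder for whatever Qwen-Image-Edit entry point is actually loaded, and its call signature is an assumption rather than the model's documented API.

```python
# Chained ("incremental") editing: each step feeds the previous output back in
# with a new instruction, so only the flawed part needs another pass.
from PIL import Image

def chain_edit(edit_model, image: Image.Image, instructions: list[str]) -> Image.Image:
    """Apply a sequence of editing instructions, one correction at a time."""
    current = image
    for step, instruction in enumerate(instructions, start=1):
        current = edit_model(image=current, prompt=instruction)  # assumed interface
        current.save(f"edit_step_{step}.png")  # keep intermediates for comparison
    return current

# Example: fix issues one at a time instead of regenerating the whole picture.
# result = chain_edit(edit_model, Image.open("calligraphy.png"),
#                     ["correct the stroke of the third character",
#                      "make the red seal slightly smaller"])
```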
Qwen's new open-source release blows past the SOTA for text in AI-generated images
量子位· 2025-08-05 01:40
Core Viewpoint
- The article covers the release of Qwen-Image, a 20-billion-parameter image generation model that excels at complex text rendering and image editing [3][28].
Group 1: Model Features
- Qwen-Image is the first foundational image generation model in the Tongyi Qianwen series and uses the MMDiT architecture [4][3].
- It performs exceptionally well at complex text rendering, supporting multi-line layouts and fine-grained detail in both English and Chinese [28][32].
- The model also offers consistent image editing capabilities, including style transfer, object modification, detail enhancement, text editing, and pose adjustment [27][28].
Group 2: Performance Evaluation
- Qwen-Image has achieved state-of-the-art (SOTA) performance across public benchmarks, including GenEval, DPG, and OneIG-Bench for image generation, and GEdit, ImgEdit, and GSO for image editing [29][30].
- In particular, it shows a clear advantage in Chinese text rendering compared to existing advanced models [33].
Group 3: Training Strategy
- The model uses a progressive training strategy that moves from non-text to text rendering and from simple to complex text inputs, which strengthens its native text rendering capability (a toy curriculum sketch follows this summary) [34].
Group 4: Practical Applications
- The article includes practical demonstrations of Qwen-Image's capabilities, such as generating illustrations, PPTs, and promotional images, showcasing its ability to integrate text accurately with visuals [11][21][24].
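A toy sketch of what such a progressive curriculum could look like as a data sampler; the stage boundaries, pool names, and mixing ratios are invented for illustration and are not taken from the Qwen-Image report.

```python
# Progressive text-rendering curriculum: early training samples only non-text
# images, later stages mix in short and then complex text. All numbers below
# are made-up assumptions to show the scheduling idea.
import random

CURRICULUM = [
    # (training-progress threshold, sampling weights per data pool)
    (0.3, {"no_text": 1.0, "short_text": 0.0, "complex_text": 0.0}),
    (0.7, {"no_text": 0.5, "short_text": 0.5, "complex_text": 0.0}),
    (1.0, {"no_text": 0.3, "short_text": 0.4, "complex_text": 0.3}),
]

def sample_pool(progress: float) -> str:
    """Pick which data pool to draw the next training example from."""
    for threshold, weights in CURRICULUM:
        if progress <= threshold:
            pools, probs = zip(*weights.items())
            return random.choices(pools, weights=probs, k=1)[0]
    return "complex_text"

# At 10% progress only "no_text" is sampled; at 90% all three pools mix.
print(sample_pool(0.1), sample_pool(0.9))
```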
The DeepSeek of images! 12B parameters benchmarked against GPT-4o, images in 5 seconds, and editing plus generation you can run on consumer hardware
量子位· 2025-06-30 00:38
Core Viewpoint
- Black Forest Labs has announced the open-source release of its flagship image model FLUX.1 Kontext[dev], designed for image editing and capable of running on consumer-grade chips [1][23].
Group 1: Model Features
- FLUX.1 Kontext[dev] has 12 billion parameters, offering faster inference and performance comparable to closed-source models like GPT-image-1 [2][36].
- The model applies changes directly to existing images based on editing instructions, enabling precise local and global edits without any fine-tuning (a usage sketch follows this summary) [6][36].
- Users can refine images through multiple consecutive edits while minimizing visual drift [6][36].
- The model is optimized for the NVIDIA Blackwell architecture, enhancing performance [6][39].
Group 2: Performance and Efficiency
- FLUX.1 Kontext[dev] has been validated against a benchmark called KontextBench, which includes 1,026 image-prompt pairs across various editing tasks, showing superior performance compared to existing models [37].
- Inference speed has improved 4 to 5 times over previous versions, typically completing a task within 5 seconds on NVIDIA H100 GPUs, with operational costs around $0.0067 per run [41].
- Users have reported longer iteration times on MacBook Pro chips, at about 1 minute per iteration [41].
Group 3: User Engagement and Accessibility
- The official API for FLUX.1 Kontext[dev] is open for public testing, allowing users to upload images and experiment with the model [19].
- Open weights and variants are available, letting users trade off speed, efficiency, and quality according to their hardware [41].
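A minimal sketch of single-instruction editing with the open [dev] weights, assuming the diffusers integration; the pipeline class name, repository id, and guidance_scale value below should be treated as assumptions and verified against the official model card.

```python
# Instruction-based editing with FLUX.1 Kontext[dev] via diffusers (assumed API).
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev",  # assumed repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

source = load_image("storefront.png")
edited = pipe(
    image=source,
    prompt="Replace the sign text with 'OPEN 24/7' and keep everything else unchanged",
    guidance_scale=2.5,  # assumed default; check the model card
).images[0]
edited.save("storefront_edited.png")

# Feeding `edited` back in with a new instruction gives the consecutive-edit
# workflow the article describes, with minimal visual drift between passes.
```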
ByteDance open-sources an image editing breakthrough: 1/30 the parameters, 1/13 the data, and a 9.19% performance gain
量子位· 2025-05-07 09:33
Core Viewpoint
- ByteDance has developed a new image editing method that improves performance by 9.19% over the current state-of-the-art (SOTA) methods while using only 1/30 of the training data and a model 1/13 the size [1].
Group 1: Methodology and Innovation
- The method requires no additional pre-training tasks or architectural modifications; instead it relies on powerful multimodal models such as GPT-4o to correct editing instructions [2].
- This approach addresses the noisy supervisory signals in existing image editing models by constructing more effective editing instructions to improve editing outcomes [3][9].
- The data and model have been open-sourced on GitHub [4].
Group 2: Challenges in AI Image Editing
- AI models often misinterpret instructions; for example, changing the color of a boy's tie can lead to unintended changes to skin tone or clothing [6].
- The team found that existing image editing datasets contain a significant amount of noisy supervision because they are built with automated pipelines, leading to mismatches between instructions and image pairs [10][11][12].
Group 3: Training and Supervision
- SuperEdit focuses on improving the quality of supervisory signals rather than scaling parameters or pre-training compute [13].
- The team used GPT-4o to generate more accurate editing instructions by observing the differences between original and edited images [17].
- A comparative supervision mechanism ensures the model learns the subtle differences between correct and incorrect editing instructions, improving its ability to understand and execute commands (a triplet-loss sketch follows this summary) [22][23].
Group 4: Performance Metrics
- SuperEdit performed strongly on multiple benchmarks, achieving an overall accuracy of 69.7% and a score of 3.91 on the Real-Edit benchmark, surpassing the previous SOTA method SmartEdit, which scored 58.3% accuracy and 3.59 [25][28].
- The model was trained with a triplet loss to distinguish correct from incorrect editing instructions [27].
Group 5: Future Directions
- The team plans to extend this data-first approach to more visual generation tasks and to explore combining it with larger models [31].
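A minimal PyTorch sketch of the triplet objective mentioned above: the representation of an edit should land closer to the corrected instruction than to the original noisy one. The encoders, embedding size, distance choice, and margin are illustrative assumptions rather than the SuperEdit implementation.

```python
# Triplet loss over (edit embedding, correct instruction, wrong instruction).
import torch
import torch.nn.functional as F

def instruction_triplet_loss(edit_embed: torch.Tensor,
                             correct_instr_embed: torch.Tensor,
                             wrong_instr_embed: torch.Tensor,
                             margin: float = 0.2) -> torch.Tensor:
    """Hinge triplet loss over cosine distances between an edit's embedding
    (anchor) and the embeddings of a correct vs. incorrect instruction."""
    pos = 1.0 - F.cosine_similarity(edit_embed, correct_instr_embed, dim=-1)
    neg = 1.0 - F.cosine_similarity(edit_embed, wrong_instr_embed, dim=-1)
    return torch.clamp(pos - neg + margin, min=0.0).mean()

# Example with random embeddings standing in for real encoder outputs.
anchor, positive, negative = (torch.randn(8, 512) for _ in range(3))
print(instruction_triplet_loss(anchor, positive, negative))
```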
Meitu upgrades its competitiveness in AI vision: seven image editing research results released
Zheng Quan Ri Bao· 2025-04-09 08:40
Core Insights
- Meitu's MT Lab gained significant recognition with five research results accepted at the prestigious CVPR 2025 conference, which received over 13,000 submissions and had an acceptance rate of only 22.1% [2].
- The lab also had two projects accepted at AAAI 2025, which had a 23.4% acceptance rate from 12,957 submissions [2].
- The seven research results focus on image editing, covering three generative AI technologies, three segmentation technologies, and one 3D reconstruction technology [2].
Generative AI Technologies
- GlyphMastero has been deployed in Meitu's app Meitu Xiuxiu, providing users with a seamless text modification experience [3].
- MTADiffusion is integrated into Meitu's AI material generator WHEE, allowing efficient image editing with simple commands [3].
- StyO powers Meitu Xiuxiu's AI creative and beauty camera features, enabling users to explore different dimensions easily [4].
Segmentation and 3D Reconstruction Technologies
- The segmentation breakthroughs include interactive segmentation and cutout technologies, applied in e-commerce design, image editing, and portrait beautification [4].
- EVPGS represents advances in 3D reconstruction, with growing demand in novel view generation, augmented reality (AR), 3D content generation, and virtual digital humans [4].
Industry Position and Future Potential
- Meitu's long-term investment in AI has allowed the company to bring cutting-edge technologies into practical applications, strengthening its competitive edge in core visual capabilities [4].
- Continuous iteration of product capabilities has increased user engagement and willingness to pay, indicating promising growth potential and expansion opportunities for the company [4].