Text-to-Image
ChatGPT brings Photoshop in: retouch an image with a single sentence
Bei Jing Shang Bao· 2025-12-16 03:11
Group 1
- Adobe has launched integrations of Photoshop, Express, and Acrobat for ChatGPT users, allowing them to access these tools directly within the chatbot [1]
- The integration aims to present Adobe's products to over 800 million active users of ChatGPT, enhancing user creativity through easy access [1]
- Users can perform various editing tasks such as adjusting brightness, contrast, and saturation, as well as applying stylized effects directly in ChatGPT [1]

Group 2
- The new "extension mode" in ChatGPT allows users to input commands for image editing, which Adobe Express will automatically generate drafts for, enabling real-time adjustments without re-entering commands [2]
- Adobe emphasizes that its core generation capabilities are based on its proprietary Firefly models, ensuring that all generated content has commercial usage rights and copyright protection [2]
- OpenAI's integration of third-party applications into ChatGPT is part of a broader strategy to position the platform as a digital service hub, with Adobe being one of the early adopters [2]

Group 3
- OpenAI's GPT-4o has improved capabilities for image generation, allowing users to transform photos into artistic styles, which has gained significant popularity on social media [3]
- The advancements in GPT-4o include better text integration, enhanced context understanding, and improved multi-object binding, making it suitable for various applications, including advertising [3]
- The demand for AI-generated images highlights the importance of computational power, as OpenAI's GPUs faced challenges in meeting user needs for the new image generation features [3]

Group 4
- The competitive landscape in image editing products shows minimal technological differences, with AI driving functional upgrades and user engagement being crucial for widespread adoption [4]
- The success of new features relies not only on product maturity but also on effective marketing strategies that can spark user curiosity and encourage usage [4]

Group 5
- The rise of AI is expected to enhance productivity in media applications, benefiting companies that produce quality content and those in digital marketing, e-commerce, and copyright protection sectors [5]
Meituan open-sources the LongCat-Image model, approaching much larger flagship models in core text-to-image and image editing capabilities
Xin Lang Cai Jing· 2025-12-08 07:24
On December 8, Zhitong Finance learned that Meituan's LongCat team recently announced the open-source release of its newly developed LongCat-Image model. Through high-performance architecture design, a systematic training strategy, and data engineering, the model approaches much larger flagship models in core text-to-image and image editing capabilities at a parameter scale of only 6B, offering developers and industry a "high-performance, low-barrier, fully open" option. According to the team, LongCat-Image's core advantage lies in its architecture design and training strategy. Specifically, the model uses a shared architecture for text-to-image generation and image editing, combined with a progressive learning strategy. In objective benchmark tests, its image editing scores and Chinese text rendering lead all evaluated models; on text-to-image tasks, its results on GenEval and DPG-Bench show it remains strongly competitive against leading open-source and closed-source models. (Zhitong Finance reporter Fan Jialai) ...
BFL reaches a $3.25 billion valuation one year after founding, and an AI-native Dropbox is coming
投资实习所· 2025-12-02 05:12
Core Insights
- The effectiveness of images and videos in user engagement is high, but their long-term usage depends on whether they can become tools that genuinely help businesses or users generate revenue [1]

Group 1: AI Product Performance
- OpenAI's Sora experienced significant initial user engagement but has seen a recent decline in usage, indicating challenges in sustaining interest for standalone products [2]
- Elevenlabs, a voice AI company, reported revenue of $193 million over the past 12 months, with approximately 50% coming from enterprise clients like Cisco and Twilio, and a profit margin of around 60% [2]

Group 2: Company Valuation and Strategy
- Black Forest Labs (BFL), an AI image generation startup, achieved a valuation of $3.25 billion after raising $300 million in Series B funding, demonstrating rapid growth since its establishment in August 2024 [3][4]
- BFL's founders are key contributors to the Stable Diffusion models, which have significantly influenced the open-source image generation community and proprietary models like DALL-E [4]
- BFL's strategy focuses on positioning itself as a model company rather than a direct consumer product provider, collaborating with major companies like Adobe and Microsoft through API integrations [6]

Group 3: Technological Innovation
- BFL's core model, FLUX.2, is released with open weights, allowing researchers and developers to utilize and customize it freely, enhancing its technical influence [6]
- The company's focus on "visual intelligence" aims to unify perception, generation, memory, and reasoning, distinguishing it from competitors that only focus on text-to-image models [6]

Group 4: Emerging Products
- A new AI-native version of Dropbox is in development, which aims to transform file operations from storage-centric to understanding-centric, indicating a shift in product strategy [6][7]
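As context for the open-weights point above, the sketch below shows the documented Hugging Face diffusers pattern for BFL's earlier FLUX.1-dev checkpoint. Whether FLUX.2 exposes the same pipeline class and a similarly named repository is an assumption; the repo id shown is the FLUX.1 release, used here only to illustrate how open-weight FLUX checkpoints are typically consumed.

```python
# Sketch: loading an open-weight FLUX checkpoint with Hugging Face diffusers.
# FLUX.1-dev is used because its diffusers integration is documented; treating
# FLUX.2 the same way is an assumption. Requires a GPU and accepting the model
# license on the Hugging Face Hub.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",      # published FLUX.1 open-weight repo
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()          # fits on smaller GPUs at some speed cost

image = pipe(
    "a black forest gateau on a wooden table, soft window light",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux_sample.png")
```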
Let AI image generation correct itself: randomly dropping modules improves generation quality and says goodbye to plasticky throwaway images
量子位· 2025-08-23 05:06
Core Viewpoint
- The article discusses the introduction of a new method called S²-Guidance, developed by a research team from Tsinghua University, Alibaba AMAP, and the Chinese Academy of Sciences, which enhances the quality and coherence of AI-generated images and videos through a self-correcting mechanism [1][4]

Group 1: Methodology and Mechanism
- S²-Guidance utilizes a technique called Stochastic Block-Dropping to dynamically construct "weak" sub-networks, allowing the AI to self-correct during the generation process [3][10]
- The method addresses the limitations of Classifier-Free Guidance (CFG), which often leads to distortion and lacks generalizability due to its linear extrapolation nature [5][8]
- By avoiding the need for external weak models and complex parameter tuning, S²-Guidance offers a universal and automated solution for self-optimization [12][11]

Group 2: Performance Improvements
- S²-Guidance significantly enhances visual quality across multiple dimensions, including temporal dynamics, detail rendering, and artifact reduction, compared to previous methods like CFG and Autoguidance [19][21]
- The method demonstrates superior performance in generating coherent and aesthetically pleasing images, effectively avoiding common issues such as unnatural artifacts and distorted objects [22][24]
- In video generation, S²-Guidance resolves key challenges related to physical realism and complex instruction adherence, producing stable and visually rich scenes [25][26]

Group 3: Experimental Validation
- The research team validated the effectiveness of S²-Guidance through rigorous experiments, showing that it balances guidance strength with distribution fidelity, outperforming CFG in capturing true data distributions [14][18]
- S²-Guidance achieved leading scores on authoritative benchmarks like HPSv2.1 and T2I-CompBench, surpassing all comparative methods in various quality dimensions [26][27]
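For readers unfamiliar with the guidance terms above, here is a minimal conceptual sketch of classifier-free guidance plus a "weak" prediction obtained by randomly skipping blocks of the same network. It only illustrates the idea of stochastic block-dropping: the TinyDenoiser, the drop probability, and the combination rule in the last line are placeholders, not the paper's actual S²-Guidance formulation.

```python
# Conceptual sketch only: CFG extrapolation plus a stochastically weakened
# sub-network of the same model; not the exact S²-Guidance rule from the paper.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Toy denoiser: a stack of residual blocks; block_mask lets blocks be skipped."""
    def __init__(self, dim=64, n_blocks=8):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim * 2, dim) for _ in range(n_blocks))

    def forward(self, x, cond, block_mask=None):
        for i, blk in enumerate(self.blocks):
            if block_mask is not None and not block_mask[i]:
                continue  # stochastic block-dropping: this block is skipped
            x = x + torch.tanh(blk(torch.cat([x, cond], dim=-1)))
        return x

def guided_prediction(model, x, cond, drop_p=0.3, cfg_scale=5.0, weak_scale=1.0):
    """Classic CFG extrapolation plus a correction away from a randomly
    weakened copy of the network (illustrative combination rule only)."""
    eps_cond = model(x, cond)                            # conditional prediction
    eps_uncond = model(x, torch.zeros_like(cond))        # null-conditioning prediction
    eps_cfg = eps_uncond + cfg_scale * (eps_cond - eps_uncond)

    keep = torch.rand(len(model.blocks)) > drop_p        # randomly keep ~70% of blocks
    eps_weak = model(x, cond, block_mask=keep)           # "weak" sub-network prediction
    return eps_cfg - weak_scale * (eps_weak - eps_cond)  # steer away from the weak prediction

model = TinyDenoiser()
x, cond = torch.randn(4, 64), torch.randn(4, 64)
print(guided_prediction(model, x, cond).shape)           # torch.Size([4, 64])
```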
The Qwen-Image model is now live on CoresHub: come try its powerful text rendering capabilities
Sou Hu Cai Jing· 2025-08-14 15:48
Core Insights
- Qwen-Image, the first text-to-image foundational model from the Qwen series, has been launched by Qiyun Technology's AI computing cloud, CoresHub, featuring 20 billion parameters and developed by Alibaba's Tongyi Qianwen team [1]
- The model excels in complex text rendering, precise image editing, multi-line layout, paragraph-level generation, and detail depiction, making it particularly effective in poster design scenarios [1]

Model Highlights
- Exceptional text rendering capabilities: Qwen-Image demonstrates outstanding performance in complex text generation and rendering, supporting multi-line typesetting, paragraph-level layout, and fine-grained detail presentation in both English and Chinese [2]
- Consistency in image editing: Leveraging enhanced multi-task training paradigms, Qwen-Image can accurately modify target areas during image editing while maintaining overall visual consistency and semantic coherence [2]
- Industry-leading performance: Multiple public benchmark test results indicate that Qwen-Image has achieved state-of-the-art (SOTA) results in various image generation and editing tasks, validating its comprehensive strength [2]

Usage Steps
- Users can log into CoresHub, navigate to the model plaza, select the Qwen-Image model, and click on model deployment [3]
- The model can be deployed by selecting a single card 4090D resource type, and after successful deployment, users can copy the external link to open in a browser [4]
- Once the Comfy UI page loads successfully, users can select the Qwen-Image template and input their prompt (see the scripted sketch after this summary) [6]

Effect Demonstration
- Various prompts showcase the capabilities of Qwen-Image, including imaginative scenarios such as a Shiba Inu wearing a cowboy hat at a bar, a cotton candy castle in the clouds, and a retro arcade with a pixel-style game machine [9][11][12][13]
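Regarding the usage steps above: once the deployment's external link is open, ComfyUI can also be driven programmatically through its standard HTTP API rather than the web page. A minimal sketch follows; the host placeholder, the exported workflow file name, and the node id holding the prompt text are assumptions that depend on the actual Qwen-Image template, not details confirmed by CoresHub.

```python
# Sketch: queueing a generation job against a deployed ComfyUI instance via its
# standard /prompt HTTP endpoint. The workflow file is assumed to have been
# exported from the Qwen-Image template with ComfyUI's "Save (API Format)".
import json
import urllib.request

COMFY_URL = "http://<deployment-host>:8188"   # placeholder: the external link copied from CoresHub

def queue_prompt(workflow: dict) -> dict:
    """POST the workflow to ComfyUI's /prompt endpoint and return its JSON reply."""
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"{COMFY_URL}/prompt", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)                # typically contains a prompt_id

with open("qwen_image_workflow.json", encoding="utf-8") as f:
    workflow = json.load(f)

# Patch the positive-prompt text; node id "6" is hypothetical and depends on the exported template.
workflow["6"]["inputs"]["text"] = "Poster headline 'Qwen-Image', clean multi-line Chinese and English layout"

print(queue_prompt(workflow))
```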
The state of China's multimodal large model industry in 2025: image, video, audio, 3D and other models will ultimately be connected and fused [charts]
Qian Zhan Wang· 2025-06-01 05:09
Core Insights
- The exploration of multimodal large models is making gradual progress, with a focus on breakthroughs in visual modalities, aiming for an "Any-to-Any" model that requires successful pathways across various modalities [1]
- The industry is currently concentrating on enhancing perception and generation models in image, video, and 3D modalities, with the goal of achieving cross-modal integration and sharing [1]

Multimodal Large Models in Image
- Prior to the rise of LLMs in 2023, the industry had already established a solid foundation in image understanding and generation, resulting in models like CLIP, Stable Diffusion, and GAN, which led to applications such as Midjourney and DALL·E [2]
- The industry is actively exploring the integration of Transformer models into image-related tasks, with significant outcomes including GLIP, SAM, and GPT-V [2]

Multimodal Large Models in Video
- Video generation is being approached by transferring image generation models to video, utilizing image data for training and aligning temporal dimensions to achieve text-to-video results [5]
- Recent advancements include models like VideoLDM and Sora, which demonstrate significant breakthroughs in video generation using the Diffusion Transformer architecture [5]

Multimodal Large Models in 3D
- The generation of 3D models is being explored by extending 2D image generation methods, with key models such as 3D GAN, MeshDiffusion, and Instant3D emerging in the industry [8][9]
- 3D data representation includes various formats like meshes, point clouds, and NeRF, with NeRF being a critical technology for 3D data representation [9]

Multimodal Large Models in Audio
- AI technologies related to audio have matured, with recent applications of Transformer models enhancing audio understanding and generation, exemplified by projects like Whisper large-v3 and VALL-E [11]
- The evolution of speech technology is categorized into three stages, with a focus on enhancing generalization capabilities across multiple languages and tasks [11]
Tencent pulls out a big move with Hunyuan Image 2.0 "drawing as you speak": by the time you finish describing, the image is already generated
量子位· 2025-05-16 03:39
Core Viewpoint
- Tencent has launched the Hunyuan Image 2.0 model, which enables real-time image generation with millisecond response times, allowing users to create images seamlessly while describing them verbally or through sketches [1][6]

Group 1: Features of Hunyuan Image 2.0
- The model supports real-time drawing boards where users can sketch elements and provide text descriptions for immediate image generation [3][29]
- It offers various input methods, including text prompts, voice input in both Chinese and English, and the ability to upload reference images for enhanced image generation [18][19]
- Users can optimize generated images by adjusting parameters such as reference image strength and can also use a feature to automatically enhance composition and depth [27][35]

Group 2: Technical Highlights
- Hunyuan Image 2.0 features a significantly larger model size, increasing parameters by an order of magnitude compared to its predecessor, Hunyuan DiT, which enhances performance [37]
- The model incorporates a high-compression image codec developed by Tencent, which reduces encoding sequence length and speeds up image generation while maintaining quality [38]
- It utilizes a multimodal large language model (MLLM) as a text encoder, improving semantic understanding and matching capabilities compared to traditional models [39][40]
- The model has undergone reinforcement learning training to enhance the realism of generated images, aligning them more closely with real-world requirements [41]
- Tencent has developed a self-developed adversarial distillation scheme that allows for high-quality image generation with fewer sampling steps [42]

Group 3: Future Developments
- Tencent's team has indicated that more details will be shared in upcoming technical reports, including information about a native multimodal image generation model [43][45]
- The new model is expected to excel in multi-round image generation and real-time interactive experiences [46]
A dark horse among open-source text-to-image models, from Hefei
AI研究所· 2025-05-09 17:44
Core Viewpoint
- The article highlights the emergence of HiDream.ai, a Chinese company that has developed a competitive image generation model, HiDream-I1, which rivals OpenAI's GPT-4o in performance and capabilities, marking a significant advancement in the AI field [1][3][6]

Group 1: HiDream-I1 Model Performance
- HiDream-I1 achieved a score of 1123 on the ArtificialAnalysis platform, ranking second globally, just behind GPT-4o with a score of 1139, indicating a mere 0.8% performance gap [3][6]
- In various benchmark tests, HiDream-I1 outperformed other models such as MidjourneyV6 and DALL-E3, showcasing its superior capabilities in complex prompt understanding and image quality [6]
- HiDream-I1 is the only open-source image generation model that allows commercial use, which has attracted significant attention from developers and companies globally [6][10]

Group 2: Team Background and Business Strategy
- HiDream.ai was founded in March 2023, with a team primarily composed of members from the University of Science and Technology of China, led by founder Mei Tao, who has a strong background in AI research [8][9]
- The company is exploring sustainable business models while focusing on user pain points to develop optimized products and services [10]
- HiDream.ai has already implemented its technology in various applications, including a strategic partnership with Cambrian for cloud acceleration and a collaboration with China Mobile for AI video products [11]

Group 3: Local Ecosystem and Industry Growth
- HiDream.ai's success is closely tied to the supportive ecosystem in Hefei, which integrates resources from universities, government, and enterprises, fostering rapid AI industry growth [14][15]
- Hefei aims to achieve an AI industry scale exceeding 200 billion yuan by 2025, with significant investments in computing power and infrastructure [16][21]
- The city has established itself as a national AI industrial base, with over 2,200 companies and a revenue exceeding 200 billion yuan in 2023, showcasing its competitive edge in the AI sector [16][21]
Text-to-image enters its R1 moment: CUHK MMLab releases T2I-R1
机器之心· 2025-05-09 02:47
Core Viewpoint
- The article discusses the development of T2I-R1, a novel text-to-image generation model that utilizes a dual-level Chain of Thought (CoT) reasoning framework combined with reinforcement learning to enhance image generation quality and alignment with human expectations [1][3][11]

Group 1: Methodology
- T2I-R1 employs two distinct levels of CoT reasoning: Semantic-CoT and Token-CoT. Semantic-CoT focuses on the global structure of the image, while Token-CoT deals with the detailed generation of image tokens [6][7]
- The model integrates Semantic-CoT to plan and reason about the image before generation, optimizing the alignment between prompts and generated images [7][8]
- Token-CoT generates image tokens sequentially, ensuring visual coherence and detail in the generated images [7][8]

Group 2: Model Enhancement
- T2I-R1 enhances a unified language and vision model (ULM) by incorporating both Semantic-CoT and Token-CoT into a single framework for text-to-image generation [9][11]
- The model uses reinforcement learning to jointly optimize the two levels of CoT, allowing for multiple sets of Semantic-CoT and Token-CoT to be generated for a single image prompt (a schematic sketch of this loop follows this summary) [11][12]

Group 3: Experimental Results
- The T2I-R1 model demonstrates improved robustness and alignment with human expectations when generating images based on prompts, particularly in unusual scenarios [13]
- Quantitative results indicate that T2I-R1 outperforms baseline models by 13% and 19% on the T2I-CompBench and WISE benchmarks, respectively, and surpasses previous state-of-the-art models [16]
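To make the two-level loop above concrete, here is a schematic sketch of how Semantic-CoT planning, Token-CoT generation, and group-wise reward sampling could fit together. The stub model, the reward function, and the group size are placeholders, and the RL update itself is only indicated in a comment, so this is an illustration of the framework's shape rather than the authors' implementation.

```python
# Schematic sketch of a dual-level CoT text-to-image loop in the spirit of T2I-R1.
# All model calls and the reward are stand-ins, not the authors' actual API.
import random
from typing import List

class StubUnifiedModel:
    """Placeholder for a unified language + vision model (ULM); returns dummy data."""
    def generate_text(self, request: str) -> str:
        return f"[plan for: {request[:40]}...]"
    def generate_image_tokens(self, condition: str) -> List[int]:
        return [random.randrange(8192) for _ in range(64)]   # 64 dummy visual tokens
    def decode_tokens(self, tokens: List[int]) -> str:
        return f"<image decoded from {len(tokens)} tokens>"

def score_alignment(prompt: str, image: str) -> float:
    """Placeholder reward; the paper instead uses vision-expert scoring."""
    return random.random()

def semantic_cot(model, prompt: str) -> str:
    """Level 1 (Semantic-CoT): plan the global structure of the image in text."""
    return model.generate_text(f"Describe layout, objects and style for: {prompt}")

def token_cot(model, prompt: str, plan: str) -> List[int]:
    """Level 2 (Token-CoT): generate image tokens conditioned on prompt + plan."""
    return model.generate_image_tokens(condition=f"{prompt}\n{plan}")

def sample_group(model, prompt: str, group_size: int = 4):
    """Sample several (plan, tokens, reward) triples per prompt; a group-relative
    RL step would then update the ULM to improve both CoT levels jointly."""
    out = []
    for _ in range(group_size):
        plan = semantic_cot(model, prompt)
        tokens = token_cot(model, prompt, plan)
        reward = score_alignment(prompt, model.decode_tokens(tokens))
        out.append((plan, tokens, reward))
    return out

print(sample_group(StubUnifiedModel(), "a cat reading a newspaper under a lamp"))
```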
I've pretty much figured out AI-generated typography design: this set of prompts boosts efficiency by 50%.
数字生命卡兹克· 2025-04-13 17:16
A very cool way of generating typography with Jimeng 3.0 that Azhen worked out, reposted here for everyone.

Hi everyone, happy Monday! Every day lots of ideas flash through my head and vanish; if I don't record them right away I get too lazy to ever try them, so they deserve to be written down properly. Today's post is another solid piece of practical material. What this set of prompts does is simple: you only need to enter your text content, and you get a decent typographic design as the visual result. To test and present the effect I nearly emptied my Jimeng AI quota; after trying a great many combinations and styles I'm confident the results are genuinely decent. Keeping it short today: feel free to browse the images, then grab the prompt templates and give them a try.

First, some image-and-text results that look pretty good:

"艺术家看到的比你多在哪"/"WHERE DO ARTISTS SEE BEYOND YOU", abstract conceptual calligraphy fused with a negative-space deconstruction style; the text's edges dissolve slightly like the margins of consciousness, and a floating arrangement evokes fragments of thought; the background is an ethereal grey-white with interwoven textures of the real and the void, like fissures in a mental space; the lettering uses translucent, layered brush strokes, light and deliberately incomplete, creating surreal visual blankness; detached and calm in temperament, with a philosophical, meditative atmosphere and a strong sense of mental leaps; minimalist, contemplative composition; an artistic stream-of-consciousness masterpiece

"电竞少年"/"E-SPORTS YOUTH", e-sports energy fused with a dynamic, elegant, sci-fi light-slice style; sharp, crisp letterforms with lines extending like electric current; bright outlines combined with speed effects; the background is a deep ...
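Since the full prompt template is not reproduced in this excerpt, the sketch below only shows the fill-in-the-blanks idea implied by the examples above: slot the Chinese and English text into a fixed pattern and append a style clause. The style presets are abbreviated paraphrases of the two example prompts, not the author's actual template.

```python
# Sketch of a fill-in-the-blanks prompt builder following the pattern visible in
# the examples above ("Chinese text"/"ENGLISH TEXT", style clauses). The presets
# are abbreviated paraphrases, not the original template from the post.
STYLE_PRESETS = {
    "philosophical": ("abstract calligraphy with deconstructed negative space, "
                      "characters dissolving at the edges, floating fragmented layout, "
                      "ethereal grey-white background, translucent layered brush strokes"),
    "esports": ("dynamic sci-fi light-slice style, sharp angular letterforms, "
                "strokes extending like electric current, bright outlines with speed streaks"),
}

def build_typography_prompt(cn_text: str, en_text: str, style: str) -> str:
    """Slot the text content into the pattern and append the chosen style clause."""
    return f'"{cn_text}"/"{en_text}", {STYLE_PRESETS[style]}, typography design masterpiece'

print(build_typography_prompt("电竞少年", "E-SPORTS YOUTH", "esports"))
```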