Autoregressive Models
CVPR 2025 Highlight | Improving In-Context Learning in Autoregressive Models: A New Few-Shot Image Editing Paradigm Goes Open Source
机器之心· 2025-06-01 03:30
Core Viewpoint - The article discusses the development of a new autoregressive model called InstaManip, which enhances in-context learning capabilities to better address the challenges of few-shot image editing [26].

Summary by Sections
- Introduction - Recent advancements in diffusion models have significantly improved text-guided image editing algorithms, but performance declines when user requests are difficult to describe or deviate from the training data distribution [1][2].
- Problem Statement - The challenge arises when users want to edit images in ways that are not well-represented in the training dataset, such as transforming a regular car into a Lamborghini, which is hard to describe accurately with words alone [1].
- Proposed Solution - To tackle this issue, the article suggests providing additional image examples alongside text instructions, allowing the model to learn desired transformations through few-shot image editing [2].
- Model Structure and Methodology - The InstaManip model employs a novel group self-attention mechanism to learn image transformation features from both text and image examples, enabling it to edit new input images accordingly [6][15].
- Learning Mechanism - The learning process is divided into two stages: the learning phase, where transferable knowledge is abstracted from examples, and the application phase, where this knowledge is applied to new scenarios [10][11].
- Group Self-Attention Mechanism - The model stacks multiple layers of group self-attention, which allows it to process text instructions and example images separately, strengthening both the learning and application phases (see the sketch after this summary) [16].
- Relation Regularization - To mitigate noise from example images that could mislead the model, a relation regularization technique is introduced, aligning the learned similarities with those derived from text instructions [17].
- Experimental Results - InstaManip outperforms previous models in both in-distribution and out-of-distribution settings, establishing itself as the state-of-the-art method for few-shot image editing [19][20].
- Ablation Studies - Ablation experiments demonstrate that both the group self-attention mechanism and relation regularization significantly enhance model performance, confirming the necessity of each component [21][22].
- Conclusion - The InstaManip model achieves superior results across multiple metrics and can further improve with an increased number of diverse example images [26].
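For a concrete picture of the group self-attention idea, the following is a minimal sketch of how a single attention mask could separate a learning group (instruction and exemplar tokens distilling into a set of manipulation tokens) from an application group (query-image tokens that read only the distilled knowledge). The token layout, group sizes, and mask structure are illustrative assumptions, not InstaManip's published design.

```python
import torch

def group_attention_mask(n_text, n_exemplar, n_manip, n_query):
    """Boolean mask (True = may attend) splitting one sequence into a
    learning group and an application group. Layout is an assumption:
    [text | exemplar images | manipulation tokens | query image]."""
    n = n_text + n_exemplar + n_manip + n_query
    t = slice(0, n_text)
    e = slice(n_text, n_text + n_exemplar)
    m = slice(n_text + n_exemplar, n - n_query)
    q = slice(n - n_query, n)
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Learning phase: instruction and exemplars interact; manipulation
    # tokens distill the transformation from both.
    mask[t, t] = True
    mask[e, t] = True
    mask[e, e] = True
    mask[m, t] = True
    mask[m, e] = True
    mask[m, m] = True
    # Application phase: query tokens apply the distilled transformation
    # without attending to raw exemplar tokens.
    mask[q, t] = True
    mask[q, m] = True
    mask[q, q] = True
    return mask

# Usage: PyTorch treats True in attn_mask as "blocked", so invert the mask.
x = torch.randn(1, 4 + 16 + 8 + 16, 64)
attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
out, _ = attn(x, x, x, attn_mask=~group_attention_mask(4, 16, 8, 16))
```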
9x inference speedup for diffusion language models! Shanghai Jiao Tong University: KV Cache is not a trick exclusive to autoregressive models
量子位· 2025-05-27 03:53
Core Viewpoint - The article introduces dLLM-Cache, a novel inference caching mechanism for diffusion-based Large Language Models (dLLMs) that significantly accelerates inference without compromising output quality [2][3][21].

Research Motivation
- Diffusion-based language models are emerging as a significant paradigm in language generation, showing capabilities superior to autoregressive models (ARMs) on tasks such as the "reversal curse" and mathematical reasoning [8][10].
- However, dLLM inference typically requires hundreds of denoising steps, incurring substantial computational cost; existing acceleration methods such as KV Cache are incompatible with the bidirectional attention architecture of dLLMs [10][11].

Method Overview
- The authors observed that the features of prompt tokens remain stable during the denoising process and can be reused, while only a small portion of response tokens exhibit significant changes between steps [13][14].
- The V-verify mechanism efficiently identifies which response tokens require updates based on changes in their underlying features, cutting redundant computation by up to 75% (see the sketch after this summary) [16][17][20].

Experimental Results
- dLLM-Cache was rigorously tested on the LLaDA 8B and Dream 7B models across various benchmark tasks, delivering over 5x faster inference while maintaining or slightly improving model performance [21][25].
- On specific tasks such as HotpotQA, dLLM-Cache achieved a remarkable 9.1x speedup without loss of quality [21][28].

General Applicability
- The method was successfully applied to different dLLM architectures, confirming its versatility and effectiveness in enhancing inference efficiency across models [25][28].
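As an illustration of the V-verify idea, the sketch below selects the response tokens whose value projections drifted most since the last full forward pass; only those would be recomputed, with the rest served from the cache. The fixed update ratio and cosine-similarity test are assumptions standing in for the paper's exact selection rule.

```python
import torch
import torch.nn.functional as F

def v_verify(v_step, v_cache, update_ratio=0.25):
    """Return indices of response tokens whose value vectors drifted most
    since the last full computation; only these get refreshed through the
    network, while the rest reuse cached features."""
    sim = F.cosine_similarity(v_step, v_cache, dim=-1)  # (num_response_tokens,)
    k = max(1, int(update_ratio * sim.numel()))
    return torch.topk(-sim, k).indices                  # least similar = most stale

# Usage: cheap value projections at the current denoising step vs. the cache.
v_step = torch.randn(128, 64)
v_cache = torch.randn(128, 64)
stale = v_verify(v_step, v_cache)   # recompute only these token positions
```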
Ditching autoregression! A Chinese team builds LLaDA-V, a pure-diffusion multimodal large model, setting a new SOTA on understanding tasks
机器之心· 2025-05-27 03:23
Core Viewpoint - The article discusses the development of LLaDA-V, a pure diffusion multimodal large language model (MLLM) that integrates visual instruction tuning, marking a significant breakthrough in multimodal understanding compared to traditional autoregressive methods [1][16].

Group 1: Model Development
- The research team extended LLaDA into the multimodal domain, introducing LLaDA-V, which uses a visual encoder (SigLIP 2) and an MLP connector to project visual features into the language embedding space, achieving effective multimodal alignment [2].
- LLaDA-V employs a discrete diffusion mechanism during both training and sampling, moving away from the autoregressive paradigm [2].

Group 2: Performance Highlights
- LLaDA-V demonstrates strong data scalability and competitive performance, outperforming the autoregressive baseline LLaMA3-V on 11 multimodal tasks, even though LLaDA-8B is slightly inferior to LLaMA3-8B on pure text tasks [5].
- The model achieves state-of-the-art (SOTA) performance on multimodal understanding tasks compared with existing hybrid autoregressive-diffusion models, validating the effectiveness of MLLM architectures built on powerful language diffusion models [8].
- LLaDA-V significantly narrows the performance gap with top autoregressive MLLMs, achieving comparable results on benchmarks such as MMStar [10].

Group 3: Core Methodology
- The core of LLaDA-V lies in combining visual instruction tuning with LLaDA's masking diffusion mechanism, yielding a robust training and inference process [13][15].
- The architecture follows the classic "visual encoder + MLP projector + language model" setup: the visual encoder extracts image features, and the MLP projector maps them into LLaDA's embedding space [15].
- LLaDA-V's training objective supports multi-turn multimodal dialogue by masking only the model's responses during training, optimizing its ability to generate coherent replies (see the sketch after this summary) [15].

Group 4: Future Outlook
- The successful integration of visual instruction tuning with masking diffusion models opens a new technical pathway for MLLM development, challenging the notion that multimodal intelligence must rely on autoregressive models [16].
- As language diffusion models continue to advance, they are expected to play a larger role and further push the boundaries of multimodal AI [16].
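The response-only masking objective can be sketched as follows: prompt and visual tokens are left intact, a random fraction of response tokens is replaced by a mask token, and the bidirectional model is trained to recover them. This is a simplified sketch of the objective described above; the noise schedule and 1/t loss weighting follow the usual masked-diffusion recipe and are assumptions here, not LLaDA-V's exact implementation.

```python
import torch
import torch.nn.functional as F

def response_masked_loss(model, tokens, is_response, mask_id):
    """One training step: corrupt only response tokens with a random
    masking ratio t, keep the (multimodal) prompt intact, and train the
    bidirectional model to recover the masked positions.
    tokens: (seq_len,) token ids; is_response: (seq_len,) bool."""
    t = torch.rand(()).clamp(min=0.05)       # masking ratio for this step
    noise = torch.rand(tokens.shape)
    corrupt = is_response & (noise < t)      # mask response positions only
    if not corrupt.any():                    # degenerate draw: skip this step
        return torch.zeros(())
    noisy = torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                    # (seq_len, vocab); full attention, no causal mask
    loss = F.cross_entropy(logits[corrupt], tokens[corrupt])
    return loss / t                          # 1/t reweighting, as in masked diffusion
```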
10,000 tokens in 12 seconds! Google unveils Gemini Diffusion, a text "diffusion model"; researcher: the demo had to be slowed down to watch
量子位· 2025-05-21 10:39
Core Viewpoint - Google DeepMind has introduced Gemini Diffusion, a new language model that uses diffusion technology to significantly improve text generation speed and quality compared to traditional autoregressive models [1][4][9].

Group 1: Technology and Performance
- Gemini Diffusion generates text at roughly 2,000 tokens per second, faster than the earlier Gemini 2.0 Flash-Lite [7][11].
- The model learns to produce output by iteratively refining noise, which enables rapid iteration and error correction during the generation process [6][10][15].
- Unlike traditional models that emit one token at a time, Gemini Diffusion generates entire blocks of tokens simultaneously, resulting in more coherent responses (see the sketch after this summary) [9][14].

Group 2: Benchmarking and Comparisons
- Benchmark tests show Gemini Diffusion performing comparably to larger models, with specific metrics indicating it outperforms Gemini 2.0 Flash-Lite on several coding tasks [8].
- For example, on the HumanEval benchmark, Gemini Diffusion scored 76.0%, slightly above Gemini 2.0 Flash-Lite's 75.8% [8].

Group 3: Implications and Future Directions
- The arrival of diffusion technology in language models suggests a potential shift toward hybrid models, echoing similar research at other institutions [19][20].
- The ability to perform non-causal reasoning during text generation opens new possibilities for complex problem-solving tasks that traditional autoregressive models struggle with [16][17].
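To make the block-wise decoding concrete, here is a generic parallel-denoising decoder of the kind text diffusion models use: start from a fully masked block and commit the most confident predictions at each step. Gemini Diffusion's actual sampler is not public, so everything here (step count, confidence-based unmasking, the `model` interface returning per-token logits) is an assumption.

```python
import torch

@torch.no_grad()
def diffusion_decode(model, prompt_ids, gen_len=128, steps=16, mask_id=0):
    """Start from an all-masked block and, at each step, commit the most
    confident predictions while re-predicting the rest in parallel."""
    seq = torch.cat([prompt_ids, torch.full((gen_len,), mask_id)])
    masked = torch.ones(gen_len, dtype=torch.bool)
    per_step = max(1, gen_len // steps)
    for _ in range(steps):
        logits = model(seq.unsqueeze(0))[0, -gen_len:]   # logits for the whole block
        conf, pred = logits.softmax(-1).max(-1)
        conf[~masked] = -1.0                             # never re-pick committed tokens
        pick = conf.topk(min(per_step, int(masked.sum()))).indices
        seq[-gen_len:][pick] = pred[pick]                # commit confident tokens
        masked[pick] = False
        if not masked.any():
            break
    return seq[-gen_len:]
```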
StepFun open-sources the image editing model Step1X-Edit; Alibaba's flagship AI app Quark launches a new "AI Camera" | AIGC Daily
创业邦· 2025-04-27 23:48
3. [Meta's Token-Shuffle debuts: autoregressive models break through a bottleneck, enabling AI generation of 2048×2048 images] Meta AI has introduced Token-Shuffle, aimed at solving the scaling difficulties autoregressive (AR) models face in generating high-resolution images. AR models have excelled at language generation and in recent years have been widely explored for image synthesis, but they hit a bottleneck with high-resolution images. Unlike text generation, which needs only a small number of tokens, a high-resolution image often requires thousands of tokens, so the computational cost explodes. As a result, many AR-based multimodal models can only handle low- to mid-resolution images, limiting their use in fine-grained image generation. Although diffusion models perform strongly at high resolutions, their complex sampling process and slower inference also have limitations (see the sketch below). (Sohu)

4. [Adobe releases the Firefly Image Model 4: AI image generation upgraded again] Adobe published a blog post introducing two text-to-image AI models, Firefly Image Model 4 and Firefly Image Model 4 Ultra, and previewed Crea ... for Photoshop and Illustrator.
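The token-shuffle idea from item 3 can be sketched as folding each s×s window of visual tokens into a single token along the channel dimension before the transformer and unfolding afterwards, cutting sequence length by a factor of s². The MLP fusion layers and dimensions below are illustrative assumptions, not Meta's published implementation.

```python
import torch
import torch.nn as nn

class TokenShuffle(nn.Module):
    """Merge each s x s window of visual tokens into one token along the
    channel dimension (shuffle), and split it back afterwards (unshuffle)."""
    def __init__(self, dim, s=2):
        super().__init__()
        self.s = s
        self.fuse = nn.Linear(dim * s * s, dim)    # s*s tokens -> 1 token
        self.split = nn.Linear(dim, dim * s * s)   # 1 token -> s*s tokens

    def shuffle(self, x, h, w):
        # x: (B, h*w, C) row-major grid of visual tokens
        B, _, C = x.shape
        x = x.view(B, h // self.s, self.s, w // self.s, self.s, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (h * w) // self.s**2, -1)
        return self.fuse(x)                        # (B, hw/s^2, C)

    def unshuffle(self, x, h, w):
        B, _, C = x.shape
        x = self.split(x).view(B, h // self.s, w // self.s, self.s, self.s, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, h * w, C)
        return x

# Usage: 1024 tokens shrink to 256 for the transformer, then expand back.
ts = TokenShuffle(dim=64, s=2)
x = torch.randn(1, 32 * 32, 64)
short = ts.shuffle(x, 32, 32)     # (1, 256, 64): 4x fewer tokens
full = ts.unshuffle(short, 32, 32)  # (1, 1024, 64) after the transformer
```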
"Computer vision has been killed off by GPT-4o" (tongue-in-cheek)
量子位· 2025-03-29 07:46
Core Viewpoint - The article discusses the advances in computer vision (CV) and image generation brought by the new GPT-4o model, highlighting its potential to disrupt existing tools and methodologies in the field [1][2].

Group 1: Technological Advancements
- GPT-4o introduces native multimodal image generation, expanding the functionality of AI tools beyond traditional applications [2][12].
- GPT-4o's image generation is based on an autoregressive model, unlike the diffusion model used in DALL·E, which allows for better adherence to instructions and stronger image editing capabilities [15][19].
- Observations suggest the generation may involve a multi-scale autoregressive scheme, in which a rough image is produced first and details are filled in while the coarse shape continues to evolve (see the sketch after this summary) [17][19].

Group 2: Industry Impact
- The leap in GPT-4o's capabilities has raised concerns among designers and computer vision researchers, signaling a significant shift in the competitive landscape of AI tools [6][10].
- OpenAI's approach of scaling foundational models to achieve these capabilities surprised many in the industry, suggesting a new trend in AI development [12][19].
- GPT-4o's potential to enhance applications in autonomous driving has also been noted, with implications for future developments in that sector [10].

Group 3: Community Engagement
- The article invites readers to share their experiences and creative uses of GPT-4o, fostering a collaborative environment for exploring AI applications [26].
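Since the multi-scale observation above is speculation, the following is purely an illustration of what coarse-to-fine autoregressive generation could look like, in the style of next-scale prediction: each resolution's token grid is predicted conditioned on all coarser grids already generated. The `model` interface is hypothetical; OpenAI has not disclosed GPT-4o's image stack.

```python
import torch

@torch.no_grad()
def coarse_to_fine_generate(model, scales=(4, 8, 16, 32)):
    """Predict the token grid at each resolution in one pass, conditioned
    on every coarser grid. `model(cond, scale=s)` is a hypothetical
    interface returning logits of shape (s*s, vocab_size)."""
    grids = []
    for s in scales:
        cond = torch.cat(grids) if grids else torch.empty(0, dtype=torch.long)
        logits = model(cond, scale=s)                     # (s*s, vocab_size)
        tokens = logits.softmax(-1).multinomial(1).squeeze(-1)
        grids.append(tokens)                              # commit this scale
    return grids[-1]                                      # finest-scale token grid
```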