Autoregressive Models

NextStep-1: An Exploration of the Autoregressive Paradigm for Image Generation
机器之心· 2025-08-18 05:15
Published by 机器之心 (机器之心编辑部). Autoregressive models are a fascinating cornerstone of the AIGC field. Developers have kept probing their limits in visual generation, from classic discrete sequence generation to hybrid paradigms that incorporate powerful diffusion models, with each step distilling the community's collective insight. Works such as MAR, Fluid, and LatentLM have been deeply inspiring, and they also expose room for further optimization: for instance, how can the information loss introduced by discretization be avoided? How can the model architecture become both lighter and stronger? To achieve this, the team adopts a lightweight Flow Matching Head (流匹配头). It lets the model: ... This design brings another notable advantage: architectural simplicity and purity. Because no external large diffusion model is needed as an "assistant," NextStep-1's overall architecture is highly unified and achieves truly end-to-end training. The 阶跃星辰 team believes NextStep-1's exploration points in an interesting and promising direction: it shows that a simple, efficient autoregressive model can be built without sacrificing continuity. This is only the first step. 阶跃星辰 has chosen to open-source NextStep-1, sincerely hoping it sparks more valuable discussion and, together with researchers in the community, continues to push generative techniques forward. With these questions in mind, 阶跃星辰 ...
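The excerpt mentions the Flow Matching Head only by name, so here is a minimal sketch of what such a head typically looks like: a small network that, conditioned on the transformer's hidden state for a position, regresses the velocity that transports Gaussian noise toward the continuous image token, using conditional flow matching with a linear interpolation path. The module sizes, the MLP architecture, and the names below are illustrative assumptions, not NextStep-1's actual implementation.

```python
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Minimal sketch: predicts the velocity that transports noise x0 toward
    a continuous image token x1, conditioned on the transformer hidden state
    h for that position. (Hypothetical sizes and architecture.)"""

    def __init__(self, hidden_dim=1024, token_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + token_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, token_dim),
        )

    def forward(self, h, x_t, t):
        # h: (B, hidden_dim), x_t: (B, token_dim), t: (B, 1) in [0, 1]
        return self.net(torch.cat([h, x_t, t], dim=-1))

def flow_matching_loss(head, h, x1):
    """Linear-path conditional flow matching:
    x_t = (1 - t) * x0 + t * x1, target velocity = x1 - x0."""
    x0 = torch.randn_like(x1)                           # noise sample
    t = torch.rand(x1.size(0), 1, device=x1.device)     # random time step
    x_t = (1 - t) * x0 + t * x1                         # point on the path
    v_pred = head(h, x_t, t)
    return ((v_pred - (x1 - x0)) ** 2).mean()           # regress the velocity
```

At inference such a head would be integrated over t (for example, a few Euler steps) to turn noise into the next continuous token, which is what allows the model to stay autoregressive without a discrete codebook.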
Lumina-mGPT 2.0: A Revival of Autoregressive Models That Rivals Top Diffusion Models
机器之心· 2025-08-12 00:15
Teams from Shanghai AI Laboratory and other institutions propose Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that unifies a broad range of tasks, including text-to-image generation, image-pair generation, subject-driven generation, multi-turn image editing, controllable generation, and dense prediction. The paper's first author, Xin Yi, is a PhD student at Nanjing University & 上海创智学院, currently interning at Shanghai AI Laboratory; his research covers image/video generation and unified multimodal generation and understanding. The corresponding author is Gao Peng, a young scientist at Shanghai AI Laboratory. The other authors are from Shanghai AI Laboratory, The Chinese University of Hong Kong, Shanghai Jiao Tong University, 上海创智学院, Zhejiang University of Technology, and elsewhere. Core techniques and breakthroughs: a fully independent training architecture. Unlike conventional approaches that rely on pretrained weights, Lumina-mGPT 2.0 uses a decoder-only Transformer trained entirely from scratch, starting from parameter initialization. This brings three advantages: unconstrained architecture design (2B and 7B parameter versions are provided), avoidance of licensing restrictions (such as Chameleon's copyright issues), and reduced inherent bias inherited from pretrained models. Paper title: Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling. Paper link: arxiv.org/pdf/2507.17801. GitHub ...
Autoregressive Models Strike Back at Image Generation: Pixel-Level Precise Control, More Efficient and Controllable Than Diffusion
量子位· 2025-07-29 05:05
Core Viewpoint
- The article discusses the limitations of Diffusion models in AI image generation, particularly in precise control, and introduces a new framework called MENTOR, which utilizes Autoregressive (AR) models for more efficient and controllable multimodal image generation [1][2][3].

Group 1: Challenges in Current Models
- Diffusion models face challenges in precise visual control, balancing multimodal inputs, and high training costs [2][6].
- The inherent randomness of Diffusion models makes it difficult to achieve precise control in high-fidelity tasks like image reconstruction [6].
- Existing methods often exhibit modality imbalance, over-relying on either reference images or text instructions [6].

Group 2: Introduction of MENTOR
- MENTOR is a novel AR framework that requires only one-tenth of the training data and suboptimal model components to outperform Diffusion methods like Emu2 and DreamEngine [2][3].
- The framework employs a unique two-stage training method to enable efficient multimodal image generation with pixel-level precision [3][8].

Group 3: MENTOR's Design and Training
- MENTOR features a unified AR architecture consisting of a multimodal encoder and an autoregressive generator, allowing for token-level alignment between inputs and outputs [9] (a generic sketch of this setup follows this summary).
- The two-stage training strategy includes:
  1. Multimodal Alignment Pretraining: Focuses on understanding different input types and establishing pixel-level and semantic alignment [10].
  2. Multimodal Instruction Tuning: Enhances the model's ability to follow instructions and reason across modalities [12].

Group 4: Performance and Efficiency
- MENTOR achieved competitive performance on DreamBench++, surpassing larger models like Emu2 (37 billion parameters) and DreamEngine (10.5 billion parameters) while maintaining a lower CP/PF ratio, indicating better balance between visual feature preservation and prompt following [15][17].
- The training process for MENTOR utilized approximately 3 million image-text pairs over 1.5 days, demonstrating significant efficiency compared to other baseline methods [18].

Group 5: Applications and Future Potential
- MENTOR's framework is highly versatile, capable of handling various complex multimodal generation tasks with minimal adjustments [24].
- The article concludes that MENTOR opens a new path for controllable image generation tasks, showcasing the potential of AR models in visual generation, while acknowledging that there are still areas where it lags behind top-tier Diffusion models [26].
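As a generic illustration of the token-level alignment described in Group 3, the sketch below prefixes multimodal condition tokens (reference image plus text, assumed already tokenized into ids) to the target image-token sequence and trains a causal decoder with next-token prediction. All components, names, and sizes are hypothetical and are not MENTOR's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedARGenerator(nn.Module):
    """Generic sketch of a unified AR setup: multimodal condition tokens are
    prefixed to the target image-token sequence, and a causal decoder is
    trained with next-token prediction. Hypothetical components only."""

    def __init__(self, vocab_size=16384, dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.decoder = nn.TransformerEncoder(        # used as a causal decoder stack
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, cond_tokens, target_tokens):
        # cond_tokens: (B, Lc) condition ids, target_tokens: (B, Lt) image-token ids
        seq = torch.cat([cond_tokens, target_tokens[:, :-1]], dim=1)
        x = self.embed(seq)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(x.device)
        h = self.decoder(x, mask=mask, is_causal=True)
        # Positions Lc-1 ... Lc+Lt-2 predict the Lt target tokens.
        logits = self.lm_head(h[:, cond_tokens.size(1) - 1:])
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), target_tokens.reshape(-1)
        )
```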
5x Inference Speedup, Unlocking Autoregressive Potential: Apple's New Work Lets LLMs Predict the Future
机器之心· 2025-07-24 04:08
Core Viewpoint
- The article discusses the advancements in language models, particularly focusing on a new framework developed by Apple researchers that allows autoregressive models to perform multi-token predictions, significantly improving inference speed while maintaining generation quality [7][8][9].

Group 1: Advances in Language Models
- Recent progress in language models is attributed to the availability of large-scale text data and the effectiveness of autoregressive training methods [2].
- Autoregressive models predict each token based on preceding context, which provides a clear advantage during training but incurs high computational costs during inference due to sequential execution [5][6].

Group 2: New Framework Development
- Apple researchers have developed a framework that enables pre-trained autoregressive language models to execute multi-token predictions, achieving up to 5.35 times speedup for code and math tasks, and approximately 2.5 times for general tasks [7].
- This innovation allows for a significant reduction in AI operational costs and the potential for powerful real-time assistants to run smoothly on lightweight devices [9].

Group 3: Research Findings
- The researchers confirmed that language models can generate multiple tokens in a single inference step, which is a promising development for speeding up generation processes [11].
- The study explored whether it is possible to train truly non-autoregressive language models, leading to the design of a training algorithm that minimally alters existing autoregressive frameworks while achieving efficient multi-token generation [13][14].

Group 4: Experimental Results
- Experiments conducted on the Tulu3-8B model demonstrated that the proposed multi-token generation algorithm achieved speedups ranging from approximately 1.5 to 5.2 times across various tasks, with the most significant improvements observed in programming and math tasks [46].
- The introduction of mask tokens and a lightweight sampling module allowed the model to leverage its full depth and representational capabilities, resulting in superior performance compared to existing multi-token prediction methods [23][24] (a simplified sketch of this drafting-and-verification idea follows this summary).

Group 5: Future Directions
- Future research could explore the applicability of this method during pre-training or downstream task adaptation phases to further assess its effectiveness [53].
- Another promising direction is the application of diffusion-based generation methods to multi-token prediction tasks, aiming to balance efficiency and quality [53].
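To make the Group 4 mechanism concrete, the sketch below drafts several future tokens in one forward pass by appending mask placeholders, then verifies them with a second standard pass and keeps the longest prefix the model itself agrees with (a speculative-style acceptance rule). The indexing convention, the acceptance rule, and the assumption that `model(ids)` returns raw logits are illustrative choices, not Apple's actual algorithm.

```python
import torch

def multi_token_step(model, input_ids, mask_id, k=4):
    """One decode step that drafts k future tokens at once, then verifies them
    with a single standard forward pass. Simplified sketch; assumes `model(ids)`
    returns logits of shape (1, len, vocab) and that each appended mask position
    is trained to fill in the token at that same position."""
    device = input_ids.device
    # 1) Draft: append k mask placeholders and read off k candidate tokens.
    drafted_in = torch.cat(
        [input_ids, torch.full((1, k), mask_id, dtype=torch.long, device=device)], dim=1)
    draft = model(drafted_in)[:, -k:, :].argmax(dim=-1)        # (1, k) candidates

    # 2) Verify: score the drafted tokens in one causal pass and keep the
    #    longest prefix whose next-token predictions match the draft.
    verify_logits = model(torch.cat([input_ids, draft], dim=1))
    ctx = input_ids.size(1)
    accepted = []
    for i in range(k):
        expected = verify_logits[0, ctx + i - 1].argmax().item()
        if expected != draft[0, i].item():
            break
        accepted.append(expected)
    if not accepted:                                           # always make progress
        accepted = [verify_logits[0, ctx - 1].argmax().item()]
    return torch.cat([input_ids, torch.tensor([accepted], device=device)], dim=1)
```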
Diffusion Language Models Write Code, 10x Faster Than Autoregressive Models
量子位· 2025-07-10 03:19
Core Viewpoint
- The article discusses the launch of Mercury, a new commercial-grade large language model based on diffusion technology, which can generate code at a significantly faster rate than traditional models.

Group 1: Model Innovation
- Mercury breaks the limitations of autoregressive models by predicting all tokens at once, enhancing generation speed [2] (a generic sketch of this parallel denoising pattern follows this summary).
- The model allows for dynamic error correction during the generation process, providing greater flexibility compared to traditional models [4][20].
- Despite using diffusion technology, Mercury retains the Transformer architecture, enabling the reuse of efficient training and inference optimization techniques [6][7].

Group 2: Performance Metrics
- Mercury's code generation speed can be up to 10 times faster than traditional tools, significantly reducing development cycles [8].
- On H100 GPUs, Mercury achieves a throughput of 1109 tokens per second, showcasing its efficient use of hardware [9][13].
- In benchmark tests, Mercury Coder Mini and Small achieved response times of 0.25 seconds and 0.31 seconds, respectively, outperforming many competitors [16].

Group 3: Error Correction and Flexibility
- The model incorporates a real-time error correction module that detects and corrects logical flaws in code during the denoising steps [21].
- Mercury integrates abstract syntax trees (AST) from programming languages like Python and Java to minimize syntax errors [22].

Group 4: Development Team
- Inception Labs, the developer of Mercury, consists of a team of experts from prestigious institutions, including Stanford and UCLA, with a focus on improving model performance using diffusion technology [29][34].
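The summary alludes to the general pattern of diffusion-style text generation: start from a fully masked output, score every position in parallel, and commit the most confident tokens over a fixed number of denoising steps. A generic sketch of that loop is below; it omits the re-masking and error-correction a production sampler like Mercury's would add, and all names and the confidence-based schedule are assumptions rather than Mercury's implementation.

```python
import torch

def masked_diffusion_decode(model, prompt_ids, mask_id, gen_len=64, steps=8):
    """Generic parallel denoising sketch: all generated positions start as masks;
    each step the model scores every position at once and the most confident
    fraction of remaining masks is committed. Assumes `model(ids)` returns
    logits of shape (1, len, vocab). Not Mercury's actual sampler."""
    device = prompt_ids.device
    ids = torch.cat(
        [prompt_ids, torch.full((1, gen_len), mask_id, dtype=torch.long, device=device)],
        dim=1)
    gen_slice = slice(prompt_ids.size(1), ids.size(1))
    for step in range(steps):
        probs = model(ids)[:, gen_slice, :].softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                  # per-position confidence & token
        still_masked = ids[:, gen_slice] == mask_id
        conf = conf.masked_fill(~still_masked, -1.0)    # only consider masked slots
        # Commit an even share of the remaining masked positions per step.
        n_commit = max(1, int(still_masked.sum().item() / (steps - step)))
        top = conf.topk(n_commit, dim=-1).indices
        ids[:, gen_slice].scatter_(1, top, pred.gather(1, top))
        if not (ids[:, gen_slice] == mask_id).any():
            break
    return ids
```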
A First: World Model and Action Model Fused in WorldVLA, a Fully Autoregressive Model
机器之心· 2025-07-03 08:01
Core Viewpoint
- Alibaba's Damo Academy has introduced WorldVLA, a model that integrates World Model and Action Model into a unified autoregressive framework, enhancing understanding and generation across text, images, and actions [1][4].

Summary by Sections

Research Overview
- The development of Vision-Language-Action (VLA) models has become a significant focus in robotic action modeling, typically built on large-scale pretrained multimodal language models (MLLMs) with added action output capabilities [4].
- Existing VLA models often lack a deep understanding of actions, treating them merely as output rather than analyzing them as input [5].

Model Description
- WorldVLA addresses the limitations of both VLA and World Models by using a unified autoregressive mechanism for action and image understanding and generation [5][10].
- It employs three independent encoders for processing images, text, and action data, sharing the same vocabulary to facilitate cross-modal tasks [12].

Mechanism and Strategy
- The World Model component generates visual representations based on input actions, learning the physical dynamics of the environment, while the Action Model enhances visual understanding [7].
- An action attention masking strategy is introduced to mitigate error accumulation during the generation of multiple actions, significantly improving performance in action chunking tasks [8][14] (a schematic version of this mask is sketched after this summary).

Experimental Results
- In the LIBERO benchmark, WorldVLA achieved a 4% improvement in grasp success rate compared to traditional action models and a 10% reduction in Fréchet Video Distance (FVD) compared to traditional world models [8].
- The introduction of the attention mask strategy led to a performance improvement in grasp success rates ranging from 4% to 23% in action chunking tasks [8].

Comparative Analysis
- WorldVLA outperformed other models in various metrics, demonstrating its effectiveness in integrating action and world modeling [18].
- The model's ability to generate the next frame based on actions and images showcases its advanced capabilities in visual prediction [24].
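The action attention masking strategy is described only at a high level, so the sketch below shows one schematic way to build such a mask: standard causal attention everywhere, except that action tokens may attend to the context (and themselves) but not to previously generated action tokens. The token layout and the exact rule are assumptions for illustration, not WorldVLA's code.

```python
import torch

def action_chunk_attention_mask(n_ctx, n_actions):
    """Boolean attention mask (True = blocked) for a sequence laid out as
    [context tokens ..., action_1, action_2, ..., action_k].
    Causal masking everywhere, plus: action tokens may not attend to earlier
    action tokens, only to the context and themselves. Schematic reading of
    WorldVLA's action attention masking, not its implementation."""
    n = n_ctx + n_actions
    blocked = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)  # causal part
    act = slice(n_ctx, n)
    blocked[act, act] = True                       # no action-to-action attention
    idx = torch.arange(n_ctx, n)
    blocked[idx, idx] = False                      # but keep each action's self-attention
    return blocked
```

Such a boolean mask could then be passed to a decoder's attention (for example as `attn_mask` in PyTorch attention modules, where True marks blocked positions).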
Challenging Autoregression: Diffusion Models Are Rewriting the Paradigm for the Next Generation of General-Purpose Models
机器之心· 2025-06-04 01:59
Core Viewpoint
- The article discusses the advancements in diffusion language models (dLLMs), particularly focusing on Google's Gemini Diffusion and its implications for AI development, highlighting the speed and performance improvements over traditional autoregressive models [1][8][35].

Group 1: Gemini Diffusion and Its Features
- Gemini Diffusion is noted for its impressive generation speed, being five times faster than previous models, and its ability to handle programming tasks effectively [2][8].
- The underlying mechanism of diffusion models allows for rapid iteration and error correction during the generation process, distinguishing it from autoregressive models [2][3].
- Gemini Diffusion's sampling speed can reach an astonishing 1479 tokens per second, showcasing its potential in various benchmarks [8][9].

Group 2: Development of Diffusion Language Models
- Prior to Gemini Diffusion, several research teams explored the feasibility of diffusion-based LLMs, including Stanford's Diffusion-LM and Fudan University's DiffusionBERT [3][4].
- The introduction of LLaDA, the first 8 billion parameter diffusion language model, marked a significant milestone in the field, achieving performance comparable to LLaMA 3 [4][21].
- Following LLaDA, other models like d1 and LaViDa have emerged, further establishing LLaDA as a foundational model in dLLM research [20][21].

Group 3: Multimodal Diffusion Language Models
- The emergence of diffusion multimodal language models (dMLLMs) is highlighted, with LLaDA-V and MMaDA being prominent examples that integrate visual and language processing capabilities [10][31].
- LLaDA-V combines visual instruction fine-tuning with the diffusion mechanism, demonstrating strong performance in multimodal understanding tasks [26][27].
- MMaDA showcases innovations in text reasoning and multimodal understanding, solidifying its position as a leading research outcome in the dMLLM space [31][32].

Group 4: Future Directions and Implications
- The article emphasizes the shift from autoregressive models to diffusion models as a significant paradigm change in AI, suggesting broader implications for future research and applications [35][36].
- The ongoing evolution of models like LLaDA and Gemini Diffusion indicates a growing ecosystem around dLLMs and dMLLMs, with potential applications extending into quantum computing [35][36].
CVPR 2025 Highlight | Boosting Autoregressive Models' Ability to Learn from Examples: A New Few-Shot Image Editing Paradigm, Open-Sourced
机器之心· 2025-06-01 03:30
Core Viewpoint
- The article discusses the development of a new autoregressive model called InstaManip, which enhances in-context learning capabilities to better address the challenges of few-shot image editing [26].

Summary by Sections

Introduction
- Recent advancements in diffusion models have significantly improved text-guided image editing algorithms, but performance declines when user requests are difficult to describe or deviate from the training data distribution [1][2].

Problem Statement
- The challenge arises when users want to edit images in ways that are not well-represented in the training dataset, such as transforming a regular car into a Lamborghini, which is hard to describe accurately with words alone [1].

Proposed Solution
- To tackle this issue, the article suggests providing additional image examples alongside text instructions, allowing the model to learn desired transformations through few-shot image editing [2].

Model Structure and Methodology
- The InstaManip model employs a novel group self-attention mechanism to learn image transformation features from both text and image examples, enabling it to edit new input images accordingly [6][15].

Learning Mechanism
- The learning process is divided into two stages: the learning phase, where transferable knowledge is abstracted from examples, and the application phase, where this knowledge is applied to new scenarios [10][11].

Group Self-Attention Mechanism
- The model incorporates multiple layers of group self-attention, which allows it to process text instructions and example images separately, enhancing the learning and application phases [16].

Relation Regularization
- To mitigate noise from example images that could mislead the model, a relation regularization technique is introduced, aligning the learned similarities with those derived from text instructions [17] (a minimal formulation of this idea is sketched after this summary).

Experimental Results
- InstaManip outperforms previous models in both in-distribution and out-of-distribution settings, establishing itself as the state-of-the-art method for few-shot image editing [19][20].

Ablation Studies
- Ablation experiments demonstrate that both the group self-attention mechanism and relation regularization significantly enhance model performance, confirming the necessity of each component [21][22].

Conclusion
- The InstaManip model achieves superior results across multiple metrics and can further improve with an increased number of diverse example images [26].
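Relation regularization is described as aligning the pairwise similarity of the example-derived manipulation features with the similarity implied by the text instructions. One minimal way to write that as a loss term is sketched below; the choice of cosine similarity and an MSE penalty is an assumption, not necessarily InstaManip's exact formulation.

```python
import torch
import torch.nn.functional as F

def relation_regularization(manip_feats, text_feats):
    """manip_feats: (N, D) features learned from example image pairs,
    text_feats:  (N, D) embeddings of the corresponding text instructions.
    Penalizes mismatch between the two pairwise cosine-similarity matrices,
    so noisy example images cannot pull the learned relations away from what
    the instructions say. Hypothetical formulation for illustration."""
    m = F.normalize(manip_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    sim_m = m @ m.t()                 # (N, N) learned relation structure
    sim_t = t @ t.t()                 # (N, N) instruction-derived structure
    return F.mse_loss(sim_m, sim_t)
```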
9x Inference Speedup for Diffusion Language Models! Shanghai Jiao Tong University: KV Cache Is Not Exclusive to Autoregressive Models
量子位· 2025-05-27 03:53
Core Viewpoint
- The article discusses the introduction of dLLM-Cache, a novel inference caching mechanism for diffusion-based Large Language Models (dLLMs), which significantly enhances inference speed without compromising output quality [2][3][21].

Research Motivation
- Diffusion-based language models are emerging as a significant paradigm in language generation, showcasing superior capabilities in tasks like the "reversal curse" and mathematical reasoning compared to autoregressive models (ARMs) [8][10].
- However, the inference process of dLLMs typically requires hundreds of denoising steps, leading to substantial computational costs and inefficiencies, particularly since existing acceleration methods like KV Cache are incompatible with the bidirectional attention architecture of dLLMs [10][11].

Method Overview
- The authors identified that features of prompt tokens remain stable during the denoising process, allowing for their reuse, while only a small portion of response tokens exhibit significant changes [13][14].
- The V-verify mechanism was introduced to efficiently identify which response tokens require updates based on the changes in their underlying features, achieving a reduction of up to 75% in redundant computations [16][17][20] (a simplified sketch of this selection rule follows this summary).

Experimental Results
- The effectiveness of dLLM-Cache was rigorously tested on LLaDA 8B and Dream 7B models across various benchmark tasks, demonstrating over 5 times acceleration in inference speed while maintaining or slightly improving model performance [21][25].
- In specific tasks like HotpotQA, dLLM-Cache achieved a remarkable 9.1 times speedup without loss of quality, showcasing its robust performance across different dLLM architectures [21][28].

General Applicability
- The dLLM-Cache method was successfully applied to different dLLM architectures, confirming its versatility and effectiveness in enhancing inference efficiency across various models [25][28].
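Based only on the mechanism described above (prompt features are stable across denoising steps and can be cached, while response tokens are selectively recomputed when their value vectors drift), a simplified version of the V-verify selection rule can be sketched as follows. The cosine-similarity test and the fixed update ratio are assumptions, not the paper's exact dLLM-Cache procedure.

```python
import torch
import torch.nn.functional as F

def select_tokens_to_update(v_new, v_cached, update_ratio=0.25):
    """V-verify-style selection (simplified): compare cheap value projections
    of response tokens at the current denoising step against the cached ones,
    and mark for full recomputation only the tokens whose values drifted the
    most. Prompt-token features are assumed cached elsewhere and never touched.

    v_new, v_cached: (L_response, D) value vectors at this step / last full compute.
    Returns indices of response tokens to fully recompute."""
    drift = 1.0 - F.cosine_similarity(v_new, v_cached, dim=-1)   # (L_response,)
    k = max(1, int(update_ratio * drift.numel()))
    return drift.topk(k).indices
```

Features for unselected tokens would simply be reused from the cache, which is where the reported 5x to 9x speedups would come from under such a scheme.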
Dropping Autoregression! A Chinese Team Builds the Pure-Diffusion Multimodal Model LLaDA-V, Setting a New SOTA on Understanding Tasks
机器之心· 2025-05-27 03:23
Core Viewpoint
- The article discusses the development of LLaDA-V, a pure diffusion multimodal large language model (MLLM) that integrates visual instruction tuning, marking a significant breakthrough in multimodal understanding compared to traditional autoregressive methods [1][16].

Group 1: Model Development
- The research team expanded LLaDA into the multimodal domain, introducing LLaDA-V, which utilizes a visual encoder (SigLIP 2) and an MLP connector to project visual features into the language embedding space, achieving effective multimodal alignment [2].
- LLaDA-V employs a discrete diffusion mechanism during both training and sampling phases, moving away from the autoregressive paradigm [2].

Group 2: Performance Highlights
- LLaDA-V demonstrates strong data scalability and competitive performance, outperforming the autoregressive baseline LLaMA3-V in 11 multimodal tasks, despite LLaDA-8B being slightly inferior to LLaMA3-8B in pure text tasks [5].
- The model achieves state-of-the-art (SOTA) performance in multimodal understanding tasks compared to existing mixed autoregressive-diffusion models, validating the effectiveness of the MLLM architecture based on powerful language diffusion models [8].
- LLaDA-V significantly narrows the performance gap with top autoregressive MLLMs, achieving comparable results in benchmarks like MMStar [10].

Group 3: Core Methodology
- The core of LLaDA-V lies in combining visual instruction tuning with LLaDA's masking diffusion mechanism, allowing for a robust training and inference process [13][15].
- The architecture consists of a classic "visual encoder + MLP projector + language model" setup, where the visual encoder extracts image features, and the MLP projector maps them to LLaDA's embedding space [15].
- LLaDA-V's training objective supports multi-turn multimodal dialogue by masking only the model's responses during training, optimizing the model's ability to generate coherent replies [15] (a schematic version of this objective follows this summary).

Group 4: Future Outlook
- The successful integration of visual instruction tuning with masking diffusion models opens a new technical pathway for MLLM development, challenging the notion that multimodal intelligence must rely on autoregressive models [16].
- The ongoing advancement of language diffusion models is expected to play a more significant role in the future, further pushing the boundaries of multimodal AI [16].
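The training objective in Group 3, masking only the model's responses at a random ratio tied to the diffusion time step and training the model to recover them while images, prompts, and earlier turns stay intact, can be written schematically as below. The masking schedule and the 1/t loss weighting are the standard choices for masked diffusion language models and are assumptions here, not taken from the LLaDA-V code.

```python
import torch
import torch.nn.functional as F

def masked_response_diffusion_loss(model, input_ids, response_mask, mask_id):
    """input_ids: (B, L) full multi-turn sequence (visual tokens assumed already
    projected and prepended upstream); response_mask: (B, L) bool, True on the
    positions belonging to the model's responses. Only those positions are ever
    masked, so the condition (images, prompts, prior turns) stays intact.
    Assumes `model(ids)` returns logits of shape (B, L, vocab)."""
    b, l = input_ids.shape
    t = torch.rand(b, 1, device=input_ids.device)         # per-sample mask ratio
    noise = torch.rand(b, l, device=input_ids.device)
    masked = (noise < t) & response_mask                   # mask a t-fraction of responses
    corrupted = input_ids.masked_fill(masked, mask_id)

    logits = model(corrupted)
    loss = F.cross_entropy(logits[masked], input_ids[masked], reduction="none")
    # Standard masked-diffusion weighting: upweight samples with low mask ratio.
    weights = (1.0 / t).expand(b, l)[masked]
    return (loss * weights).mean()
```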