自回归模型 - filings, earnings calls, financial reports, news - Reportify

自回归模型

Search documents

扩散语言模型写代码！速度比自回归快10倍

量子位· 2025-07-10 03:19

Core Viewpoint - The article discusses the launch of Mercury, a new commercial-grade large language model based on diffusion technology, which can generate code at a significantly faster rate than traditional models. Group 1: Model Innovation - Mercury breaks the limitations of autoregressive models by predicting all tokens at once, enhancing generation speed [2] - The model allows for dynamic error correction during the generation process, providing greater flexibility compared to traditional models [4][20] - Despite using diffusion technology, Mercury retains the Transformer architecture, enabling the reuse of efficient training and inference optimization techniques [6][7] Group 2: Performance Metrics - Mercury's code generation speed can be up to 10 times faster than traditional tools, significantly reducing development cycles [8] - On H100 GPUs, Mercury achieves a throughput of 1109 tokens per second, showcasing its efficient use of hardware [9][13] - In benchmark tests, Mercury Coder Mini and Small achieved response times of 0.25 seconds and 0.31 seconds, respectively, outperforming many competitors [16] Group 3: Error Correction and Flexibility - The model incorporates a real-time error correction module that detects and corrects logical flaws in code during the denoising steps [21] - Mercury integrates abstract syntax trees (AST) from programming languages like Python and Java to minimize syntax errors [22] Group 4: Development Team - Inception Labs, the developer of Mercury, consists of a team of experts from prestigious institutions, including Stanford and UCLA, with a focus on improving model performance using diffusion technology [29][34]

大语言模型

自回归模型

Artificial Intelligence

Mercury Coder Mini

大语言模型

自回归模型

Artificial Intelligence

Mercury Coder Mini

首次！世界模型、动作模型融合，全自回归模型WorldVLA来了

机器之心· 2025-07-03 08:01

Core Viewpoint - Alibaba's Damo Academy has introduced WorldVLA, a model that integrates World Model and Action Model into a unified autoregressive framework, enhancing understanding and generation across text, images, and actions [1][4]. Summary by Sections Research Overview - The development of Vision-Language-Action (VLA) models has become a significant focus in robotic action modeling, typically built on large-scale pretrained multimodal language models (MLLMs) with added action output capabilities [4]. - Existing VLA models often lack a deep understanding of actions, treating them merely as output rather than analyzing them as input [5]. Model Description - WorldVLA addresses the limitations of both VLA and World Models by using a unified autoregressive mechanism for action and image understanding and generation [5][10]. - It employs three independent encoders for processing images, text, and action data, sharing the same vocabulary to facilitate cross-modal tasks [12]. Mechanism and Strategy - The World Model component generates visual representations based on input actions, learning the physical dynamics of the environment, while the Action Model enhances visual understanding [7]. - An action attention masking strategy is introduced to mitigate error accumulation during the generation of multiple actions, significantly improving performance in action chunking tasks [8][14]. Experimental Results - In the LIBERO benchmark, WorldVLA achieved a 4% improvement in grasp success rate compared to traditional action models and a 10% reduction in Fréchet Video Distance (FVD) compared to traditional world models [8]. - The introduction of the attention mask strategy led to a performance improvement in grasp success rates ranging from 4% to 23% in action chunking tasks [8]. Comparative Analysis - WorldVLA outperformed other models in various metrics, demonstrating its effectiveness in integrating action and world modeling [18]. - The model's ability to generate the next frame based on actions and images showcases its advanced capabilities in visual prediction [24].

视觉 - 语言 - 动作模型

自回归模型

视觉 - 语言 - 动作模型

自回归模型

冲击自回归，扩散模型正在改写下一代通用模型范式

机器之心· 2025-06-04 01:59

Core Viewpoint - The article discusses the advancements in diffusion language models (dLLMs), particularly focusing on Google's Gemini Diffusion and its implications for AI development, highlighting the speed and performance improvements over traditional autoregressive models [1][8][35]. Group 1: Gemini Diffusion and Its Features - Gemini Diffusion is noted for its impressive generation speed, being five times faster than previous models, and its ability to handle programming tasks effectively [2][8]. - The underlying mechanism of diffusion models allows for rapid iteration and error correction during the generation process, distinguishing it from autoregressive models [2][3]. - Gemini Diffusion's sampling speed can reach an astonishing 1479 tokens per second, showcasing its potential in various benchmarks [8][9]. Group 2: Development of Diffusion Language Models - Prior to Gemini Diffusion, several research teams explored the feasibility of diffusion-based LLMs, including Stanford's Diffusion-LM and Fudan University's DiffusionBERT [3][4]. - The introduction of LLaDA, the first 8 billion parameter diffusion language model, marked a significant milestone in the field, achieving performance comparable to LLaMA 3 [4][21]. - Following LLaDA, other models like d1 and LaViDa have emerged, further establishing LLaDA as a foundational model in dLLM research [20][21]. Group 3: Multimodal Diffusion Language Models - The emergence of diffusion multimodal language models (dMLLMs) is highlighted, with LLaDA-V and MMaDA being prominent examples that integrate visual and language processing capabilities [10][31]. - LLaDA-V combines visual instruction fine-tuning with the diffusion mechanism, demonstrating strong performance in multimodal understanding tasks [26][27]. - MMaDA showcases innovations in text reasoning and multimodal understanding, solidifying its position as a leading research outcome in the dMLLM space [31][32]. Group 4: Future Directions and Implications - The article emphasizes the shift from autoregressive models to diffusion models as a significant paradigm change in AI, suggesting broader implications for future research and applications [35][36]. - The ongoing evolution of models like LLaDA and Gemini Diffusion indicates a growing ecosystem around dLLMs and dMLLMs, with potential applications extending into quantum computing [35][36].

自回归模型

Artificial Intelligence

Gemini Diffusion

自回归模型

Artificial Intelligence

Gemini Diffusion

CVPR 2025 Highlight | 提升自回归模型样例学习能力，Few-shot图像编辑新范式开源

机器之心· 2025-06-01 03:30

Core Viewpoint - The article discusses the development of a new autoregressive model called InstaManip, which enhances in-context learning capabilities to better address the challenges of few-shot image editing [26]. Summary by Sections Introduction - Recent advancements in diffusion models have significantly improved text-guided image editing algorithms, but performance declines when user requests are difficult to describe or deviate from the training data distribution [1][2]. Problem Statement - The challenge arises when users want to edit images in ways that are not well-represented in the training dataset, such as transforming a regular car into a Lamborghini, which is hard to describe accurately with words alone [1]. Proposed Solution - To tackle this issue, the article suggests providing additional image examples alongside text instructions, allowing the model to learn desired transformations through few-shot image editing [2]. Model Structure and Methodology - The InstaManip model employs a novel group self-attention mechanism to learn image transformation features from both text and image examples, enabling it to edit new input images accordingly [6][15]. Learning Mechanism - The learning process is divided into two stages: the learning phase, where transferable knowledge is abstracted from examples, and the application phase, where this knowledge is applied to new scenarios [10][11]. Group Self-Attention Mechanism - The model incorporates multiple layers of group self-attention, which allows it to process text instructions and example images separately, enhancing the learning and application phases [16]. Relation Regularization - To mitigate noise from example images that could mislead the model, a relation regularization technique is introduced, aligning the learned similarities with those derived from text instructions [17]. Experimental Results - InstaManip outperforms previous models in both in-distribution and out-of-distribution settings, establishing itself as the state-of-the-art method for few-shot image editing [19][20]. Ablation Studies - Ablation experiments demonstrate that both the group self-attention mechanism and relation regularization significantly enhance model performance, confirming the necessity of each component [21][22]. Conclusion - The InstaManip model achieves superior results across multiple metrics and can further improve with an increased number of diverse example images [26].

关系正则化

few-shot图像编辑

自回归模型

分组自注意力机制

Artificial Intelligence

关系正则化

few-shot图像编辑

自回归模型

分组自注意力机制

Artificial Intelligence

扩散语言模型九倍推理加速！上海交大：KV Cache并非自回归模型的专属技巧

量子位· 2025-05-27 03:53

Core Viewpoint - The article discusses the introduction of dLLM-Cache, a novel inference caching mechanism for diffusion-based Large Language Models (dLLMs), which significantly enhances inference speed without compromising output quality [2][3][21]. Research Motivation - Diffusion-based language models are emerging as a significant paradigm in language generation, showcasing superior capabilities in tasks like "reversing the curse" and mathematical reasoning compared to autoregressive models (ARMs) [8][10]. - However, the inference process of dLLMs typically requires hundreds of denoising steps, leading to substantial computational costs and inefficiencies, particularly since existing acceleration methods like KV Cache are incompatible with the bidirectional attention architecture of dLLMs [10][11]. Method Overview - The authors identified that features of prompt tokens remain stable during the denoising process, allowing for their reuse, while only a small portion of response tokens exhibit significant changes [13][14]. - The V-verify mechanism was introduced to efficiently identify which response tokens require updates based on the changes in their underlying features, achieving a reduction of up to 75% in redundant computations [16][17][20]. Experimental Results - The effectiveness of dLLM-Cache was rigorously tested on LLaDA 8B and Dream 7B models across various benchmark tasks, demonstrating over 5 times acceleration in inference speed while maintaining or slightly improving model performance [21][25]. - In specific tasks like HotpotQA, dLLM-Cache achieved a remarkable 9.1 times speedup without loss of quality, showcasing its robust performance across different dLLM architectures [21][28]. General Applicability - The dLLM-Cache method was successfully applied to different dLLM architectures, confirming its versatility and effectiveness in enhancing inference efficiency across various models [25][28].

扩散式大语言模型

自回归模型

Artificial Intelligence

扩散式大语言模型

自回归模型

Artificial Intelligence

舍弃自回归！国内团队打造纯扩散多模态大模型LLaDA-V，理解任务新SOTA

机器之心· 2025-05-27 03:23

Core Viewpoint - The article discusses the development of LLaDA-V, a pure diffusion multimodal large language model (MLLM) that integrates visual instruction tuning, marking a significant breakthrough in multimodal understanding compared to traditional autoregressive methods [1][16]. Group 1: Model Development - The research team expanded LLaDA into the multimodal domain, introducing LLaDA-V, which utilizes a visual encoder (SigLIP 2) and an MLP connector to project visual features into the language embedding space, achieving effective multimodal alignment [2]. - LLaDA-V employs a discrete diffusion mechanism during both training and sampling phases, moving away from the autoregressive paradigm [2]. Group 2: Performance Highlights - LLaDA-V demonstrates strong data scalability and competitive performance, outperforming the autoregressive baseline LLaMA3-V in 11 multimodal tasks, despite LLaDA-8B being slightly inferior to LLaMA3-8B in pure text tasks [5]. - The model achieves state-of-the-art (SOTA) performance in multimodal understanding tasks compared to existing mixed autoregressive-diffusion models, validating the effectiveness of the MLLM architecture based on powerful language diffusion models [8]. - LLaDA-V significantly narrows the performance gap with top autoregressive MLLMs, achieving comparable results in benchmarks like MMStar [10]. Group 3: Core Methodology - The core of LLaDA-V lies in combining visual instruction tuning with LLaDA's masking diffusion mechanism, allowing for a robust training and inference process [13][15]. - The architecture consists of a classic "visual encoder + MLP projector + language model" setup, where the visual encoder extracts image features, and the MLP projector maps them to LLaDA's embedding space [15]. - LLaDA-V's training objective supports multi-turn multimodal dialogue by masking only the model's responses during training, optimizing the model's ability to generate coherent replies [15]. Group 4: Future Outlook - The successful integration of visual instruction tuning with masking diffusion models opens a new technical pathway for MLLM development, challenging the notion that multimodal intelligence must rely on autoregressive models [16]. - The ongoing advancement of language diffusion models is expected to play a more significant role in the future, further pushing the boundaries of multimodal AI [16].

多模态大语言模型

自回归模型

Artificial Intelligence

多模态大语言模型

自回归模型

Artificial Intelligence

12秒生成1万token！谷歌推出文本「扩散模型」Gemini Diffusion，研究员：演示都得降速看

量子位· 2025-05-21 10:39

Core Viewpoint - Google DeepMind has introduced Gemini Diffusion, a new language model that utilizes diffusion technology to significantly enhance text generation speed and quality compared to traditional autoregressive models [1][4][9]. Group 1: Technology and Performance - Gemini Diffusion can generate text at a speed of 2000 tokens per second, which is faster than the previous model, Gemini 2.0 Flash-Lite [7][11]. - The model employs a unique approach of refining noise to learn output generation, allowing for rapid iterations and error correction during the generation process [6][10][15]. - Unlike traditional models that generate one token at a time, Gemini Diffusion can generate entire blocks of tokens simultaneously, resulting in more coherent responses [14][9]. Group 2: Benchmarking and Comparisons - Benchmark tests show that Gemini Diffusion performs comparably to larger models, with specific metrics indicating it outperforms Gemini 2.0 Flash-Lite in several coding tasks [8]. - For example, in the HumanEval benchmark, Gemini Diffusion achieved a score of 76.0%, slightly higher than Gemini 2.0 Flash-Lite's 75.8% [8]. Group 3: Implications and Future Directions - The introduction of diffusion technology in language models suggests a potential shift towards more hybrid models in the future, as seen in similar research by other institutions [19][20]. - The ability to perform non-causal reasoning during text generation opens up new possibilities for complex problem-solving tasks that traditional autoregressive models struggle with [16][17].

自回归模型

Gemini Diffusion

Gemini 2.0 Flash - Lite

自回归模型

Gemini Diffusion

Gemini 2.0 Flash - Lite

阶跃星辰开源图像编辑模型Step1X-Edit；阿里巴巴AI旗舰应用夸克发布全新“AI相机”丨AIGC日报

创业邦· 2025-04-27 23:48

扫码订阅 AIGC 产业日报， 3.【Meta Token-Shuffle登场：自回归模型突破瓶颈，可AI生成 2048×2048 分辨率图像】报道称Meta AI创新推出Token-Shuffle，目标解决自回归（Autoregressive，AR）模型在生成高分辨率图像方面的扩展难题。在语言生成方面，自回归模型大放异彩，近年来也被广泛探索用于图像合成，然而在面对高分辨率图像时，AR模型遭遇瓶颈。不同于文本生成仅需少量token，图像合成中高分辨率图片往往需要数千个 token，计算成本随之暴增。这让许多基于 AR 的多模态模型只能处理低中分辨率图像，限制了其在精细图像生成中的应用。尽管扩散模型（Diffusion Models）在高分辨率上表现强劲，但其复杂的采样过程和较慢的推理速度也存在局限。（搜狐） 4.【Adobe发布Firefly Image Model 4模型：AI生图再升级】Adobe发布博文，推出Firefly Image Model 4和 Firefly Image Model 4 Ultra两款文本生成图像AI模型，并预告针对Photoshop和Illustrator的Crea ...

自回归模型

Artificial Intelligence

Firefly Image Model 4

Firefly Image Model 4 Ultra

自回归模型

Artificial Intelligence

Firefly Image Model 4

Firefly Image Model 4 Ultra

“计算机视觉被GPT-4o终结了”（狗头）

量子位· 2025-03-29 07:46

Core Viewpoint - The article discusses the advancements in computer vision (CV) and image generation capabilities brought by the new GPT-4o model, highlighting its potential to disrupt existing tools and methodologies in the field [1][2]. Group 1: Technological Advancements - GPT-4o introduces native multimodal image generation, expanding the functionalities of AI tools beyond traditional applications [2][12]. - The image generation process in GPT-4o is based on a self-regressive model, differing from the diffusion model used in DALL·E, which allows for better adherence to instructions and enhanced image editing capabilities [15][19]. - Observations suggest that the image generation may involve a multi-scale self-regressive combination, where a rough image is generated first, followed by detail filling while the rough shape evolves [17][19]. Group 2: Industry Impact - The advancements in GPT-4o's capabilities have raised concerns among designers and computer vision researchers, indicating a significant shift in the competitive landscape of AI tools [6][10]. - OpenAI's approach of scaling foundational models to achieve these capabilities has surprised many in the industry, suggesting a new trend in AI development [12][19]. - The potential for GPT-4o to enhance applications in autonomous driving has been noted, with implications for future developments in this sector [10]. Group 3: Community Engagement - The article encourages community members to share their experiences and innovative uses of GPT-4o, fostering a collaborative environment for exploring AI applications [26].

计算机视觉

自回归模型

多模态模型

Artificial Intelligence

计算机视觉

自回归模型

多模态模型

Artificial Intelligence