Workflow
自回归模型
icon
Search documents
速递|斯坦福教授创业,Inception获5000万美元种子轮融资,用扩散模型解锁实时AI应用
Z Potentials· 2025-11-07 02:12
Core Insights - The article discusses the current surge of funding into AI startups, highlighting it as a golden period for AI researchers to validate their ideas [1] - Inception, a startup developing diffusion-based AI models, recently raised $50 million in seed funding, led by Menlo Ventures with participation from several notable investors [2] Company Overview - Inception is focused on developing diffusion models, which generate outputs through iterative optimization rather than sequential generation [3] - The project leader, Stefano Ermon, has been researching diffusion models prior to the recent AI boom and aims to apply these models to a broader range of tasks [3] Technology and Innovation - Inception has released a new version of its Mercury model, designed specifically for software development, which has been integrated into various development tools [3] - Ermon claims that diffusion-based models will significantly optimize two critical metrics: latency and computational cost, stating that these models are faster and more efficient than those built by other companies [3][5] - Diffusion models differ structurally from autoregressive models, which dominate text-based AI services, and are believed to perform better when handling large volumes of text or data limitations [5] Performance Metrics - The diffusion models exhibit greater flexibility in hardware utilization, which is increasingly important as AI's infrastructure demands grow [5] - Ermon's benchmarks indicate that the models can process over 1,000 tokens per second, surpassing the capabilities of existing autoregressive technologies due to their inherent support for parallel processing [5]
上海AI Lab发布混合扩散语言模型SDAR:首个突破6600 tgs的开源扩散语言模型
机器之心· 2025-11-01 04:22
Core Insights - The article introduces a new paradigm called SDAR (Synergistic Diffusion-AutoRegression) that addresses the slow inference speed and high costs associated with large model applications, which are primarily due to the serial nature of autoregressive (AR) models [2][3][4]. Group 1: SDAR Paradigm - SDAR effectively decouples training and inference, combining the high performance of AR models with the parallel inference advantages of diffusion models, allowing for low-cost transformation of any AR model into a parallel decoding model [4][11]. - Experimental results show that SDAR not only matches but often surpasses the performance of original AR models across multiple benchmarks, achieving up to a 12.3 percentage point advantage in complex scientific reasoning tasks [6][28]. Group 2: Performance and Efficiency - SDAR maintains the performance of AR models while significantly improving inference speed and reducing costs, demonstrating that larger models benefit more from parallelization without sacrificing performance [17][19]. - The research indicates that SDAR can be adapted to any mainstream AR model at a low cost, achieving comparable or superior performance in downstream tasks [19][29]. Group 3: Experimental Validation - The study conducted rigorous experiments to compare SDAR's performance with AR models, confirming that SDAR can achieve substantial speed improvements in real-world applications, with SDAR-8B-chat showing a 2.3 times acceleration over its AR counterpart [23][20]. - The results highlight that SDAR's unique generation mechanism does not compromise its complex reasoning capabilities, retaining long-chain reasoning abilities and excelling in tasks requiring understanding of structured information [28][29]. Group 4: Future Implications - SDAR represents a significant advancement in the field of large models, providing a powerful and flexible tool that lowers application barriers and opens new avenues for exploring higher performance and efficiency in AI reasoning paradigms [29][31].
视觉生成的另一条路:Infinity 自回归架构的原理与实践
AI前线· 2025-10-31 05:42
Core Insights - The article discusses the significant advancements in visual autoregressive models, particularly highlighting the potential of these models in the context of AI-generated content (AIGC) and their competitive edge against diffusion models [2][4][11]. Group 1: Visual Autoregressive Models - Visual autoregressive models (VAR) utilize a "coarse-to-fine" approach, starting with low-resolution images and progressively refining them to high-resolution outputs, which aligns more closely with human visual perception [12][18]. - The VAR model architecture includes an improved VQ-VAE that employs a hierarchical structure, allowing for efficient encoding and reconstruction of images while minimizing token usage [15][30]. - VAR has demonstrated superior image generation quality compared to existing models like DiT, showcasing a robust scaling curve that indicates performance improvements with increased model size and computational resources [18][49]. Group 2: Comparison with Diffusion Models - Diffusion models operate by adding Gaussian noise to images and then training a network to reverse this process, maintaining the original resolution throughout [21][25]. - The key advantages of VAR over diffusion models include higher training parallelism and a more intuitive process that mimics human visual cognition, although diffusion models can correct errors through iterative refinement [27][29]. - VAR's approach allows for faster inference times, with the Infinity model achieving significant speed improvements over comparable diffusion models [46][49]. Group 3: Innovations in Tokenization and Error Correction - The Infinity framework introduces a novel "bitwise tokenizer" that enhances reconstruction quality while allowing for a larger vocabulary size, thus improving detail and instruction adherence in generated images [31][41]. - A self-correction mechanism is integrated into the training process, enabling the model to learn from previous errors and significantly reducing cumulative error during inference [35][40]. - The findings indicate that larger models benefit from larger vocabularies, reinforcing the reliability of scaling laws in model performance [41][49].
从300多篇工作中,看VLA在不同场景下的应用和实现......
具身智能之心· 2025-09-25 04:00
Core Insights - The article discusses the emergence of Vision Language Action (VLA) models, marking a shift in robotics from traditional strategy-based control to a more generalized robotic technology paradigm, enabling active decision-making in complex environments [2][5][20] - It emphasizes the integration of large language models (LLMs) and vision-language models (VLMs) to enhance robotic operations, providing greater flexibility and precision in task execution [6][12] - The survey outlines a clear classification system for VLA methods, categorizing them into autoregressive, diffusion, reinforcement learning, hybrid, and specialized methods, while also addressing the unique contributions and challenges within each category [7][10][22] Group 1: VLA Model Overview - VLA models represent a significant advancement in robotics, allowing for the unification of perception, language understanding, and executable control within a single modeling framework [15][20] - The article categorizes VLA methods into five paradigms: autoregressive, diffusion, reinforcement learning, hybrid, and specialized, detailing their design motivations and core strategies [10][22][23] - The integration of LLMs into VLA systems transforms them from passive input parsers to semantic intermediaries, enhancing their ability to handle long and complex tasks [29][30] Group 2: Applications and Challenges - VLA models have practical applications across various robotic forms, including robotic arms, quadrupeds, humanoid robots, and autonomous vehicles, showcasing their deployment in diverse scenarios [8][20] - The article identifies key challenges in the VLA field, such as data limitations, reasoning speed, and safety concerns, which need to be addressed to accelerate the development of VLA models and general robotic technology [8][19][20] - The reliance on high-quality datasets and simulation platforms is crucial for the effective training and evaluation of VLA models, addressing issues of data scarcity and real-world testing risks [16][19] Group 3: Future Directions - The survey outlines future research directions for VLA, including addressing data limitations, enhancing reasoning speed, and improving safety measures to facilitate the advancement of general embodied intelligence [8][20][21] - It highlights the importance of developing scalable and efficient VLA models that can adapt to various tasks and environments, emphasizing the need for ongoing innovation in this rapidly evolving field [20][39] - The article concludes by underscoring the potential of VLA models to bridge the gap between perception, understanding, and action, positioning them as a key frontier in embodied artificial intelligence [20][21][39]
深度综述 | 300+论文带你看懂:纯视觉如何将VLA推向自动驾驶和具身智能巅峰!
自动驾驶之心· 2025-09-24 23:33
Core Insights - The emergence of Vision Language Action (VLA) models signifies a paradigm shift in robotics from traditional strategy-based control to general-purpose robotic technology, transforming Vision Language Models (VLMs) from passive sequence generators to active agents capable of executing operations and making decisions in complex, dynamic environments [1][5][11] Summary by Sections Introduction - Robotics has historically relied on pre-programmed instructions and control strategies for task execution, primarily in simple, repetitive tasks [5] - Recent advancements in AI and deep learning have enabled the integration of perception, detection, tracking, and localization technologies, leading to the development of embodied intelligence and autonomous driving [5] - Current robots often operate as "isolated agents," lacking effective interaction with humans and external environments, prompting researchers to explore the integration of Large Language Models (LLMs) and VLMs for more precise and flexible robotic operations [5][6] Background - The development of VLA models marks a significant step towards general embodied intelligence, unifying visual perception, language understanding, and executable control within a single modeling framework [11][16] - The evolution of VLA models is supported by breakthroughs in single-modal foundational models across computer vision, natural language processing, and reinforcement learning [13][16] VLA Models Overview - VLA models have rapidly developed due to advancements in multi-modal representation learning, generative modeling, and reinforcement learning [24] - The core design of VLA models includes the integration of visual encoding, LLM reasoning, and decision-making frameworks, aiming to bridge the gap between perception, understanding, and action [23][24] VLA Methodologies - VLA methods are categorized into five paradigms: autoregressive, diffusion models, reinforcement learning, hybrid methods, and specialized approaches, each with distinct design motivations and core strategies [6][24] - Autoregressive models focus on sequential generation of actions based on historical context and task instructions, demonstrating scalability and robustness [26][28] Applications and Resources - VLA models are applicable in various robotic domains, including robotic arms, quadrupedal robots, humanoid robots, and wheeled robots (autonomous vehicles) [7] - The development of VLA models heavily relies on high-quality datasets and simulation platforms to address challenges related to data scarcity and high risks in real-world testing [17][21] Challenges and Future Directions - Key challenges in the VLA field include data limitations, reasoning speed, and safety concerns, which need to be addressed to accelerate the development of VLA models and general robotic technologies [7][18] - Future research directions are outlined to enhance the capabilities of VLA models, focusing on improving data diversity, enhancing reasoning mechanisms, and ensuring safety in real-world applications [7][18] Conclusion - The review emphasizes the need for a clear classification system for pure VLA methods, highlighting the significant features and innovations of each category, and providing insights into the resources necessary for training and evaluating VLA models [9][24]
扩散语言模型也有MoE版本了!蚂蚁&人大从头训练LLaDA-MoE,即将完全开源
机器之心· 2025-09-12 11:31
Core Viewpoint - The article discusses the development of the LLaDA-MoE model, the first native MoE architecture diffusion language model trained from scratch, which demonstrates significant performance and efficiency advantages over traditional autoregressive models [2][15][18]. Group 1: Model Development and Performance - The LLaDA-MoE model was trained on 20 terabytes of data and features 1.4 billion active parameters, achieving performance comparable to denser autoregressive models like Qwen2.5-3B while maintaining faster inference speeds [15][17][29]. - The LLaDA series has rapidly evolved, with LLaDA-MoE being a notable milestone, surpassing previous models like LLaDA1.0/1.5 and Dream-7B in various benchmark tests [13][18][29]. - The model's architecture allows for significant scaling potential, with plans to explore higher sparsity ratios and larger MoE diffusion language models [29][40]. Group 2: Technical Innovations and Advantages - The diffusion model approach allows for parallel decoding, bidirectional modeling, and iterative correction, addressing limitations of autoregressive models such as serial bottlenecks and lack of error correction capabilities [38][40]. - Evidence suggests that diffusion language models can achieve better learning outcomes than autoregressive models, particularly in scenarios with limited data, demonstrating a data utilization efficiency that can exceed three times that of autoregressive models [40][41]. - The training framework and infrastructure developed by Ant Group, including the ATorch framework, supports the efficient training of large-scale MoE models [25][26]. Group 3: Strategic Vision and Future Directions - The development of LLaDA-MoE reflects a strategic choice to explore high-potential areas in AI, moving beyond established paths to enhance the limits of intelligence [44][47]. - Ant Group's commitment to innovation is evident in its previous projects and ongoing research in areas like dynamic MoE architectures and hybrid linear architectures, all aimed at achieving general artificial intelligence (AGI) [45][46][47].
NextStep-1:一次在图像生成上自回归范式的探索
机器之心· 2025-08-18 05:15
Core Insights - The article discusses the development of NextStep-1, a new autoregressive model for image generation that operates directly in continuous visual space, avoiding the information loss associated with discretization [2][3][4] - The model utilizes a lightweight Flow Matching Head, which simplifies the architecture and allows for end-to-end training without reliance on external diffusion models [4][5] - The exploration aims to provide a new perspective in the multimodal generation field, emphasizing the potential for creating efficient and high-fidelity generative models [26][33] Technical Framework - NextStep-1 is built on a powerful Transformer backbone network with 14 billion parameters, complemented by a Flow Matching Head with 157 million parameters for generating continuous image patches [7][8] - The model generates images autoregressively by producing patches sequentially, which helps bypass the bottleneck of discretization [8] - The architecture is designed to be simple and pure, demonstrating that a streamlined autoregressive model can be constructed without sacrificing continuity [4][26] Key Discoveries - The team identified that the Transformer acts as the main creator, while the Flow Matching Head serves as an efficient sampler, with minimal impact on image quality from the size of the Flow Matching Head [12] - Two critical techniques were discovered for stability and quality: channel-wise normalization to stabilize token statistics and the counterintuitive finding that adding more noise during training can enhance image quality [14][16] Performance Evaluation - NextStep-1 has been rigorously evaluated against industry benchmarks, achieving competitive results with state-of-the-art diffusion models [21][22] - The model's performance metrics include GenEval scores of 0.63/0.737 and DPG-Bench scores of 85.28, indicating its strong capabilities in image generation [21][22] Limitations and Future Directions - The model faces challenges related to stability during generation, particularly when expanding the latent space dimensions, which can lead to occasional failures [27][29] - The autoregressive nature of the model introduces latency issues, particularly in sequential decoding, which affects overall performance [28] - Future work will focus on optimizing the Flow Matching Head, accelerating the autoregressive backbone, and improving convergence efficiency, especially in high-resolution image generation [34][35]
Lumina-mGPT 2.0:自回归模型华丽复兴,媲美顶尖扩散模型
机器之心· 2025-08-12 00:15
Core Viewpoint - Lumina-mGPT 2.0 is an innovative stand-alone autoregressive image model that integrates various tasks such as text-to-image generation, subject-driven generation, and controllable generation, showcasing significant advancements in image generation technology [5][9][21]. Group 1: Core Technology and Breakthroughs - Lumina-mGPT 2.0 employs a fully independent training architecture, utilizing a pure decoder Transformer model, which allows for two parameter versions (2 billion and 7 billion) and avoids biases from pre-trained models [4][5]. - The model incorporates a high-quality image tokenizer, SBER-MoVQGAN, which was selected based on its optimal reconstruction quality on the MS-COCO dataset [7]. - A unified multi-task processing framework is introduced, enabling seamless support for various tasks including text-to-image generation and image editing [9]. Group 2: Efficient Inference Strategies - The model introduces two optimizations to enhance generation speed while maintaining quality, including model quantization to 4-bit integers and a sampling method that reduces GPU memory consumption by 60% [11][13]. - The optimizations allow for parallel decoding, significantly accelerating the generation process [13]. Group 3: Experimental Results - In text-to-image generation benchmarks, Lumina-mGPT 2.0 achieved a GenEval score of 0.80, ranking it among the top generative models, particularly excelling in tests involving "two objects" and "color attributes" [14][15]. - The model demonstrated superior performance in the Graph200K multi-task benchmark, confirming the feasibility of a pure autoregressive model for multi-modal generation tasks [17]. Group 4: Future Directions - Despite optimizations, Lumina-mGPT 2.0 still faces challenges with sampling time, which affects user experience, indicating a need for further enhancements [21]. - The focus will expand from multi-modal generation to include multi-modal understanding, aiming to improve overall functionality and performance [21].
自回归模型杀回图像生成!实现像素级精准控制,比Diffusion更高效可控
量子位· 2025-07-29 05:05
Core Viewpoint - The article discusses the limitations of Diffusion models in AI image generation, particularly in precise control, and introduces a new framework called MENTOR, which utilizes Autoregressive (AR) models for more efficient and controllable multimodal image generation [1][2][3]. Group 1: Challenges in Current Models - Diffusion models face challenges in precise visual control, balancing multimodal inputs, and high training costs [2][6]. - The inherent randomness of Diffusion models makes it difficult to achieve precise control in high-fidelity tasks like image reconstruction [6]. - Existing methods often exhibit modality imbalance, over-relying on either reference images or text instructions [6]. Group 2: Introduction of MENTOR - MENTOR is a novel AR framework that requires only one-tenth of the training data and suboptimal model components to outperform Diffusion methods like Emu2 and DreamEngine [2][3]. - The framework employs a unique two-stage training method to enable efficient multimodal image generation with pixel-level precision [3][8]. Group 3: MENTOR's Design and Training - MENTOR features a unified AR architecture consisting of a multimodal encoder and an autoregressive generator, allowing for token-level alignment between inputs and outputs [9]. - The two-stage training strategy includes: 1. Multimodal Alignment Pretraining: Focuses on understanding different input types and establishing pixel-level and semantic alignment [10]. 2. Multimodal Instruction Tuning: Enhances the model's ability to follow instructions and reason across modalities [12]. Group 4: Performance and Efficiency - MENTOR achieved competitive performance on DreamBench++, surpassing larger models like Emu2 (37 billion parameters) and DreamEngine (10.5 billion parameters) while maintaining a lower CP/PF ratio, indicating better balance between visual feature preservation and prompt following [15][17]. - The training process for MENTOR utilized approximately 3 million image-text pairs over 1.5 days, demonstrating significant efficiency compared to other baseline methods [18]. Group 5: Applications and Future Potential - MENTOR's framework is highly versatile, capable of handling various complex multimodal generation tasks with minimal adjustments [24]. - The article concludes that MENTOR opens a new path for controllable image generation tasks, showcasing the potential of AR models in visual generation, while acknowledging that there are still areas where it lags behind top-tier Diffusion models [26].
五倍推理加速,激发自回归潜能,苹果新工作让LLM预测未来
机器之心· 2025-07-24 04:08
Core Viewpoint - The article discusses the advancements in language models, particularly focusing on a new framework developed by Apple researchers that allows autoregressive models to perform multi-token predictions, significantly improving inference speed while maintaining generation quality [7][8][9]. Group 1: Advances in Language Models - Recent progress in language models is attributed to the availability of large-scale text data and the effectiveness of autoregressive training methods [2]. - Autoregressive models predict each token based on preceding context, which provides a clear advantage during training but incurs high computational costs during inference due to sequential execution [5][6]. Group 2: New Framework Development - Apple researchers have developed a framework that enables pre-trained autoregressive language models to execute multi-token predictions, achieving up to 5.35 times speedup for code and math tasks, and approximately 2.5 times for general tasks [7]. - This innovation allows for a significant reduction in AI operational costs and the potential for powerful real-time assistants to run smoothly on lightweight devices [9]. Group 3: Research Findings - The researchers confirmed that language models can generate multiple tokens in a single inference step, which is a promising development for speeding up generation processes [11]. - The study explored whether it is possible to train truly non-autoregressive language models, leading to the design of a training algorithm that minimally alters existing autoregressive frameworks while achieving efficient multi-token generation [13][14]. Group 4: Experimental Results - Experiments conducted on the Tulu3-8B model demonstrated that the proposed multi-token generation algorithm achieved speedups ranging from approximately 1.5 to 5.2 times across various tasks, with the most significant improvements observed in programming and math tasks [46]. - The introduction of mask tokens and a lightweight sampling module allowed the model to leverage its full depth and representational capabilities, resulting in superior performance compared to existing multi-token prediction methods [23][24]. Group 5: Future Directions - Future research could explore the applicability of this method during pre-training or downstream task adaptation phases to further assess its effectiveness [53]. - Another promising direction is the application of diffusion-based generation methods to multi-token prediction tasks, aiming to balance efficiency and quality [53].