Autoregressive Models
Layer Counts in Small Models Are Oddly Finicky: 12/32/64 Layers Work Well, 16/24/48 Work Poorly
量子位· 2026-01-11 04:02
Yishui | QbitAI. The "secrets" of small models have now been laid bare. The author of the well-known open-source project OpenEvolve has just published a long post revealing several important findings about 70M-parameter small models. First, architecture matters far less than most people assume; by comparison, a model's "shape" (its depth-to-width ratio) matters more. Second, layer count in small models is oddly finicky: 12, 32, and 64 layers work well while 16, 24, and 48 work poorly, with 32 layers being optimal. The author also identified the reason behind this pattern: whether the hidden dimension is at least 512. These conclusions quickly sparked discussion in the community, with readers engaging the author directly. Before the details, a brief introduction to the author, Asankhaya Sharma. His best-known achievements include: 1) while most attention was still focused on model scale, parameter counts, and training methods, he was an early advocate of "inference-time compute" for large language models, publishing a sole-author paper detailing the idea; and 2) open-sourcing a series of well-known projects including OptiLLM, OpenEvolve, and Adaptive Classifier.
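The depth-vs-width trade-off behind the finding can be sanity-checked with back-of-envelope arithmetic: at a fixed parameter budget, deeper models force a narrower hidden dimension. A minimal sketch, assuming the common `12·L·d²` block-parameter approximation, tied embeddings with a GPT-2-style vocabulary of 50,257, and hidden sizes rounded down to multiples of 64; the post's exact configuration may differ:

```python
import math

def hidden_dim_for_budget(n_params, n_layers, vocab=50257, multiple=64):
    """Largest hidden dim d (rounded down to `multiple`) such that
    vocab*d + 12*n_layers*d**2 stays within the parameter budget."""
    a, b = 12 * n_layers, vocab
    d = (-b + math.sqrt(b * b + 4 * a * n_params)) / (2 * a)
    return int(d // multiple) * multiple

def approx_params(d, n_layers, vocab=50257):
    # tied embeddings + standard attention/MLP blocks
    return vocab * d + 12 * n_layers * d * d

for layers in (12, 16, 24, 32, 48, 64):
    d = hidden_dim_for_budget(70_000_000, layers)
    print(f"{layers:2d} layers -> hidden dim {d}")
```

Under these assumptions the hidden dimension shrinks roughly as 1/√L as depth grows, which is how layer count trades off against the ≥512 width threshold the post identifies.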
VLA-Arena: An Open-Source Benchmark Framework for Systematically Evaluating VLAs
具身智能之心· 2025-12-31 00:50
Author: Borong Zhang et al. I. Background and Motivation: Vision-Language-Action models (VLAs) are rapidly evolving toward general-purpose robot policies, already demonstrating capabilities such as cross-embodiment generalization, dexterous manipulation, and instruction following. But there is still no quantitative understanding of these models' capability boundaries, limitations, and failure modes, and existing benchmarks suffer from three core deficiencies. To address this, VLA-Arena is proposed as a comprehensive, reproducible benchmark framework whose goal is to precisely characterize the capability frontier and failure mechanisms of VLA models through systematic design. II. Core Design: Structured Tasks and the Benchmark Framework. 2.2 Task Suite Design: the benchmark contains 170 tasks grouped by core challenge into four dimensions, with tasks in each dimension spanning difficulty levels L0-L2. III. Key Components and Technical Details. 3.1 CBDDL Language Extension: the Constrained Behavior Domain Definition Language (CBDDL) extends BDDL (Behavior Domain Definition Language) with two core enhancements ...
Skipping "Token-by-Token Generation": Ant Group's Zhao Junbo Says Diffusion Models Let Us Edit Tokens Directly
36Kr· 2025-12-12 07:17
Core Insights - The main focus of the news is on the emerging diffusion architecture for language models, which offers advantages over traditional autoregressive models in terms of speed and computational efficiency [1][4][20]. Group 1: Diffusion Architecture Advantages - Diffusion architecture allows for direct modification and control of tokens during inference, eliminating the need to regenerate entire segments of content as required by autoregressive models [1][5]. - The newly released LLaDA 2.0 model has achieved a scale of 100 billion parameters, marking a significant milestone in the development of diffusion language models [1][20]. - Diffusion models are described as "data-hungry," requiring larger datasets for training compared to autoregressive models, but they can absorb data more quickly [5][8]. Group 2: Technical Developments - The LLaDA model employs a "fill-in-the-blank" prediction method, which contrasts with the sequential token generation of autoregressive models [6][8]. - The architecture includes both global and causal attention mechanisms to enhance computational efficiency and maintain coherence in generated sequences [16]. - The research team has made significant strides in addressing architectural challenges, including the integration of mixture of experts (MoE) within the diffusion framework [19]. Group 3: Industry Impact and Future Directions - Major tech companies, including Google and ByteDance, are actively exploring diffusion models, indicating a growing interest in this technology [1][19]. - The development of a new inference engine, dInfer, is expected to enhance the performance of diffusion models, with potential for significant speed improvements in key applications [24][25]. - The community is encouraged to collaborate in building the ecosystem around diffusion language models, which are still in the early stages of development [27].
Skipping "Token-by-Token Generation"! Ant Group's Zhao Junbo: Diffusion Models Let Us Edit Tokens Directly | MEET2026
量子位· 2025-12-12 03:00
Core Viewpoint - The article discusses the shift from autoregressive models to diffusion architecture in language models, highlighting the potential for faster generation speeds and lower computational costs with diffusion models [2][8]. Group 1: Diffusion Architecture Insights - Diffusion architecture allows for direct modification and control of tokens during inference, unlike autoregressive models that require re-generating entire segments [2][15]. - The recent release of LLaDA 2.0 marks a significant milestone, achieving a scale of 100 billion parameters for diffusion language models [4][44]. - The development of diffusion models is still in its early stages, but it has attracted attention from major companies like Google and ByteDance, as well as several startups [5][41]. Group 2: Technical Aspects and Comparisons - Diffusion models operate on a "fill-in-the-blank" mechanism rather than a sequential token generation, which can lead to more efficient data utilization [12][21]. - In terms of parameter efficiency, diffusion models can achieve similar performance with fewer parameters compared to autoregressive models under the same computational constraints [15][23]. - The unique characteristics of diffusion models allow for continuous training, unlike autoregressive models that plateau after several epochs [24][26]. Group 3: Future Directions and Community Engagement - The article emphasizes the need for further exploration of the scaling laws specific to diffusion language models, which differ from those of autoregressive models [56]. - The community is encouraged to participate in the development and optimization of diffusion models, as the ecosystem is still in its infancy [56]. - Upcoming collaborations and API releases are planned to enhance accessibility and integration of diffusion models into various applications [51].
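The "fill-in-the-blank" decoding both pieces describe can be sketched as a toy loop: start from an all-mask sequence, predict every masked slot at once, and commit only the most confident predictions each pass. Everything here (the stand-in denoiser, its tiny vocabulary, and the random confidence scores) is illustrative; the real model is a bidirectional transformer:

```python
import random

MASK = "<m>"

def toy_denoiser(seq):
    """Stand-in for the model: propose a (token, confidence) pair for
    every masked slot, all positions predicted in parallel."""
    vocab = ["the", "cat", "sat", "on", "mat"]
    return {i: (random.choice(vocab), random.random())
            for i, tok in enumerate(seq) if tok == MASK}

def diffusion_decode(length, tokens_per_step=2):
    """Iterative unmasking: commit the most confident predictions each
    pass until no masks remain."""
    seq = [MASK] * length
    steps = 0
    while MASK in seq:
        preds = toy_denoiser(seq)
        best = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in best[:tokens_per_step]:
            seq[i] = tok
        steps += 1
    return seq, steps
```

With `tokens_per_step` above 1 the loop finishes in fewer passes than the sequence length, which is the source of the speedup over token-by-token decoding; some diffusion variants also revise already-committed tokens, which this toy omits.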
Breaking | Stanford Professor's Startup Inception Raises $50M Seed Round, Using Diffusion Models to Unlock Real-Time AI Applications
Z Potentials· 2025-11-07 02:12
Core Insights - The article discusses the current surge of funding into AI startups, highlighting it as a golden period for AI researchers to validate their ideas [1] - Inception, a startup developing diffusion-based AI models, recently raised $50 million in seed funding, led by Menlo Ventures with participation from several notable investors [2] Company Overview - Inception is focused on developing diffusion models, which generate outputs through iterative optimization rather than sequential generation [3] - The project leader, Stefano Ermon, has been researching diffusion models prior to the recent AI boom and aims to apply these models to a broader range of tasks [3] Technology and Innovation - Inception has released a new version of its Mercury model, designed specifically for software development, which has been integrated into various development tools [3] - Ermon claims that diffusion-based models will significantly optimize two critical metrics: latency and computational cost, stating that these models are faster and more efficient than those built by other companies [3][5] - Diffusion models differ structurally from autoregressive models, which dominate text-based AI services, and are believed to perform better when handling large volumes of text or data limitations [5] Performance Metrics - The diffusion models exhibit greater flexibility in hardware utilization, which is increasingly important as AI's infrastructure demands grow [5] - Ermon's benchmarks indicate that the models can process over 1,000 tokens per second, surpassing the capabilities of existing autoregressive technologies due to their inherent support for parallel processing [5]
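The claimed speed advantage follows from simple arithmetic: an autoregressive decoder pays one forward pass per token, while a parallel decoder amortizes each pass over many tokens. A back-of-envelope sketch with hypothetical numbers (the 20 ms step time and 32 tokens per pass below are illustrative, not Inception's figures):

```python
import math

def sequential_latency_ms(n_tokens, step_ms):
    # autoregressive: one forward pass per generated token
    return n_tokens * step_ms

def parallel_latency_ms(n_tokens, step_ms, tokens_per_pass):
    # diffusion-style: each denoising pass commits many tokens at once
    return math.ceil(n_tokens / tokens_per_pass) * step_ms
```

At an illustrative 20 ms per forward pass, 1,000 tokens take 20 s sequentially but 0.64 s at 32 tokens per pass (about 1,560 tokens/s), which is the shape of the ">1,000 tokens per second" claim; actual step times differ between architectures and hardware.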
Shanghai AI Lab Releases SDAR, a Hybrid Diffusion Language Model: the First Open-Source Diffusion LM to Break 6,600 tgs
机器之心· 2025-11-01 04:22
Core Insights - The article introduces a new paradigm called SDAR (Synergistic Diffusion-AutoRegression) that addresses the slow inference speed and high costs associated with large model applications, which are primarily due to the serial nature of autoregressive (AR) models [2][3][4]. Group 1: SDAR Paradigm - SDAR effectively decouples training and inference, combining the high performance of AR models with the parallel inference advantages of diffusion models, allowing for low-cost transformation of any AR model into a parallel decoding model [4][11]. - Experimental results show that SDAR not only matches but often surpasses the performance of original AR models across multiple benchmarks, achieving up to a 12.3 percentage point advantage in complex scientific reasoning tasks [6][28]. Group 2: Performance and Efficiency - SDAR maintains the performance of AR models while significantly improving inference speed and reducing costs, demonstrating that larger models benefit more from parallelization without sacrificing performance [17][19]. - The research indicates that SDAR can be adapted to any mainstream AR model at a low cost, achieving comparable or superior performance in downstream tasks [19][29]. Group 3: Experimental Validation - The study conducted rigorous experiments to compare SDAR's performance with AR models, confirming that SDAR can achieve substantial speed improvements in real-world applications, with SDAR-8B-chat showing a 2.3 times acceleration over its AR counterpart [23][20]. - The results highlight that SDAR's unique generation mechanism does not compromise its complex reasoning capabilities, retaining long-chain reasoning abilities and excelling in tasks requiring understanding of structured information [28][29]. 
Group 4: Future Implications - SDAR represents a significant advancement in the field of large models, providing a powerful and flexible tool that lowers application barriers and opens new avenues for exploring higher performance and efficiency in AI reasoning paradigms [29][31].
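The decoupling SDAR describes can be sketched as blockwise decoding: blocks are emitted left to right in autoregressive order, while slots inside each block are filled by iterative denoising conditioned on the committed prefix. This toy (with a stand-in denoiser that commits every slot in one pass) only illustrates the control flow, not SDAR's actual training recipe or model:

```python
def sdar_decode(prompt, n_blocks, block_size, denoise_block):
    """Blockwise decoding: blocks are produced left-to-right (AR order),
    but slots inside a block are filled in parallel by iterative
    denoising conditioned on everything already committed."""
    seq = list(prompt)
    for _ in range(n_blocks):
        block = [None] * block_size          # None = still masked
        while None in block:
            block = denoise_block(seq, block)
        seq.extend(block)                    # commit the finished block
    return seq

def toy_denoiser(prefix, block):
    # stand-in model: fill every remaining slot in a single pass
    return [tok if tok is not None else "x" for tok in block]
```

The outer loop preserves the causal, left-to-right structure of an AR model; the inner loop is where the parallel speedup (e.g. the reported 2.3x for SDAR-8B-chat) comes from.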
Another Path for Visual Generation: Principles and Practice of the Infinity Autoregressive Architecture
AI前线· 2025-10-31 05:42
Core Insights - The article discusses the significant advancements in visual autoregressive models, particularly highlighting the potential of these models in the context of AI-generated content (AIGC) and their competitive edge against diffusion models [2][4][11]. Group 1: Visual Autoregressive Models - Visual autoregressive models (VAR) utilize a "coarse-to-fine" approach, starting with low-resolution images and progressively refining them to high-resolution outputs, which aligns more closely with human visual perception [12][18]. - The VAR model architecture includes an improved VQ-VAE that employs a hierarchical structure, allowing for efficient encoding and reconstruction of images while minimizing token usage [15][30]. - VAR has demonstrated superior image generation quality compared to existing models like DiT, showcasing a robust scaling curve that indicates performance improvements with increased model size and computational resources [18][49]. Group 2: Comparison with Diffusion Models - Diffusion models operate by adding Gaussian noise to images and then training a network to reverse this process, maintaining the original resolution throughout [21][25]. - The key advantages of VAR over diffusion models include higher training parallelism and a more intuitive process that mimics human visual cognition, although diffusion models can correct errors through iterative refinement [27][29]. - VAR's approach allows for faster inference times, with the Infinity model achieving significant speed improvements over comparable diffusion models [46][49]. Group 3: Innovations in Tokenization and Error Correction - The Infinity framework introduces a novel "bitwise tokenizer" that enhances reconstruction quality while allowing for a larger vocabulary size, thus improving detail and instruction adherence in generated images [31][41]. 
- A self-correction mechanism is integrated into the training process, enabling the model to learn from previous errors and significantly reducing cumulative error during inference [35][40]. - The findings indicate that larger models benefit from larger vocabularies, reinforcing the reliability of scaling laws in model performance [41][49].
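The coarse-to-fine idea behind VAR-style generation can be illustrated with a 1-D toy: each scale encodes a downsampled view of whatever the previous scales have not yet explained, and summing the upsampled scales reconstructs the signal. Real VAR/Infinity quantizes each scale against a codebook (per-bit, in Infinity's tokenizer); this sketch skips quantization to keep the residual structure visible:

```python
def multiscale_encode(signal, scales):
    """Toy 1-D coarse-to-fine residual coding: each scale stores an
    averaged view of what earlier scales have not yet explained."""
    residual = list(signal)
    levels = []
    for s in scales:                        # e.g. 1, then 2, then 4 buckets
        step = len(signal) // s
        coarse = [sum(residual[i*step:(i+1)*step]) / step for i in range(s)]
        levels.append(coarse)
        for i in range(s):                  # subtract upsampled coarse view
            for j in range(i*step, (i+1)*step):
                residual[j] -= coarse[i]
    return levels, residual

def multiscale_decode(levels, length):
    out = [0.0] * length
    for coarse in levels:                   # sum the upsampled scales
        step = length // len(coarse)
        for i, c in enumerate(coarse):
            for j in range(i*step, (i+1)*step):
                out[j] += c
    return out
```

Because each scale only carries the residual of the previous ones, early (coarse) scales can be predicted first and refined later, which is the "coarse-to-fine" generation order the article contrasts with fixed-resolution diffusion.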
A Look Across 300+ Papers at How VLA Is Applied and Implemented in Different Scenarios
具身智能之心· 2025-09-25 04:00
Core Insights - The article discusses the emergence of Vision Language Action (VLA) models, marking a shift in robotics from traditional strategy-based control to a more generalized robotic technology paradigm, enabling active decision-making in complex environments [2][5][20] - It emphasizes the integration of large language models (LLMs) and vision-language models (VLMs) to enhance robotic operations, providing greater flexibility and precision in task execution [6][12] - The survey outlines a clear classification system for VLA methods, categorizing them into autoregressive, diffusion, reinforcement learning, hybrid, and specialized methods, while also addressing the unique contributions and challenges within each category [7][10][22] Group 1: VLA Model Overview - VLA models represent a significant advancement in robotics, allowing for the unification of perception, language understanding, and executable control within a single modeling framework [15][20] - The article categorizes VLA methods into five paradigms: autoregressive, diffusion, reinforcement learning, hybrid, and specialized, detailing their design motivations and core strategies [10][22][23] - The integration of LLMs into VLA systems transforms them from passive input parsers to semantic intermediaries, enhancing their ability to handle long and complex tasks [29][30] Group 2: Applications and Challenges - VLA models have practical applications across various robotic forms, including robotic arms, quadrupeds, humanoid robots, and autonomous vehicles, showcasing their deployment in diverse scenarios [8][20] - The article identifies key challenges in the VLA field, such as data limitations, reasoning speed, and safety concerns, which need to be addressed to accelerate the development of VLA models and general robotic technology [8][19][20] - The reliance on high-quality datasets and simulation platforms is crucial for the effective training and evaluation of VLA models, addressing issues of data scarcity and real-world testing risks [16][19]
Group 3: Future Directions - The survey outlines future research directions for VLA, including addressing data limitations, enhancing reasoning speed, and improving safety measures to facilitate the advancement of general embodied intelligence [8][20][21] - It highlights the importance of developing scalable and efficient VLA models that can adapt to various tasks and environments, emphasizing the need for ongoing innovation in this rapidly evolving field [20][39] - The article concludes by underscoring the potential of VLA models to bridge the gap between perception, understanding, and action, positioning them as a key frontier in embodied artificial intelligence [20][21][39]
In-Depth Survey | 300+ Papers Show How Pure Vision Pushes VLA to the Forefront of Autonomous Driving and Embodied Intelligence!
自动驾驶之心· 2025-09-24 23:33
Core Insights - The emergence of Vision Language Action (VLA) models signifies a paradigm shift in robotics from traditional strategy-based control to general-purpose robotic technology, transforming Vision Language Models (VLMs) from passive sequence generators to active agents capable of executing operations and making decisions in complex, dynamic environments [1][5][11] Summary by Sections Introduction - Robotics has historically relied on pre-programmed instructions and control strategies for task execution, primarily in simple, repetitive tasks [5] - Recent advancements in AI and deep learning have enabled the integration of perception, detection, tracking, and localization technologies, leading to the development of embodied intelligence and autonomous driving [5] - Current robots often operate as "isolated agents," lacking effective interaction with humans and external environments, prompting researchers to explore the integration of Large Language Models (LLMs) and VLMs for more precise and flexible robotic operations [5][6] Background - The development of VLA models marks a significant step towards general embodied intelligence, unifying visual perception, language understanding, and executable control within a single modeling framework [11][16] - The evolution of VLA models is supported by breakthroughs in single-modal foundational models across computer vision, natural language processing, and reinforcement learning [13][16] VLA Models Overview - VLA models have rapidly developed due to advancements in multi-modal representation learning, generative modeling, and reinforcement learning [24] - The core design of VLA models includes the integration of visual encoding, LLM reasoning, and decision-making frameworks, aiming to bridge the gap between perception, understanding, and action [23][24] VLA Methodologies - VLA methods are categorized into five paradigms: autoregressive, diffusion models, reinforcement learning, hybrid methods, and specialized approaches, each with distinct design motivations and core strategies [6][24]
- Autoregressive models focus on sequential generation of actions based on historical context and task instructions, demonstrating scalability and robustness [26][28] Applications and Resources - VLA models are applicable in various robotic domains, including robotic arms, quadrupedal robots, humanoid robots, and wheeled robots (autonomous vehicles) [7] - The development of VLA models heavily relies on high-quality datasets and simulation platforms to address challenges related to data scarcity and high risks in real-world testing [17][21] Challenges and Future Directions - Key challenges in the VLA field include data limitations, reasoning speed, and safety concerns, which need to be addressed to accelerate the development of VLA models and general robotic technologies [7][18] - Future research directions are outlined to enhance the capabilities of VLA models, focusing on improving data diversity, enhancing reasoning mechanisms, and ensuring safety in real-world applications [7][18] Conclusion - The review emphasizes the need for a clear classification system for pure VLA methods, highlighting the significant features and innovations of each category, and providing insights into the resources necessary for training and evaluating VLA models [9][24]
Diffusion Language Models Get an MoE Version: Ant Group & Renmin University Train LLaDA-MoE from Scratch, Full Open-Source Release Coming
机器之心· 2025-09-12 11:31
Core Viewpoint - The article discusses the development of the LLaDA-MoE model, the first native MoE architecture diffusion language model trained from scratch, which demonstrates significant performance and efficiency advantages over traditional autoregressive models [2][15][18]. Group 1: Model Development and Performance - The LLaDA-MoE model was trained on 20 terabytes of data and features 1.4 billion active parameters, achieving performance comparable to denser autoregressive models like Qwen2.5-3B while maintaining faster inference speeds [15][17][29]. - The LLaDA series has rapidly evolved, with LLaDA-MoE being a notable milestone, surpassing previous models like LLaDA1.0/1.5 and Dream-7B in various benchmark tests [13][18][29]. - The model's architecture allows for significant scaling potential, with plans to explore higher sparsity ratios and larger MoE diffusion language models [29][40]. Group 2: Technical Innovations and Advantages - The diffusion model approach allows for parallel decoding, bidirectional modeling, and iterative correction, addressing limitations of autoregressive models such as serial bottlenecks and lack of error correction capabilities [38][40]. - Evidence suggests that diffusion language models can achieve better learning outcomes than autoregressive models, particularly in scenarios with limited data, demonstrating a data utilization efficiency that can exceed three times that of autoregressive models [40][41]. - The training framework and infrastructure developed by Ant Group, including the ATorch framework, supports the efficient training of large-scale MoE models [25][26]. Group 3: Strategic Vision and Future Directions - The development of LLaDA-MoE reflects a strategic choice to explore high-potential areas in AI, moving beyond established paths to enhance the limits of intelligence [44][47]. 
- Ant Group's commitment to innovation is evident in its previous projects and ongoing research in areas like dynamic MoE architectures and hybrid linear architectures, all aimed at achieving general artificial intelligence (AGI) [45][46][47].
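The gap between LLaDA-MoE's 1.4 billion "active" parameters and its much larger total comes from sparse routing: each input runs through only its top-k experts. A minimal sketch of the common top-k softmax gating recipe (LLaDA-MoE's exact router design is not described in the article and may differ):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """Sparse MoE layer: route the input to its top-k experts and mix
    their outputs by renormalized gate probabilities. Only k experts
    run per token, so active parameters stay far below total parameters."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    probs = softmax(logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        out = [o + (probs[i] / norm) * y_j for o, y_j in zip(out, y)]
    return out, top
```

Because only the selected experts execute, compute per token scales with k rather than with the expert count, which is how an MoE diffusion model can match a denser model like Qwen2.5-3B at lower inference cost.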