Mixture of Experts (MoE)

Cracking the MoE dilemma of "bigger scale, lower efficiency": the Institute of Automation, Chinese Academy of Sciences proposes a new framework
量子位· 2025-10-11 01:15
Contributed by the team at the Institute of Automation, Chinese Academy of Sciences (CASIA) | QbitAI

Large-model parameter counts have soared to the hundreds of billions and even trillions, yet the models are stuck in a "bigger scale, lower efficiency" dilemma. New research from CASIA offers a way out: for the first time, MoE experts leave "static isolation" behind and begin dynamic "team learning".

As LLM parameter counts keep expanding, a core challenge has emerged: model scaling and computational efficiency are hard to improve together. The Mixture of Experts (MoE) model, as a sparsely activated architecture, offers a theoretically attractive path to continued scaling. Concretely, MoE is the main route by which large language models (LLMs) grow their parameter counts while compute cost rises only linearly, yet it has long been trapped in a "trilemma" of load imbalance, parameter redundancy, and communication overhead, which has become a major bottleneck for real-world deployment.

By dynamically regrouping expert clusters, the CASIA team cut the model's total parameter count by 80%, reduced load variance to one third of its original value, and brought memory consumption close to that of a lightweight dense model, simultaneously optimizing communication latency, load balance, and memory footprint and opening a new path for low-cost deployment of large-parameter LLMs.

The details: a unified framework aimed directly at how MoE operates under the hood. For example, the load-balancing loss function is a passive compensation mechanism, and parameter-compression techniques (such as MoE-Lite), while reducing parameters, treat experts as independent entities and ignore their ...
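The CASIA framework itself is only sketched above, but the "passive compensation mechanism" it criticizes, the standard Switch-Transformer-style auxiliary load-balancing loss, and the load variance it reports reducing are easy to illustrate. The snippet below is a minimal NumPy sketch of that baseline, not the proposed method; the function name, shapes, and top-k choice are illustrative assumptions.

```python
import numpy as np

def load_balance_stats(router_logits: np.ndarray, top_k: int = 1):
    """Switch-style auxiliary load-balancing loss and per-expert load variance.

    router_logits: [num_tokens, num_experts] raw gate scores.
    Returns (aux_loss, load_variance).
    """
    num_tokens, num_experts = router_logits.shape
    # Softmax routing probabilities per token.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Hard top-k assignment (top-1 recovers the classic Switch loss).
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]
    counts = np.bincount(chosen.ravel(), minlength=num_experts)
    f = counts / (num_tokens * top_k)             # fraction of tokens routed to each expert
    p = probs.mean(axis=0)                        # mean routing probability per expert
    aux_loss = num_experts * float(np.dot(f, p))  # minimized when load is uniform
    load_variance = float(counts.var())           # the "load variance" the article reports shrinking
    return aux_loss, load_variance

rng = np.random.default_rng(0)
print(load_balance_stats(rng.normal(size=(4096, 64))))
```

Minimizing `aux_loss` merely nudges the router toward a uniform expert distribution after the fact, which is what makes it "passive"; `load_variance` is the quantity the article says the new framework cuts to a third.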
Are both China and the US ultimately headed for the AI era?
Sou Hu Cai Jing· 2025-10-08 20:55
The development trajectories of China and the United States clearly point toward an AI era; this is an inevitable outcome of technological iteration and industrial upgrading, but the two countries' paths and emphases differ significantly.

Technology landscape. Foundational innovation: the US keeps its edge in basic algorithms, large-model architectures (such as the original BERT framework), and core patents, and its research ecosystem puts more weight on fundamental breakthroughs. Application deployment: China, drawing on its huge user base, its mobile-internet foundations (mobile payments, e-commerce), and industry-chain coordination, is moving faster on scenario-driven applications (AI agents, multimodal interaction), and in some areas the user experience already surpasses that of the US. For example, the Yuanbao AI assistant is seamlessly integrated into the WeChat social ecosystem, the Doubao model's reasoning capability sits in the global first tier, and agent technology is pushing past scenario boundaries and accelerating industry automation.

Industrial ecosystem and policy drivers. US strategy: reinforce technological dominance through export controls, standard setting, and alliances to contain competitors; the 2025 policy agenda favors deregulation and open source to consolidate "golden age" leadership, but with a clear political slant (praising Trump-era policy while criticizing Biden-era regulation). China's path: build on its manufacturing base and data scale and focus on integrating "AI + the real economy". Zhang Yaqin argues that China will become the world's largest AI application market within five years, driven mainly by the continuity of its mature mobile ecosystem and industry-chain synergies.

Core differences and future competition (comparison truncated): Dimension | US | China; Innovation focus | basic theory, general-purpose large models | scenario applications, engineering ...
Breaking through the AGI fog, Ant Group sees a new signpost
雷峰网· 2025-09-16 10:20
Core Viewpoint - The article discusses the current state of large language models (LLMs) and the challenges they face in achieving Artificial General Intelligence (AGI), emphasizing the need for new paradigms beyond the existing autoregressive (AR) models [4][10][18].

Group 1: Current Challenges in AI Models
- Ilya Sutskever, a prominent AI researcher, warns that data extraction has reached its limits, hindering progress towards AGI [2][4].
- Existing LLMs often exhibit significant performance discrepancies, with some capable of outperforming human experts while others struggle with basic tasks [13][15].
- The autoregressive model's limitations include a lack of bidirectional modeling and the inability to correct errors during generation, leading to fundamental misunderstandings in tasks like translation and medical diagnosis [26][27][18].

Group 2: New Directions in AI Research
- Elon Musk proposes a "purified data" approach to rewrite human knowledge as a potential pathway to AGI [5].
- Researchers are exploring multimodal approaches, with experts like Fei-Fei Li emphasizing the importance of visual understanding as a cornerstone of intelligence [8].
- A new paradigm, the diffusion model, is being introduced by young scholars; it contrasts with the traditional autoregressive approach by allowing parallel decoding and iterative correction [12][28].

Group 3: Development of LLaDA-MoE
- The LLaDA-MoE model, based on diffusion theory, was announced as a significant advancement in the field, showcasing a new approach to language modeling [12][66].
- LLaDA-MoE has a total parameter count of 7 billion, with 1.4 billion activated parameters, and has been trained on approximately 20 trillion tokens, demonstrating its scalability and stability [66][67].
- The model's performance in benchmark tests indicates that it can compete with existing autoregressive models, suggesting a viable alternative path for future AI development [67][71].

Group 4: Future Prospects and Community Involvement
- The development of LLaDA-MoE represents a milestone in the exploration of diffusion models, with plans for further scaling and improvement [72][74].
- The team emphasizes the importance of community collaboration in advancing diffusion-model research, similar to the way autoregressive models developed [74][79].
- Ant Group's commitment to investing in AGI research reflects a strategic shift towards exploring innovative and potentially high-risk areas in AI [79].
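As a rough intuition for the "parallel decoding and iterative correction" that distinguishes diffusion-style language models from autoregressive ones, here is a minimal toy loop: every masked position is predicted in parallel, only the most confident predictions are kept, and the rest are re-masked for the next round. `toy_denoiser` is a random stand-in for a trained model, and nothing here reflects LLaDA-MoE's actual implementation.

```python
import numpy as np

MASK = -1  # sentinel id for a masked position

def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the model: returns (predicted_ids, confidences) for every position."""
    preds = rng.integers(0, vocab_size, size=tokens.shape)
    conf = rng.random(size=tokens.shape)
    return preds, conf

def diffusion_decode(seq_len=16, vocab_size=100, steps=4, seed=0):
    """Iterative parallel decoding: predict all masked positions at once,
    keep the most confident fraction each step, re-mask and retry the rest."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        masked = tokens == MASK
        if not masked.any():
            break
        preds, conf = toy_denoiser(tokens, vocab_size, rng)  # parallel prediction
        # Unmask a growing fraction of the masked positions, highest confidence first.
        keep = int(np.ceil(masked.sum() * (step + 1) / steps))
        order = np.argsort(-np.where(masked, conf, -np.inf))[:keep]
        tokens[order] = preds[order]
    return tokens

print(diffusion_decode())
```

A real masked-diffusion LM would replace `toy_denoiser` with a transformer trained to denoise partially masked sequences; the decoding skeleton stays broadly the same.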
ByteDance: 2025 technical report on the Seed-Thinking-v1.5 reasoning model
Sou Hu Cai Jing· 2025-08-22 09:20
Core Insights
- ByteDance has introduced Seed1.5-Thinking, a state-of-the-art reasoning model with 20 billion activated parameters and a total of 200 billion parameters, demonstrating exceptional reasoning capabilities across various benchmarks [1][5][60]
- The model achieved scores of 86.7 on AIME 2024, 55.0 on Codeforces, and 77.3 on GPQA, showcasing its strengths in STEM and coding tasks while also exhibiting strong generalization abilities in non-reasoning tasks [1][5][49]

Model Performance
- Seed1.5-Thinking matches OpenAI's o3-mini-high on AIME 2024 but still lags behind on the AIME 2025 and BeyondAIME challenges [2][49]
- On the GPQA task, Seed1.5-Thinking's performance is close to o3-level, achieving a score of 77.3% [49]
- The model outperforms DeepSeek R1 by 8% in overall user preference on non-reasoning tasks, indicating its broader applicability [1][5][51]

Development Aspects
- The development of Seed1.5-Thinking focuses on three key areas: training data, reinforcement learning (RL) algorithms, and RL infrastructure [10][12][60]
- The training data includes a mix of STEM problems, coding tasks, and logic reasoning, with a strong emphasis on chain-of-thought data for supervised fine-tuning [10][15][23]
- The RL training employs innovative frameworks like VAPO and DAPO to address instability issues, ensuring robust training trajectories [12][10]

Infrastructure and Efficiency
- The model utilizes a hybrid engine architecture and a Streaming Rollout System (SRS) to enhance training efficiency and scalability [2][42][44]
- The SRS architecture allows for dynamic adjustments in sample ratios and optimizes memory usage, significantly improving training speed [43][44]

Future Directions
- The team plans to explore more efficient RL methods and tackle more complex tasks, aiming to push the boundaries of the model's intelligence [2][60]
- Upcoming releases will include internal benchmarks like BeyondAIME and Codeforces to support further research in the field [2][5]
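VAPO and DAPO are not specified in enough detail here to reproduce, but both build on the PPO-style clipped policy-gradient surrogate, so a generic sketch of that objective helps ground the "RL algorithms" discussion. The decoupled lower and upper clip ranges mirror the clip-higher idea associated with DAPO; the specific values, names, and signature below are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def clipped_policy_loss(logp_new, logp_old, advantages, clip_low=0.2, clip_high=0.28):
    """Generic PPO-style clipped surrogate loss over a batch of sampled tokens.

    DAPO-style variants decouple the lower and upper clip ranges, which is why
    clip_low and clip_high are separate parameters here; the values are illustrative.
    """
    ratio = np.exp(logp_new - logp_old)                  # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    return -np.mean(np.minimum(unclipped, clipped))      # maximize surrogate => minimize negative

rng = np.random.default_rng(0)
print(clipped_policy_loss(rng.normal(0, 0.1, 256), rng.normal(0, 0.1, 256), rng.normal(size=256)))
```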
From GPT-2 to gpt-oss: a detailed look at the evolution of OpenAI's open models
机器之心· 2025-08-18 05:15
Core Insights
- OpenAI has released gpt-oss-120b and gpt-oss-20b, its first open-weight models since GPT-2 in 2019, which can run locally thanks to various optimizations [4][5]
- The article provides a detailed analysis of the architectural advancements from GPT-2 to gpt-oss and compares gpt-oss with Qwen3 [4][5]

Model Architecture Overview
- gpt-oss-20b can run on consumer-grade GPUs with 16 GB of memory, while gpt-oss-120b requires a single H100 GPU with 80 GB of memory or more [10]
- The architecture of the gpt-oss models appears conventional, as leading LLM developers tend to use similar foundational architectures with minor adjustments [10][11]

Changes Since GPT-2
- The article highlights significant changes from GPT-2, including the removal of Dropout, the adoption of RoPE for positional encoding, and the replacement of GELU with Swish/SwiGLU [20][22][29]
- Mixture of Experts (MoE) layers increase parameter capacity while maintaining efficiency by activating only a subset of experts for each token [39]
- Grouped Query Attention (GQA) is introduced as a more efficient alternative to Multi-Head Attention (MHA) [41]
- Sliding-window attention is applied in gpt-oss to reduce memory usage and computational cost [47]
- RMSNorm replaces LayerNorm for better efficiency in large-scale LLMs [52]

Comparison with Qwen3
- gpt-oss-20b has a wider architecture with more attention heads, while Qwen3 has a deeper architecture with more transformer blocks [69][70]
- gpt-oss uses fewer but larger experts, whereas Qwen3 uses more, smaller experts [72]
- Both models use grouped query attention, but gpt-oss additionally applies sliding-window attention to limit the attended context [82]

Additional Insights
- The gpt-oss models are reasoning models, and users can easily adjust the reasoning effort [93]
- The training compute for gpt-oss is estimated at 2.1 million H100 GPU hours, comparable to other large models [92]
- The MXFP4 optimization allows the gpt-oss models to run on a single GPU, enhancing accessibility [98]
- Benchmark results indicate that gpt-oss performs comparably to proprietary models, although it has not yet been extensively tested [101][106]
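Two of the architectural changes listed above, grouped query attention and sliding-window attention, compose naturally into one small function. The NumPy sketch below illustrates the mechanisms only; it is not OpenAI's or Qwen's implementation, and the head counts, shapes, and window size are arbitrary.

```python
import numpy as np

def gqa_attention(q, k, v, window=None):
    """Grouped-query attention with an optional causal sliding window.

    q: [n_q_heads, T, d]; k, v: [n_kv_heads, T, d] with n_q_heads % n_kv_heads == 0.
    Each group of query heads shares one K/V head (GQA). `window` limits how far
    back each token may attend (sliding-window attention); None means full causal.
    """
    n_q, T, d = q.shape
    group = n_q // k.shape[0]
    k = np.repeat(k, group, axis=0)            # share each K/V head across its query group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    i, j = np.arange(T)[:, None], np.arange(T)[None, :]
    mask = j > i                               # causal: no attending to the future
    if window is not None:
        mask |= j < i - window + 1             # sliding window: only the last `window` tokens
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
out = gqa_attention(rng.normal(size=(8, 16, 32)), rng.normal(size=(2, 16, 32)),
                    rng.normal(size=(2, 16, 32)), window=4)
print(out.shape)  # (8, 16, 32): 8 query heads served by 2 shared K/V heads
```

Sharing K/V heads shrinks the KV cache roughly by the group factor, and the window cap keeps attention memory linear in the window size rather than the full context, which is the point of both techniques.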
Track Hyper | Matching the world's best: the Qwen3 reasoning model goes open source
Hua Er Jie Jian Wen· 2025-08-06 08:06
Core Insights
- Alibaba has launched the Qianwen 3 (Qwen3) reasoning model, the first in the Qianwen series to adopt a Mixture of Experts (MoE) architecture, featuring 480 billion total parameters and supporting a context length of 256K tokens, expandable to 1 million tokens, which significantly enhances programming efficiency [1][3][5]

Group 1: Model Performance and Features
- The MoE architecture delivers efficient performance by routing tasks to specialized experts, ensuring both efficiency and the ability to handle complex requirements [1][3]
- The Qianwen 3 model shows significant improvements in knowledge retention, programming, and mathematics, comparable to top proprietary models such as Gemini 2.5 Pro and o4-mini [1][3]
- The model excels at long documents and multi-turn dialogues, reducing the risk of losing critical information [3][4]

Group 2: Market Impact and Applications
- The open-sourcing of these models has attracted attention from both developers and business decision-makers, potentially reshaping the application landscape in the AI field [2][6]
- The Qianwen 3 model has been recognized for outstanding performance in various assessments, including knowledge coverage and code accuracy, outperforming models such as Claude 4 [4][5]
- API call volume for Alibaba's Qianwen has surged past 100 billion tokens, indicating its popularity among developers, especially small and medium-sized teams [6][7]

Group 3: Ecosystem and Integration
- Integrating AI models with cloud products enhances customer engagement and deepens the use of Alibaba Cloud services, driving demand for GPU resources and IaaS [7][8]
- The Qianwen models are being used across industries such as education and finance to build personalized solutions and conduct risk assessments [6][10]
- The open-source nature of these models lowers the barrier for small enterprises to adopt AI technologies, promoting a diverse AI open-source community globally [6][7]
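For readers who want to try an open-weight Qwen3 MoE checkpoint, the sketch below follows the standard Hugging Face transformers loading pattern. The model id is an assumption for illustration (the article does not name the exact released checkpoint), and a model of this scale needs a multi-GPU server regardless of how simple the code looks.

```python
# Minimal sketch of loading an open-weight Qwen3 MoE checkpoint with Hugging Face
# transformers. The model id is an assumption; check the Qwen organization on the
# Hub for the checkpoints that were actually released.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Coder-480B-A35B-Instruct"  # assumed id: 480B total, sparsely activated
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```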
DeepSeek strikes again! The upgraded R1 delivers a big performance boost; are its US rivals rattled?
Jin Shi Shu Ju· 2025-05-30 03:52
Core Insights
- DeepSeek's R1 model has undergone a minor version upgrade, enhancing semantic understanding, complex logical reasoning, and long-text processing stability [1]
- The upgraded model shows significant improvements in understanding and programming, capable of generating over 1000 lines of error-free code [1]
- The R1 model's cost-effectiveness is highlighted, being priced at 1/11 of Claude-3.7-Sonnet and 1/277 of GPT-4.5, while being open-source for commercial use [1]

Group 1
- The R1 model has gained global attention since its January release, outperforming Western competitors and causing a drop in tech stocks [2]
- Following the release of the V3 model, interest in DeepSeek has shifted towards the anticipated R2 model, which is expected to use a Mixture of Experts architecture with 1.2 trillion parameters [2]
- The latest version, R1-0528, has sparked renewed media interest, showcasing competitive performance against OpenAI's models in code generation [2]

Group 2
- DeepSeek's low-cost, high-performance R1 model has positively influenced the Chinese tech stock market and reflects optimistic market expectations regarding China's AI capabilities [2]
- The upgrade has also shown improvements in reducing hallucinations, indicating that DeepSeek is not only catching up but competing with top models [1]
CICC Joint Research | AI Ten-Year Outlook (23): AI + Companionship: technology cuts costs, scenarios move up-market, delivering deep emotional value
中金点睛· 2025-05-29 23:39
Core Viewpoint - AI companionship applications are rapidly emerging and gaining popularity, with significant market potential and user demand, particularly among younger demographics [2][7][8].

Group 1: Market Overview
- The global AI companionship market is projected to reach approximately $30 million in 2023, with potential growth to $70 billion and $150 billion by 2030 under baseline and optimistic scenarios, respectively, reflecting a CAGR of 200% and 236% from 2024 to 2030 [7].
- Monthly active users (MAU) of AI companionship products increased nearly 30 times, from under 500,000 to about 15 million, between 2018 and 2023, outpacing the growth rates of social media and online gaming [7][8].

Group 2: User Demographics and Needs
- The primary user base for AI companionship applications consists of younger individuals seeking emotional support, entertainment, and efficiency improvements [2][8].
- Users exhibit a higher tolerance for AI imperfections in companionship scenarios than in productivity applications, where accuracy is paramount [8].

Group 3: Technological Innovations
- The use of Mixture of Experts (MoE) models has significantly reduced costs and improved efficiency in AI dialogue scenarios, enabling better user experiences [16][18].
- Advances in long-text capabilities and linear attention mechanisms are expected to enhance user interactions by allowing more coherent and contextually relevant conversations [23][24].
- Multimodal capabilities, including image, audio, and video generation, are becoming essential for enriching user experiences and increasing engagement [27][30].

Group 4: Application Landscape
- Notable AI companionship applications include Replika, Character.AI, MiniMax's Talkie, and others, each focusing on different aspects such as emotional support, interactive content, and user-generated content [3][41][44].
- Character.AI has emerged as a market leader, achieving a peak MAU of 22 million by August 2024, driven by its strong technical foundation and user-engagement strategies [36][37].

Group 5: Future Directions
- The industry is expected to explore hardware integration to enhance user experiences, particularly in educational and gaming contexts, targeting broader demographics including children and the elderly [64][65].
- The potential for AI companionship applications to evolve into comprehensive content platforms, akin to TikTok or Xiaohongshu, is being discussed, with a focus on user engagement and emotional connections [59][60].
DeepSeek's R1 model completes a "minor-version trial upgrade"; programming and logical understanding move up a level!
Hua Er Jie Jian Wen· 2025-05-29 00:57
Core Viewpoint - DeepSeek has released an updated version of its R1 model, enhancing its capabilities in semantic understanding, complex logical reasoning, and long-text processing stability, amidst escalating competition in the AI sector [1][2].

Group 1: Model Enhancements
- The R1 model has significantly improved its understanding capabilities, with user feedback indicating a notable increase in performance, particularly in activating parameters and presenting key information logically [3].
- Programming capabilities have also seen a substantial upgrade, with users reporting the ability to generate over 1000 lines of code without bugs [4].
- The R1 model is now considered competitive with Claude 4, a leading programming model [5].

Group 2: Previous Model Performance
- Earlier this year, DeepSeek released the DeepSeek-V3-0324 model, which outperformed Claude-3.7-Sonnet in various assessments, particularly in mathematics and coding tasks, and was noted for its strong performance in reasoning tasks despite being a non-reasoning model [6].
- The cost-effectiveness of the R1 model is highlighted, being priced at only 1/11 of Claude-3.7-Sonnet and 1/277 of GPT-4.5, while also being open-source and free for commercial use [7].

Group 3: Market Impact
- The emergence of the R1 model has led to a decline in global tech stocks, as investors question the necessity of significant investments by companies like Microsoft in developing advanced AI models and services [8].

Group 4: Future Developments
- There is ongoing speculation regarding the release of the R2 model, which is expected to enhance code generation and reasoning in multiple languages; initial plans set its release for early May [9].
- The R2 model is anticipated to use a more advanced Mixture of Experts architecture, with a total parameter count projected to reach 1.2 trillion, significantly reducing reasoning costs compared to GPT-4 [10].
- Despite the speculation, DeepSeek has not officially confirmed any details regarding the R2 model's release timeline [11].
Huawei's Pangu makes its first public appearance: an Ascend-native 72B MoE architecture, tied for first in China on SuperCLUE among models under 100 billion parameters
Hua Er Jie Jian Wen· 2025-05-29 00:57
Core Insights
- The Mixture of Grouped Experts (MoGE) model from Huawei's Pangu team addresses the inefficiencies of traditional Mixture of Experts (MoE) models, ensuring balanced computational load across devices while maintaining high performance [1][7][27]
- The Pangu Pro MoE model, with 72 billion total parameters and 16 billion active parameters, achieves competitive performance, ranking first in China among models with fewer than 100 billion parameters [2][22]

Group 1: Model Architecture and Efficiency
- The MoGE architecture introduces a grouping mechanism that ensures balanced expert activation, significantly improving computational efficiency and reducing system bottlenecks [1][6][12]
- The model demonstrates superior throughput, achieving 321 tokens/s on the Ascend 300I Duo platform and 1528 tokens/s on the Ascend 800I A2 platform, outperforming similar-sized dense models [18][26]

Group 2: Performance Metrics
- In the latest SuperCLUE ranking, Pangu Pro MoE scored 58.75, showcasing strong capabilities across reasoning tasks and outperforming other models in complex reasoning scenarios [3][22]
- The model performs well across multiple benchmarks, including English and Chinese language tasks, demonstrating versatility and adaptability in complex cognitive tasks [22][23][24]

Group 3: Industry Impact
- The introduction of Pangu Pro MoE signals a shift in the AI industry from a focus on parameter count to practical application, enabling efficient cloud inference and supporting high-concurrency real-time scenarios [27]
- Huawei's innovations in the MoE architecture redefine the value of large models, providing a robust foundation for AI applications across industries [27]
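The grouping idea behind MoGE can be illustrated with a few lines of routing code: partition the experts into equal groups (for example, one group per device) and force every token to pick the same number of experts from each group, so per-group load is balanced by construction. This is a sketch of the idea as described above, not Huawei's implementation; names and sizes are illustrative.

```python
import numpy as np

def grouped_topk_routing(router_logits: np.ndarray, num_groups: int, k_per_group: int):
    """Group-constrained top-k routing in the spirit of MoGE.

    Experts are split into `num_groups` equal groups (e.g. one group per device);
    each token selects exactly `k_per_group` experts from every group, so the
    number of activated experts per group is identical for all tokens.

    router_logits: [num_tokens, num_experts]
    Returns global expert indices of shape [num_tokens, num_groups * k_per_group].
    """
    num_tokens, num_experts = router_logits.shape
    assert num_experts % num_groups == 0
    group_size = num_experts // num_groups
    grouped = router_logits.reshape(num_tokens, num_groups, group_size)
    # Top-k within each group, then map local ids back to global expert ids.
    local_top = np.argsort(-grouped, axis=-1)[..., :k_per_group]
    offsets = (np.arange(num_groups) * group_size)[None, :, None]
    return (local_top + offsets).reshape(num_tokens, -1)

rng = np.random.default_rng(0)
chosen = grouped_topk_routing(rng.normal(size=(8, 64)), num_groups=8, k_per_group=1)
print(chosen.shape)  # (8, 8): every token activates exactly one expert per group
```

Compared with unconstrained global top-k, this scheme cannot dump most of a batch's tokens onto the experts of a single device, which is exactly the load-balance property the article attributes to MoGE.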