Mixture of Experts (MoE)
EMNLP 2025 | BIGAI Reveals MoE Interpretability, Boosting Context Faithfulness!
机器之心· 2025-11-15 06:23
Core Insights
- The article discusses the integration of Mechanistic Interpretability with Mixture-of-Experts (MoE) models, highlighting the importance of understanding the underlying mechanisms to enhance model performance and explainability [4][5][6].

Group 1: Mechanistic Interpretability and MoE
- Many teams work on MoE models, but few focus on Mechanistic Interpretability, making this a rare and valuable area of research [4].
- The article proposes a method called "Router Lens & CEFT" aimed at improving context faithfulness in language models, which has been accepted to EMNLP 2025 [7][9].
- The research identifies experts within MoE models that are particularly adept at utilizing contextual information, termed "Context-Faithful Experts" [14][18].

Group 2: Context Faithfulness and Expert Specialization
- Context faithfulness refers to the model's ability to generate responses based strictly on the provided context, avoiding irrelevant information [10].
- The study confirms the existence of context-faithful experts within MoE models, demonstrating that adjusting expert activation can significantly enhance context utilization [18][20].
- The Router Lens method identifies these experts by calibrating routing behavior to reflect their true capabilities [16].

Group 3: Performance Improvements and Efficiency
- The CEFT method, which fine-tunes only the identified context-faithful experts (see the sketch below), matches or exceeds full-parameter fine-tuning while training far fewer parameters [41][44].
- CEFT requires training only 500 million parameters versus 6.9 billion for full fine-tuning, a 13.8x reduction in trainable parameter count [44].
- CEFT is more resistant to catastrophic forgetting than full fine-tuning, as evidenced by performance across various benchmarks [46].

Group 4: Future Applications and Research Directions
- The Router Lens method can be applied to identify and analyze other types of experts, such as those specialized in reasoning or programming [50].
- It can also help debug MoE models by locating poorly performing or misleading experts [51].
- Combining Router Lens with other interpretability techniques could further deepen understanding of expert behavior and knowledge distribution within models [51].
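The parameter-efficiency claim is easy to picture in code. Below is a minimal PyTorch sketch of CEFT-style selective fine-tuning under stated assumptions: the module path `model.layers[i].mlp.experts` and the `faithful_experts` mapping are hypothetical placeholders (real checkpoints name these modules differently), and the expert indices would come from a Router-Lens-style analysis; this is not the paper's released code.

```python
import torch

def apply_ceft(model, faithful_experts):
    """Freeze everything, then unfreeze only the experts identified as
    context-faithful. faithful_experts maps layer index -> expert indices.
    Assumes a common MoE layout: model.layers[i].mlp.experts[j] (hypothetical).
    """
    for p in model.parameters():
        p.requires_grad = False  # freeze the full network

    trainable = 0
    for layer_idx, expert_ids in faithful_experts.items():
        experts = model.layers[layer_idx].mlp.experts
        for j in expert_ids:
            for p in experts[j].parameters():
                p.requires_grad = True  # train only the chosen experts
                trainable += p.numel()
    return trainable

# Usage: a standard fine-tuning loop, with the optimizer built from the
# small unfrozen subset only.
# n = apply_ceft(model, {5: [3, 17], 11: [0, 9]})   # indices are illustrative
# opt = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-5)
```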
Breaking MoE's "Bigger Scale, Lower Efficiency" Dilemma: the Institute of Automation, Chinese Academy of Sciences Proposes a New Framework
量子位· 2025-10-11 01:15
Core Viewpoint
- The article discusses a new research breakthrough from the Institute of Automation, Chinese Academy of Sciences, which addresses challenges faced by large language models (LLMs) using a dynamic "group learning" approach to optimize the Mixture of Experts (MoE) framework, significantly reducing parameter count and improving efficiency [1][12].

Summary by Sections

MoE Challenges
- MoE has been a key method for expanding parameter size in LLMs while keeping computational costs linear, but it faces three main challenges that hinder practical deployment: load imbalance, parameter redundancy, and communication overhead [2][5].
- These challenges stem from hardware limitations, leading to fragmented optimization efforts that fail to address the underlying issues cohesively [6][8].

Research Findings
- The research team discovered that experts activated by semantically similar inputs exhibit structural redundancy, providing a theoretical basis for a dynamic and structured organization of experts [10][11].
- The proposed framework achieves an 80% reduction in total parameter count, a 10%-20% increase in throughput, and a significant decrease in peak memory consumption, making it comparable to lightweight dense models [11][34].

Unified Framework
- The framework formalizes MoE optimization as a unified mathematical problem that minimizes task loss, load imbalance, parameter redundancy, and communication cost simultaneously [13].
- Four core technical components realize this unified optimization: online dual similarity clustering, shared basis and low-rank residual compression, hierarchical routing, and heterogeneous precision with dynamic memory management [13][30].

Technical Components
1. **Online Dual Similarity Clustering**: Dynamically reorganizes expert groups based on structural and functional similarities, addressing load imbalance [14][16].
2. **Shared Basis and Low-Rank Residual Compression**: Reduces redundancy by sharing a common weight matrix among similar experts while representing each expert's unique characteristics with low-rank matrices (see the sketch below) [19][22].
3. **Hierarchical Routing**: A two-stage routing strategy reduces computational complexity and communication overhead by first selecting clusters and then experts within those clusters [24][29].
4. **Heterogeneous Precision and Dynamic Memory Management**: Optimizes memory usage by employing different numerical precisions for different components and dynamically unloading inactive expert parameters from GPU memory [30][31].

Experimental Validation
- Comprehensive experiments on standard NLP benchmarks showed that the framework maintains comparable model quality while cutting total parameters by roughly 80% and peak memory consumption by nearly 50% relative to baseline models [34][36].
- Ablation studies confirmed the essential contributions of online clustering, low-rank compression, and hierarchical routing to the overall performance improvements [37].
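A minimal sketch of the shared-basis-plus-low-rank-residual idea: each expert in a cluster reuses one shared weight matrix and stores only a small low-rank correction, W_i ≈ W_shared + A_i·B_i. The shapes, rank, and initialization below are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

class LowRankResidualExpert(nn.Module):
    """One expert in a cluster: a weight shared by the whole cluster
    plus a per-expert low-rank residual (sketch)."""
    def __init__(self, shared_weight: nn.Parameter, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.shared = shared_weight  # owned by the cluster, not this expert
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual path costs O(rank * (d_in + d_out)) per token
        # instead of a full O(d_in * d_out) matrix per expert.
        return x @ self.shared.T + (x @ self.B.T) @ self.A.T

d_in, d_out, n_experts, rank = 1024, 4096, 8, 8   # illustrative sizes
shared = nn.Parameter(torch.randn(d_out, d_in) * 0.01)  # one matrix per cluster
cluster = nn.ModuleList(
    LowRankResidualExpert(shared, d_in, d_out, rank) for _ in range(n_experts)
)
# Storage: one d_out*d_in matrix plus n_experts small residuals,
# versus n_experts full matrices without sharing.
```

Hierarchical routing then composes naturally with this layout: a first-stage router picks a cluster, and a second-stage router picks experts inside it, so each token only scores a cluster's worth of experts.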
Are Both China and the U.S. Ultimately Headed Toward the AI Era?
Sou Hu Cai Jing· 2025-10-08 20:55
Core Insights
- The development trajectories of China and the U.S. are clearly pointing toward the era of artificial intelligence, driven by technological iteration and industrial upgrading, but with significant differences in development paths and focus areas [1][3].

Group 1: Technological Development
- The U.S. maintains an advantage in foundational algorithms, large model architectures (e.g., the original BERT framework), and core patent fields, focusing on fundamental breakthroughs within its research ecosystem [1].
- China leverages its vast user base, mobile internet accumulation (e.g., mobile payments and e-commerce), and industrial chain collaboration to accelerate scenario-based applications, with some areas already surpassing the U.S. in user experience [1].

Group 2: Policy and Strategic Approaches
- The U.S. strategy aims to reinforce its technological hegemony through export controls, standard-setting, and collaboration with allies to curb competitors [3].
- China's approach, in contrast, leverages its manufacturing foundation and data-scale advantages, emphasizing the integration of AI with the real economy [3].

Group 3: Competitive Landscape
- Innovation focus differs: the U.S. prioritizes foundational theory and general-purpose large models, while China emphasizes scenario applications and engineering implementation [5].
- Competitive advantages differ as well: the U.S. excels in academic originality and global standard leadership, whereas China leads in commercialization speed and market scale [5].

Group 4: Future Competition Focus
- Competition between the two nations will center on three technological lines: the proliferation of agents, cost reduction and efficiency gains through Mixture of Experts (MoE) models, and the creation of incremental markets through multimodal integration [7].
- China's 5-8-year lead gained during the mobile internet era may provide a crucial springboard for competition in AI applications [7].
Breaking Through the AGI Fog, Ant Group Sees a New Signpost
雷峰网· 2025-09-16 10:20
Core Viewpoint
- The article discusses the current state of large language models (LLMs) and the challenges they face on the path to Artificial General Intelligence (AGI), emphasizing the need for new paradigms beyond existing autoregressive (AR) models [4][10][18].

Group 1: Current Challenges in AI Models
- Ilya Sutskever, a prominent AI researcher, warns that data extraction has reached its limits, hindering progress toward AGI [2][4].
- Existing LLMs often exhibit stark performance discrepancies: some outperform human experts while others struggle with basic tasks [13][15].
- The autoregressive paradigm's limitations include the lack of bidirectional modeling and the inability to correct errors during generation, leading to fundamental misunderstandings in tasks like translation and medical diagnosis [26][27][18].

Group 2: New Directions in AI Research
- Elon Musk proposes a "purified data" approach that would rewrite human knowledge as a potential pathway to AGI [5].
- Researchers are exploring multimodal approaches, with experts like Fei-Fei Li emphasizing visual understanding as a cornerstone of intelligence [8].
- A new paradigm, the diffusion model, is being advanced by young scholars; it contrasts with the traditional autoregressive approach by allowing parallel decoding and iterative correction (see the sketch below) [12][28].

Group 3: Development of LLaDA-MoE
- The LLaDA-MoE model, based on diffusion theory, was announced as a significant advancement in the field, showcasing a new approach to language modeling [12][66].
- LLaDA-MoE has 7 billion total parameters, of which 1.4 billion are activated, and was trained on approximately 20 trillion tokens of data, demonstrating its scalability and stability [66][67].
- Benchmark results indicate that the model can compete with existing autoregressive models, suggesting a viable alternative path for future AI development [67][71].

Group 4: Future Prospects and Community Involvement
- LLaDA-MoE represents a milestone in the exploration of diffusion models, with plans for further scaling and improvement [72][74].
- The team emphasizes the importance of community collaboration in advancing diffusion-model research, much as it drove the development of autoregressive models [74][79].
- Ant Group's commitment to investing in AGI research reflects a strategic shift toward innovative and potentially high-risk areas of AI [79].
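The contrast with autoregressive decoding can be made concrete as a masked-diffusion denoising loop: start from an all-mask sequence, predict every position in parallel, commit the most confident tokens, and leave the rest masked for the next pass, so earlier choices can effectively be revisited. This is a generic illustration of the decoding style such models use; the `model` interface and the confidence-based schedule are assumptions, not Ant Group's implementation.

```python
import torch

@torch.no_grad()
def diffusion_decode(model, length, mask_id, steps=8):
    """Iterative parallel decoding for a masked diffusion LM (sketch)."""
    x = torch.full((1, length), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(x)                      # (1, length, vocab_size)
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)             # best token + confidence per slot
        still_masked = x.eq(mask_id)
        conf = conf.masked_fill(~still_masked, -1.0)  # rank only masked slots
        # Commit roughly length/steps of the positions each step.
        quota = int(length * (step + 1) / steps) - int(length * step / steps)
        k = min(max(quota, 1), int(still_masked.sum()))
        top = conf.topk(k, dim=-1).indices[0]
        x[0, top] = pred[0, top]               # fill the most confident slots
    return x
```

Unlike left-to-right generation, every remaining position is re-predicted at each step with full bidirectional context over the tokens committed so far.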
ByteDance: 2025 Technical Report on the Seed-Thinking-v1.5 Reasoning Model
Sou Hu Cai Jing· 2025-08-22 09:20
Core Insights
- ByteDance has introduced Seed1.5-Thinking, a state-of-the-art reasoning model with 20 billion activated parameters out of 200 billion total, demonstrating exceptional reasoning capabilities across various benchmarks [1][5][60].
- The model achieved 86.7 on AIME 2024, 55.0 on Codeforces, and 77.3 on GPQA, showcasing its strengths in STEM and coding tasks while also exhibiting strong generalization in non-reasoning tasks [1][5][49].

Model Performance
- Seed1.5-Thinking matches OpenAI's o3-mini-high on AIME 2024 but still lags on the AIME 2025 and BeyondAIME challenges [2][49].
- On the GPQA task, Seed1.5-Thinking performs close to o3 level, scoring 77.3% [49].
- The model outperforms DeepSeek R1 by 8% in overall user preference on non-reasoning tasks, indicating broader applicability [1][5][51].

Development Aspects
- Development focused on three key areas: training data, reinforcement learning (RL) algorithms, and RL infrastructure [10][12][60].
- The training data mixes STEM problems, coding tasks, and logic reasoning, with a strong emphasis on chain-of-thought data for supervised fine-tuning [10][15][23].
- RL training employs innovative frameworks such as VAPO and DAPO to address instability, ensuring robust training trajectories [12][10].

Infrastructure and Efficiency
- The model utilizes a hybrid engine architecture and a Streaming Rollout System (SRS) to enhance training efficiency and scalability (a generic sketch of the streaming-rollout pattern follows below) [2][42][44].
- The SRS architecture allows dynamic adjustment of sample ratios and optimizes memory usage, significantly improving training speed [43][44].

Future Directions
- The team plans to explore more efficient RL methods and tackle more complex tasks, aiming to push the boundaries of the model's intelligence [2][60].
- Upcoming releases will include internal benchmarks such as BeyondAIME and Codeforces sets to support further research in the field [2][5].
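The report's SRS details are not reproduced here, but the general streaming-rollout pattern it gestures at (generators continuously feed completed trajectories into a bounded queue; the trainer consumes them asynchronously and can mix batches freely) can be sketched generically. Everything below, including `policy_snapshot.generate`, is a hypothetical stand-in, not ByteDance's system.

```python
import queue
import threading

rollouts = queue.Queue(maxsize=64)  # bounded: caps memory held by finished rollouts

def generator_worker(policy_snapshot, prompts):
    """Stream rollouts to the trainer as they finish, instead of
    blocking until an entire synchronous batch completes."""
    for prompt in prompts:
        traj = policy_snapshot.generate(prompt)  # hypothetical generate()
        rollouts.put(traj)                       # blocks only when queue is full

def trainer_loop(optimizer_step, batch_size=8):
    """Consume rollouts as they arrive; batch composition (and hence the
    sample-ratio mix) can be adjusted dynamically between steps."""
    batch = []
    while True:
        batch.append(rollouts.get())
        if len(batch) == batch_size:
            optimizer_step(batch)
            batch.clear()

# threading.Thread(target=generator_worker, args=(policy, prompts),
#                  daemon=True).start()
# trainer_loop(rl_update)
```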
From GPT-2 to gpt-oss: A Deep Dive into the Evolution of OpenAI's Open Models
机器之心· 2025-08-18 05:15
Core Insights
- OpenAI has released gpt-oss-120b and gpt-oss-20b, its first open-weight models since GPT-2 in 2019; thanks to optimizations, they can run locally [4][5].
- The article provides a detailed analysis of the architectural advances from GPT-2 to gpt-oss and compares gpt-oss with Qwen3 [4][5].

Model Architecture Overview
- gpt-oss-20b can run on consumer-grade GPUs with 16 GB of VRAM, while gpt-oss-120b requires a single H100 GPU with 80 GB of memory or more [10].
- The gpt-oss architecture appears conventional, as leading LLM developers often build on similar foundational architectures with minor adjustments [10][11].

Changes Since GPT-2
- Significant changes from GPT-2 include the removal of Dropout, the adoption of RoPE for positional encoding, and the replacement of GELU with Swish/SwiGLU [20][22][29].
- Mixture of Experts (MoE) layers increase parameter capacity while maintaining efficiency by activating only a subset of experts for each token [39].
- Grouped Query Attention (GQA) is introduced as a more efficient alternative to Multi-Head Attention (MHA) [41].
- gpt-oss applies sliding-window attention to reduce memory usage and computational cost (see the sketch below, which combines GQA with a sliding window) [47].
- RMSNorm replaces LayerNorm for better efficiency in large-scale LLMs [52].

Comparison with Qwen3
- gpt-oss-20b has a wider architecture with more attention heads, while Qwen3 is deeper, with more transformer blocks [69][70].
- gpt-oss uses fewer but larger experts, whereas Qwen3 uses more, smaller experts [72].
- Both models use grouped query attention, but gpt-oss additionally applies sliding-window attention to limit context size [82].

Additional Insights
- gpt-oss models are designed for inference, letting users easily control the inference workload [93].
- Training compute for gpt-oss is estimated at 2.1 million H100 GPU-hours, comparable to other large models [92].
- MXFP4 quantization allows gpt-oss models to run on a single GPU, enhancing accessibility [98].
- Benchmark results indicate that gpt-oss performs comparably to proprietary models, although it has not yet been extensively tested [101][106].
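Two of the attention changes the article names, grouped-query attention and sliding-window masking, fit in one compact sketch. Head counts and window size below are illustrative, not gpt-oss's actual configuration.

```python
import torch
import torch.nn.functional as F

def gqa_sliding_window(q, k, v, window):
    """Grouped-query attention with a sliding-window causal mask (sketch).

    q: (B, n_q_heads, T, d); k, v: (B, n_kv_heads, T, d).
    Several query heads share each K/V head (GQA shrinks the KV cache);
    each token attends only to the last `window` positions.
    """
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # expand K/V to match query heads
    v = v.repeat_interleave(group, dim=1)

    T, d = q.shape[-2], q.shape[-1]
    i = torch.arange(T).unsqueeze(1)        # query positions
    j = torch.arange(T).unsqueeze(0)        # key positions
    mask = (j <= i) & (j > i - window)      # causal AND within the window

    att = (q @ k.transpose(-2, -1)) / d**0.5
    att = att.masked_fill(~mask, float("-inf"))
    return F.softmax(att, dim=-1) @ v

# Illustrative shapes: 32 query heads sharing 8 K/V heads, 128-token window.
B, T, d = 1, 256, 64
out = gqa_sliding_window(torch.randn(B, 32, T, d), torch.randn(B, 8, T, d),
                         torch.randn(B, 8, T, d), window=128)
```

The mask keeps attention memory at O(T·window) per head instead of O(T²), which is exactly where the article locates gpt-oss's savings.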
赛道Hyper | Matching the Global Top Tier: Qwen3 Reasoning Model Goes Open Source
Hua Er Jie Jian Wen· 2025-08-06 08:06
Core Insights
- Alibaba has launched the Qwen3 reasoning model, the first in the Qwen series to adopt a Mixture of Experts (MoE) architecture, with 480 billion total parameters and support for a 256K-token context length, expandable to 1 million tokens, significantly boosting programming efficiency [1][3][5].

Group 1: Model Performance and Features
- The MoE architecture routes tasks to specialized experts, combining efficiency with the capacity to handle complex requirements (a back-of-the-envelope sketch of activated vs. total parameters follows below) [1][3].
- Qwen3 shows significant gains in knowledge retention, programming, and mathematical operations, comparable to top proprietary models like Gemini 2.5 Pro and o4-mini [1][3].
- The model excels at long documents and multi-turn dialogues, reducing the risk of losing critical information [3][4].

Group 2: Market Impact and Applications
- The open-sourcing of these models has drawn attention from both developers and business decision-makers, potentially reshaping the AI application landscape [2][6].
- Qwen3 has been recognized for outstanding results across assessments of knowledge coverage and code accuracy, outperforming models such as Claude 4 [4][5].
- API call volume for Alibaba's Qwen has surged past 100 billion tokens, indicating its popularity among developers, especially small and medium-sized teams [6][7].

Group 3: Ecosystem and Integration
- Integrating AI models with cloud products deepens customer engagement with Alibaba Cloud services, driving demand for GPU resources and IaaS [7][8].
- Qwen models are being used across industries, including education and finance, to build personalized solutions and conduct risk assessments [6][10].
- Open-sourcing lowers the barrier for small enterprises to adopt AI technologies and fosters a diverse global AI open-source community [6][7].
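Why a 480B-parameter MoE can still be cheap to serve comes down to simple arithmetic: per-token compute tracks the activated subset, not the total. The numbers below are purely illustrative assumptions; the article does not state Qwen3's expert count or activated-parameter figure.

```python
# Back-of-the-envelope: activated vs. total parameters in a routed MoE.
# All values are illustrative assumptions, not Qwen3's actual config.
total_params = 480e9
expert_share = 0.9          # assume ~90% of weights sit in expert FFNs
n_experts, top_k = 160, 8   # hypothetical experts per layer / routed per token

expert_params = total_params * expert_share
shared_params = total_params - expert_params          # attention, embeddings, ...
active_params = shared_params + expert_params * top_k / n_experts

print(f"active per token: {active_params/1e9:.0f}B of {total_params/1e9:.0f}B")
# -> roughly 70B of 480B under these assumptions: per-token compute scales
#    with the activated subset, while total capacity (and memory footprint)
#    scales with all experts.
```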
DeepSeek Strikes Again! Big Performance Gains in the Upgraded R1; Are Its U.S. Rivals Panicking?
Jin Shi Shu Ju· 2025-05-30 03:52
Core Insights
- DeepSeek's R1 model has undergone a minor version upgrade, enhancing semantic understanding, complex logical reasoning, and the stability of long-text processing [1].
- The upgraded model shows significant improvements in understanding and programming, capable of generating over 1,000 lines of error-free code [1].
- The R1 model's cost-effectiveness is highlighted: it is priced at 1/11 of Claude-3.7-Sonnet and 1/277 of GPT-4.5, while being open source for commercial use [1].

Group 1
- The R1 model has gained global attention since its January release, outperforming Western competitors and causing a drop in tech stocks [2].
- Following the release of the V3 model, interest in DeepSeek has shifted toward the anticipated R2 model, expected to use a Mixture of Experts architecture with 1.2 trillion parameters [2].
- The latest version, R1-0528, has sparked renewed media interest, showcasing code-generation performance competitive with OpenAI's models [2].

Group 2
- DeepSeek's low-cost, high-performance R1 model has positively influenced the Chinese tech stock market and reflects optimistic market expectations regarding China's AI capabilities [2].
- The upgrade has also reduced hallucinations, indicating that DeepSeek is not only catching up but competing with top models [1].
CICC Joint Research | AI Decade Outlook (23): AI + Companionship: Technology-Driven Cost Reduction × Scenario Upgrading, Delivering Deep Emotional Value
中金点睛· 2025-05-29 23:39
Core Viewpoint
- AI companionship applications are rapidly emerging and gaining popularity, with significant market potential and user demand, particularly among younger demographics [2][7][8].

Group 1: Market Overview
- The global AI companionship market is projected at approximately $30 million in 2023, with potential growth to $70 billion (baseline) or $150 billion (optimistic) by 2030, implying a 2024-2030 CAGR of 200% and 236%, respectively [7].
- Monthly active users (MAU) of AI companionship products grew nearly 30-fold, from under 500,000 to about 15 million, between 2018 and 2023, outpacing the growth of social media and online gaming [7][8].

Group 2: User Demographics and Needs
- The primary user base consists of younger individuals seeking emotional support, entertainment, and efficiency improvements [2][8].
- Users tolerate AI imperfections far more readily in companionship scenarios than in productivity applications, where accuracy is paramount [8].

Group 3: Technological Innovations
- Mixture of Experts (MoE) models have significantly reduced costs and improved efficiency in AI dialogue scenarios, enabling better user experiences [16][18].
- Advances in long-text capabilities and linear attention mechanisms are expected to enhance interactions by allowing more coherent and contextually relevant conversations (see the linear-attention sketch below) [23][24].
- Multimodal capabilities, including image, audio, and video generation, are becoming essential for enriching user experiences and increasing engagement [27][30].

Group 4: Application Landscape
- Notable AI companionship applications include Replika, Character.AI, MiniMax's Talkie, and others, each focusing on different aspects such as emotional support, interactive content, and user-generated content [3][41][44].
- Character.AI has emerged as a market leader, reaching a peak MAU of 22 million by August 2024, driven by its strong technical foundation and user-engagement strategies [36][37].

Group 5: Future Directions
- The industry is expected to explore hardware integration to enhance user experiences, particularly in education and gaming, targeting broader demographics including children and the elderly [64][65].
- AI companionship applications could evolve into comprehensive content platforms, akin to TikTok or Xiaohongshu, with a focus on user engagement and emotional connection [59][60].
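The cost argument behind linear attention in long companion chats can be made concrete with a minimal causal linear-attention sketch in the style of kernel-feature-map attention: the (key, value) history folds into a fixed-size running state, so each new turn costs a constant amount rather than growing with conversation length. This illustrates the mechanism only; it is not any specific product's implementation.

```python
import torch

def linear_attention(q, k, v):
    """Causal linear attention via a running state (sketch).

    Replacing softmax(QK^T) with a positive feature map phi lets the
    history be summarized in a d x d state S, so step t costs O(d^2)
    regardless of how long the conversation already is.
    """
    phi = lambda x: torch.nn.functional.elu(x) + 1  # positive feature map
    q, k = phi(q), phi(k)
    B, T, d = q.shape
    S = torch.zeros(B, d, d)    # running sum of k^T v
    z = torch.zeros(B, d)       # running sum of k (normalizer)
    out = []
    for t in range(T):
        S = S + k[:, t].unsqueeze(-1) @ v[:, t].unsqueeze(1)
        z = z + k[:, t]
        num = (q[:, t].unsqueeze(1) @ S).squeeze(1)            # (B, d)
        den = (q[:, t] * z).sum(-1, keepdim=True) + 1e-6       # (B, 1)
        out.append(num / den)
    return torch.stack(out, dim=1)  # (B, T, d)

out = linear_attention(torch.randn(2, 512, 64), torch.randn(2, 512, 64),
                       torch.randn(2, 512, 64))
```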
DeepSeek R1 Completes a "Minor Version Upgrade"; Programming and Logical Understanding Reach a New Level!
华尔街见闻· 2025-05-29 00:57
Core Viewpoint
- DeepSeek has released an updated version of its R1 model, enhancing semantic understanding, complex logical reasoning, and the stability of long-text processing, amid escalating competition in the AI sector [1][2].

Group 1: Model Enhancements
- The R1 model's comprehension has improved markedly; user feedback notes a clear jump in performance, particularly in extracting and logically presenting key information [3].
- Programming capability has also been substantially upgraded, with users reporting the generation of over 1,000 lines of code without bugs [4].
- The R1 model is now considered competitive with Claude 4, a leading programming model [5].

Group 2: Previous Model Performance
- Earlier this year, DeepSeek released the DeepSeek-V3-0324 model, which outperformed Claude-3.7-Sonnet in various assessments, particularly mathematics and coding, and was noted for strong reasoning despite being a non-reasoning model [6].
- The R1 model's cost-effectiveness is striking: it is priced at only 1/11 of Claude-3.7-Sonnet and 1/277 of GPT-4.5, and it is open source and free for commercial use [7].

Group 3: Market Impact
- The emergence of the R1 model contributed to a decline in global tech stocks, as investors questioned the necessity of the massive investments by companies like Microsoft in advanced AI models and services [8].

Group 4: Future Developments
- Speculation continues about the R2 model, expected to enhance code generation and reasoning in multiple languages; initial plans targeted a release in early May [9].
- R2 is anticipated to use a more advanced Mixture of Experts architecture with a projected total of 1.2 trillion parameters, significantly reducing reasoning costs compared to GPT-4 [10].
- DeepSeek has not officially confirmed any details of the R2 model's release timeline [11].