DeepSeek Open-Sources a Memory Module for Large Models! New Paper Co-Signed by Liang Wenfeng Previews the Next Generation of Sparse Models
量子位· 2026-01-13 00:39
Core Insights
- The article discusses the introduction of "Conditional Memory" in Transformer models, which supplies a knowledge-retrieval mechanism absent from the original architecture [1][2][9]

Group 1: Introduction of Conditional Memory
- Conditional Memory is viewed as an essential modeling primitive for the next generation of sparse models [2]
- The research team, led by Liang Wenfeng in collaboration with Peking University, has proposed a new paradigm and implementation called the Engram module [3][5]

Group 2: Performance Improvements
- The Engram module allows a 27B-parameter model to outperform a pure MoE model of the same size, compressing tasks that originally required 6 layers of attention down to 1-2 layers and freeing resources for more complex reasoning [5][13]
- Sweeping the split of sparse parameters between MoE and Engram memory yields a U-shaped curve: allocating about 20%-25% of sparse parameters to Engram memory minimizes model validation loss [34][36]

Group 3: Technical Implementation
- Engram's design incorporates a large vocabulary of static entities and phrases, enabling O(1) information retrieval [7][14]
- The team addresses traditional N-gram model issues, such as semantic redundancy and storage explosion, by compressing tokens and using multiple hash functions to map N-grams into a fixed-size embedding table [22][25]

Group 4: Experimental Results
- The Engram-27B model shows significant improvements across benchmarks, with notable gains on BBH, ARC-Challenge, and DROP [47]
- The architecture allows a 100-billion-parameter table to be offloaded to CPU memory without significant latency impact during inference [63][66]

Group 5: Future Developments
- The next generation of sparse models from DeepSeek is expected to be released before the Spring Festival, indicating ongoing advancements in AI model architecture [67]
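The multi-hash N-gram lookup described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: the table size, number of hash functions, and averaging rule are all assumptions made for the example.

```python
import numpy as np

# Hypothetical sketch of an Engram-style N-gram memory: N-grams are hashed
# into a fixed-size embedding table with several hash functions, and the
# resulting rows are combined. All sizes here are illustrative assumptions.

TABLE_SIZE = 2**20   # fixed-size embedding table (rows)
EMBED_DIM = 64
NUM_HASHES = 3

rng = np.random.default_rng(0)
table = rng.standard_normal((TABLE_SIZE, EMBED_DIM)).astype(np.float32)

def ngram_slot(tokens, seed):
    """Hash a token N-gram to one row index per hash seed (O(1) lookup)."""
    return hash((seed,) + tuple(tokens)) % TABLE_SIZE

def lookup(tokens):
    """Retrieve a memory vector for an N-gram by averaging multi-hash rows.

    Using several hash functions reduces collision damage: two N-grams that
    collide under one hash are unlikely to collide under all of them.
    """
    rows = [table[ngram_slot(tokens, seed)] for seed in range(NUM_HASHES)]
    return np.mean(rows, axis=0)

vec = lookup(("New", "York", "City"))
print(vec.shape)  # (64,)
```

Because the lookup is a handful of hashes plus table reads, retrieval cost is constant regardless of vocabulary size, which is consistent with the O(1) claim and with offloading a very large table to CPU memory.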
Tencent Research Institute AI Digest 20251216
腾讯研究院· 2025-12-15 16:22
Group 1: Manus 1.6 Release
- Manus 1.6 Max has shifted from an "auxiliary tool" to an "independent contractor," raising user satisfaction by 19.2% and independently completing complex Excel financial modeling and data analysis [1]
- New mobile development features support the end-to-end app development process: users can generate runnable iOS and Android applications simply by describing their needs [1]
- The new Design View enables localized image editing, precise text rendering, and multi-layer composition, addressing the uncontrollability of AI-generated images [1]

Group 2: OpenAI Circuit-Sparsity Model
- OpenAI has released the Circuit-Sparsity model with only 0.4 billion parameters, forcing 99.9% of weights to zero and retaining only 0.1% non-zero weights to address model interpretability [2]
- The sparse model forms compact, readable "circuits" 16 times smaller than those of dense models, although it runs 100 to 1000 times slower [2]
- The research team proposed a "bridge network" that inserts encoder-decoder pairs between sparse and dense models, enabling interpretable behavior editing of existing large models [2]

Group 3: Thinking Machines Product Update
- Thinking Machines, founded by former OpenAI CTO Mira Murati, has opened access to Tinker, an API for developers to fine-tune language models [3]
- The update adds support for Kimi K2 Thinking fine-tuning (designed for long-chain reasoning) and Qwen3-VL visual input (available in 30B and 235B models) [3]
- A new OpenAI-API-compatible inference interface lets users integrate with any platform that supports the OpenAI API, simplifying LLM post-training [3]

Group 4: NotebookLM Integration with Gemini
- NotebookLM has officially integrated with Gemini, allowing users to add NotebookLM notes as data sources for Q&A within Gemini conversations [4]
- Gemini acts as a "hub" connecting multiple NotebookLM notes, working around NotebookLM's lack of notebook merging and enabling simultaneous queries across notes [4]
- NotebookLM content can now be combined with online information for mixed analysis of "personal data + global information," integrating into Google's core AI product line [4]

Group 5: Tongyi's Model Releases
- Tongyi Bailing has upgraded the Fun-CosyVoice3 model, cutting initial latency by 50% and doubling mixed Chinese-English recognition accuracy, with 9 languages and 18 dialects supported for cross-lingual cloning and emotion control [5]
- The Fun-ASR model achieves 93% accuracy in noisy environments, supports lyrics and rap recognition, covers 31 freely mixable languages, and reduces first-word latency to 160ms [5][6]
- The open-source Fun-CosyVoice3-0.5B provides zero-shot voice cloning, while the lightweight Fun-ASR-Nano-0.8B offers lower inference costs [6]

Group 6: Zoom's AI Claims
- Zoom claims a score of 48.1% on the "Humanity's Last Exam" (HLE) benchmark, surpassing Google Gemini 3 Pro's 45.8% by 2.3 percentage points [7]
- The company uses a "federated AI approach," combining its own small language model with open- and closed-source models from OpenAI, Anthropic, and Google, and selecting among outputs with a Z-scorer scoring system [7]
- The score has not appeared on the official HLE leaderboard, and on the same day Sup AI announced a score of 52.15%; the claim signals Zoom's ambition to become the AI hub of enterprise workflows [7]

Group 7: Gemini 3's CFA Exam Performance
- Recent research indicates that reasoning models have passed all levels of the CFA exam, with Gemini 3.0 Pro achieving a record 97.6% on Level 1 and GPT-5 leading Level 2 at 94.3% [8]
- On Level 3, Gemini 2.5 Pro scored 86.4% on multiple-choice questions, while Gemini 3.0 Pro reached 92.0% on open-ended questions, a significant improvement over previous years [8]
- Experts caution that passing exams does not equal practical capability: AI still struggles with ethics questions and cannot replace analysts' strategic thinking and client communication [8]

Group 8: OpenEvidence Valuation Surge
- OpenEvidence is raising $250 million in equity financing at a post-money valuation of $12 billion, double its previous round two months earlier [9]
- The company earns revenue by selling chatbot advertising space to pharmaceutical companies, with annual ad income of roughly $150 million, tripled since August, and a gross margin above 90% [9]
- An OffCall survey indicates about 45% of U.S. doctors use OpenEvidence, which answers roughly 20 million questions monthly, and its medical-journal sourcing is more accurate than general chatbots [9]

Group 9: OpenAI's Sora Development Insights
- OpenAI's Android version of Sora was built in just 28 days by 4 engineers collaborating with the Codex AI agent, consuming around 5 billion tokens, with roughly 85% of the code generated by AI [10]
- The team used an "exploration-validation-federation" workflow: Codex handled the heavy coding while engineers focused on architecture, user experience, and quality control, achieving a 99.9% crash-free rate [10]
- Codex now accounts for 70% of OpenAI's internal PRs each week, can monitor its own training runs and handle user feedback, forming a self-evolving loop of "AI iterating AI" [10]
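The "Z-scorer" behind Zoom's federated approach is not publicly documented; the sketch below only illustrates the general idea the name suggests. Standardizing each judge's scores as z-scores puts models that score on different scales onto a comparable footing before picking a winner. The function name, judge names, and selection rule are all assumptions for the example.

```python
import statistics

# Illustrative z-score-based output selection. The actual Z-scorer is
# proprietary; this only shows why z-normalization helps when combining
# judges (or models) that score on incompatible scales.

def select_output(candidates, raw_scores):
    """candidates: list of output strings.
    raw_scores: {judge_name: [score per candidate]}, arbitrary scales.
    Returns the candidate with the highest summed z-score across judges."""
    totals = [0.0] * len(candidates)
    for scores in raw_scores.values():
        mu = statistics.mean(scores)
        sigma = statistics.pstdev(scores) or 1.0  # guard against zero spread
        for i, s in enumerate(scores):
            totals[i] += (s - mu) / sigma
    best = max(range(len(candidates)), key=lambda i: totals[i])
    return candidates[best]

outputs = ["answer A", "answer B", "answer C"]
scores = {
    "judge_0_to_1":   [0.2, 0.9, 0.5],   # one judge scores in [0, 1]
    "judge_0_to_100": [30, 80, 95],      # another scores in [0, 100]
}
print(select_output(outputs, scores))  # → answer B
```

Without the normalization, the 0-100 judge would dominate simply because its numbers are larger; after z-scoring, both judges contribute equally.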
OpenAI Open-Sources Again: Just 0.4B, a Major Slim-Down for Models
36Kr· 2025-12-15 08:14
Core Insights
- OpenAI has introduced a new model called Circuit-Sparsity, which aims to enhance the interpretability of AI models via a sparse architecture in which 99.9% of the weights are zero, retaining only 0.1% non-zero weights [1][2][11]

Group 1: Model Characteristics
- Circuit-Sparsity is a sparse Transformer that simplifies the internal workings of AI, addressing the "black box" nature of large language models (LLMs) [1][6]
- The architecture allows the formation of compact, readable "circuits," significantly reducing the complexity of the model's decision-making processes [11][18]
- Compared with traditional dense models, the sparse model's circuits are 16 times smaller while maintaining similar task performance [11][13]

Group 2: Technical Innovations
- Key techniques include dynamic pruning, activation sparsification, and architectural adjustments that maintain sparsity without compromising performance [10][11]
- The model employs a new activation function, AbsTopK, retaining only the top 25% of activation values in critical areas to enhance interpretability [10]

Group 3: Performance and Limitations
- Despite its interpretability advantages, the sparse model runs 100 to 1000 times slower than dense models due to computational-efficiency bottlenecks [4][17]
- OpenAI has proposed a "Bridges" network to connect sparse and dense models, allowing modifications in the sparse model to be reflected in the dense one [17][18]

Group 4: Future Directions
- OpenAI plans to extend the technique to larger models and further explore the logic behind model behaviors [18]
- Future research will focus on extracting sparse circuits from existing dense models and developing more efficient training techniques for interpretable models [18]
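An AbsTopK-style activation can be sketched as follows: keep only the entries with the largest absolute values and zero the rest, preserving signs. The exact formulation in OpenAI's work may differ; the 25% keep-rate here simply mirrors the figure quoted above.

```python
import numpy as np

# Minimal sketch of an AbsTopK-style activation: keep the k entries with
# the largest absolute value per row and zero out the rest, preserving
# signs. Keep-rate of 25% mirrors the article; details are assumptions.

def abs_top_k(x, keep_frac=0.25):
    """Zero all but the top `keep_frac` fraction of |x| along the last axis."""
    x = np.asarray(x, dtype=np.float32)
    k = max(1, int(round(keep_frac * x.shape[-1])))
    # threshold = k-th largest absolute value along the last axis
    thresh = np.sort(np.abs(x), axis=-1)[..., -k][..., None]
    return np.where(np.abs(x) >= thresh, x, 0.0)

a = np.array([[0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 0.9, -0.4]])
print(abs_top_k(a))  # only -2.0 and 1.5 survive (k = 2 of 8)
```

Keeping by absolute value rather than raw value lets strong negative activations survive too, which matters when downstream weights can flip signs.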
OpenAI Opens Up a Bit Again: New Interpretability Research Released, with Authors from Ilya's Superalignment Team
量子位· 2025-11-15 02:08
Core Insights
- OpenAI has introduced a new method for training smaller models that enhances interpretability, making the internal mechanisms of models easier for humans to understand [5][6][7]
- The research focuses on sparse models with many neurons but few connections, simplifying neural networks for better comprehension [7][11]

Summary by Sections

Model Interpretability
- OpenAI's language models have complex structures that are not fully understood; the new method aims to bridge this gap [6]
- The core idea is to train sparse models that keep a high neuron count while limiting connections, making them simpler and more interpretable [7][11]

Research Methodology
- The researchers designed a series of simple algorithmic tasks to evaluate interpretability, identifying the "circuit" for each task [13][18]
- A "circuit" is defined as the smallest computational unit that lets the model perform a specific task, represented as a graph of nodes and edges [15][16]

Example of Circuit
- An example task is predicting the correct closing quote for a string in Python, showing how the model remembers the type of the opening quote to complete the string [19][22]

Findings and Implications
- Larger, sparser models can produce increasingly powerful capabilities while keeping circuits simple [26]
- This suggests the method could be extended to understand more complex behaviors in models [27]

Current Limitations
- Sparse models remain far smaller than state-of-the-art models and still contain many "black box" elements [30]
- Training efficiency for sparse models is currently low; two proposed remedies are extracting sparse circuits from existing dense models or developing more efficient training techniques [31][32]
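The closing-quote task from the example can be stated as plain code, which makes clear why its circuit is so small: the model only needs to carry one bit forward, namely which quote character opened the literal. The function below is an illustration of the task, not the learned circuit itself.

```python
# The Python string-literal task from the example, stated as plain code:
# the correct closing quote is whichever quote character opened the
# literal. A circuit solving this only has to remember that one choice.

def closing_quote(literal_prefix: str) -> str:
    """Return the quote character that must close `literal_prefix`."""
    opener = literal_prefix[0]
    assert opener in ("'", '"'), "expected a Python string literal"
    return opener

print(closing_quote('"hello wor'))  # → "
print(closing_quote("'hello wor"))  # → '
```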
Counterintuitive: MoE Mixture-of-Experts Models Have Little to Do with Scenarios
理想TOP2· 2025-08-28 16:01
Core Viewpoint
- The MoE (Mixture of Experts) model is fundamentally a sparse-activation mechanism aimed at improving computational efficiency, not a model in which each expert corresponds to a specific scenario [1][2]

Group 1: Scene Limitations
- Having multiple MoE experts does not mean each handles only a specific scene; it is impractical to train separate models for each scenario under the one-model paradigm [1]
- If models are divided by scene, that is not a true MoE structure [1]

Group 2: Uniform Distribution
- If only one type of scenario is run, a significant portion of the model's parameters may sit unused, leading to inefficiency [2]
- It is more effective to distribute load evenly among experts than to assign specific experts to specific tasks, since low-usage experts may not justify their inclusion [2]

Group 3: Multiple Experts Activation
- A MoE model can activate multiple experts simultaneously, spreading computational load more evenly and addressing more complex problems effectively [2]
- The essence of MoE is that only a small fraction of parameters significantly influences each output, making it a sparse model that improves computational efficiency [2]

Group 4: Understanding the Model
- Describing different experts as suited to specific scenarios is a simplification that aids understanding, but it does not reflect the model's intentional design [3]
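The routing behavior the article describes can be sketched in a few lines: a learned gate scores all experts per token and only the top-k run, so experts are selected by score, not by pre-assigned "scenario". Shapes, k, and the softmax-over-chosen-experts mixing rule are illustrative assumptions, not any particular production model.

```python
import numpy as np

# Sketch of sparse MoE routing: the router picks the top-k experts per
# token by gate score, so only a small fraction of parameters touches
# each input. All sizes here are illustrative.

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 16

W_gate = rng.standard_normal((DIM, NUM_EXPERTS))
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]

def moe_layer(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ W_gate                        # gate score per expert
    top = np.argsort(logits)[-TOP_K:]          # indices of top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over chosen experts
    # Only TOP_K of NUM_EXPERTS experts run: sparse compute, dense capacity.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top)), top

token = rng.standard_normal(DIM)
out, chosen = moe_layer(token)
print(out.shape, sorted(chosen.tolist()))
```

Note that which experts fire depends entirely on the token's gate scores; nothing in the routing ties an expert to a named scene, which is the article's point.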
Jeff Dean: Within a Year, AI Will Replace Junior Engineers; Netizens: "Altman Only Makes Empty Promises; What Jeff Says Is What's Deadly"
AI前线· 2025-05-28 05:17
Core Insights
- Jeff Dean, a prominent figure in AI, predicts that AI systems capable of functioning like junior engineers will be available within a year [1][15][16]
- The conversation highlights AI's transformative potential in software development and the broader implications for the job market [4][10]

Group 1: AI Development and Trends
- AI has been evolving for over a decade, with major advances in neural networks and machine learning since 2012 [5][6]
- The mantra "larger models, more data, better results" has held for the past 12 to 15 years, pointing toward increasingly capable AI systems [6][8]
- The emergence of multi-modal AI, capable of processing varied input formats, is seen as a crucial industry trend [6][8]

Group 2: AI Capabilities and Applications
- AI agents are expected to perform tasks traditionally requiring human intervention, with a clear path to stronger capabilities through reinforcement learning [7][8]
- Developing large models requires significant investment, leading to a market in which only a few advanced models will survive [9][10]
- AI's potential to revolutionize education and other fields is highlighted, including examples of AI generating educational content from video inputs [11][12]

Group 3: Hardware and Infrastructure
- Specialized machine-learning hardware is critical, with Google's TPU project a significant development in this area [17][20]
- Future computing infrastructure is expected to adapt to the demands of running large-scale neural networks efficiently [22][23]
- The distinction between training and inference workloads is emphasized, suggesting that different solutions may be required for each [23][24]

Group 4: Future of AI Models
- Sparse models, which use different parts of the model for specialized tasks, are viewed as a promising direction for future AI development [26][27]
- Dynamic scaling, allowing new parameters to be added and resources allocated efficiently, is proposed as a more organic approach to AI learning [27][28]
Jeff Dean: Within a Year, AI Will Replace Junior Engineers; Netizens: "Altman Only Makes Empty Promises; What Jeff Says Is What's Deadly"
Xin Lang Cai Jing· 2025-05-18 22:46

Group 1
- Jeff Dean predicts that within a year, AI systems capable of operating 24/7 with "junior engineer" abilities will be available [1][14][15]
- Dean emphasizes the significant advances in AI since 2012, particularly in neural networks and their applications across tasks [4][6][7]
- AI's evolution is marked by improvements in algorithms and hardware, leading to larger models and enhanced capabilities [6][22]

Group 2
- The software-development job market may be transformed by AI engineers that outperform human engineers on certain tasks [4][8]
- Dean discusses the importance of specialized machine-learning hardware, highlighting Google's TPU project and the need for efficient computation [16][19]
- Future AI models may involve sparse architectures that use different parts of the model for specialized tasks, enhancing efficiency significantly [24][25]
Group 1 - Jeff Dean predicts that within a year, AI systems capable of operating 24/7 with "junior engineer" abilities will be available [1][14][15] - Dean emphasizes the significant advancements in AI, particularly in neural networks and their applications across various tasks since 2012 [4][6][7] - The evolution of AI is marked by improvements in algorithms and hardware, leading to larger models and enhanced capabilities [6][22] Group 2 - The industry is witnessing a potential transformation in the software development job market due to the rise of AI engineers who can outperform human engineers in certain tasks [4][8] - Dean discusses the importance of specialized hardware for machine learning, highlighting Google's TPU project and the need for efficient computation [16][19] - The future of AI models may involve sparse models that utilize different parts of the model for specialized tasks, enhancing efficiency significantly [24][25]