Word is, everyone is going all in on post-training? The best guide is here
机器之心· 2025-10-09 02:24
Core Insights
- The article emphasizes the shift in focus from pre-training to post-training in large language models (LLMs), highlighting the diminishing returns of scaling laws as model sizes reach hundreds of billions of parameters [2][3][11]

Group 1: Importance of Post-Training
- Post-training is recognized as a crucial phase for enhancing the reasoning capabilities of models like OpenAI's o-series, DeepSeek R1, and Google Gemini, marking it as a necessary step toward advanced intelligence [3][11]
- The article introduces various innovative post-training methods such as Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and Reinforcement Learning with Verifiable Rewards (RLVR) [2][3][12]

Group 2: Transition from Pre-Training to Post-Training
- The evolution from pre-training to instruction fine-tuning is discussed: foundational models are trained on large datasets to predict the next token, but often lack practical utility in real-world applications [7][8]
- Post-training aims to align model behavior with user expectations, favoring quality over quantity in the datasets used, which are typically smaller but more refined than pre-training datasets [11][24]

Group 3: Supervised Fine-Tuning (SFT)
- Supervised Fine-Tuning (SFT) is described as a process that transforms a pre-trained model into one that can follow user instructions effectively, relying on high-quality instruction-answer pairs [21][24]
- The quality of the SFT dataset is critical, as even a small number of low-quality samples can degrade the model's performance [25][26]

Group 4: Reinforcement Learning Techniques
- Reinforcement Learning (RL) is highlighted as a complex yet effective method for model fine-tuning, with reward mechanisms such as RLHF, RLAIF, and RLVR employed to enhance model performance [39][41]
- The article outlines the importance of reward models in RLHF, which are trained on human preference data to guide model outputs (a minimal sketch of this pairwise training loss follows this summary) [44][46]

Group 5: Evaluation of Post-Training Models
- Evaluating post-trained models is multifaceted, requiring a combination of automated and human assessments to capture different quality aspects [57][58]
- Automated evaluations are cost-effective and quick, while human evaluations capture subjective quality, especially for nuanced tasks [59][60]
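To make the reward-model point in Group 4 concrete, the sketch below shows the pairwise Bradley-Terry loss commonly used to train RLHF reward models on human preference data. It is a minimal illustration, not code from the guide; the function name and the toy reward values are assumptions for demonstration.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy scalar rewards for three preference pairs (illustrative values only).
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.5, -1.0])
print(reward_model_loss(chosen, rejected))  # smaller when preferred responses score higher
```

In practice the chosen/rejected scores would come from the reward model scoring full responses; minimizing this loss pushes human-preferred responses above rejected ones, and the trained scorer then guides RL fine-tuning.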
Is DeepSeek V3.2 on the way?
Guancha.cn· 2025-09-29 09:58
On September 29, a page for DeepSeek-V3.2 briefly appeared on the open-source community platform Hugging Face, prompting speculation among netizens. The relevant Hugging Face page currently shows an error, and as of press time DeepSeek had not responded. DeepSeek's most recent update came a week earlier, on September 22, when it released the DeepSeek-V3.1-Terminus model on its official API platform, announced that the model was open-sourced, and published the download link for the open-source version. According to Guancha.cn's records, DeepSeek has a history of shipping new versions and updates on the day before a holiday: DeepSeek V3 was released on December 27, 2024, just before New Year's Day, and DeepSeek-R1-0528 was released on May 28, 2025, ahead of the Dragon Boat Festival, with the company calling it a special Dragon Boat Festival offering. ...
Who says the Scaling Law has hit its ceiling? New research: tiny per-step improvements bring exponential gains
36Kr· 2025-09-16 07:46
Core Insights
- The Scaling Law is being questioned due to perceived diminishing returns in model training, but recent research suggests that small improvements in accuracy can lead to exponential growth in task completion length, which may hold more economic value in real-world applications (see the sketch after this summary) [1][2][4]

Group 1: Research Findings
- A recent paper from Cambridge University indicates that while there are diminishing returns in metrics like test loss, the real-world value of large language models (LLMs) often comes from their ability to complete longer tasks [2][4]
- The paper highlights that long-horizon task execution has been a significant weakness of deep learning, with LLMs struggling to perform complex, lengthy tasks despite improvements in reasoning capabilities [4][6]
- The authors propose that failures on long tasks stem primarily from execution challenges rather than reasoning or planning limitations, emphasizing the need for more focus on execution capabilities in LLM research [6][20]

Group 2: Experimental Insights
- The study measures LLMs' long-horizon execution capabilities by isolating execution from planning and knowledge retrieval, revealing that larger models can significantly increase the number of rounds executed successfully [6][23][25]
- The concept of self-conditioning is introduced, where the model's performance deteriorates as it builds on its previous errors, leading to declining accuracy over multiple rounds [8][26][30]
- The research shows that while increasing model size improves task execution, it does not alleviate the self-conditioning effect, which remains a challenge for LLMs on long-horizon tasks [27][30]

Group 3: Implications for Investment
- The findings suggest that the economic value of LLMs may not be accurately reflected in short-task benchmarks, as the ability to complete longer tasks is a more reliable indicator of their potential [18][20]
- The paper encourages continued investment in scaling models, as the ability to perform longer tasks could justify further financial commitment even when short-term performance metrics suggest stagnation [10][18]
- The research calls for new benchmarks that better assess the execution depth of models, highlighting a potential area for future investment and development in the AI sector [10][18]
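As a back-of-the-envelope illustration of the compounding argument: if a model succeeds at each step independently with probability p, the longest task it completes at a fixed end-to-end success rate grows like log(target) / log(p). The sketch below is my own illustration under that independence assumption, not the paper's code or data.

```python
import math

def horizon(step_accuracy: float, target_success: float = 0.5) -> float:
    """Longest task (in steps) completable at the target end-to-end success rate,
    assuming independent per-step successes with probability step_accuracy."""
    return math.log(target_success) / math.log(step_accuracy)

for p in (0.90, 0.95, 0.99, 0.999):
    print(f"per-step accuracy {p:.3f} -> ~{horizon(p):.0f}-step horizon at 50% success")
```

Under this assumption, raising per-step accuracy from 99% to 99.9% stretches the 50%-reliability horizon from roughly 69 steps to roughly 693, which is the kind of exponential payoff in task length the research describes.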
Who says the Scaling Law has hit its ceiling? New research: tiny per-step improvements bring exponential gains
机器之心· 2025-09-16 04:01
Core Viewpoint
- The article discusses the ongoing debate regarding the diminishing returns of scaling models in AI, particularly in the context of large language models (LLMs). It presents a new perspective: despite slower improvements in single-step accuracy, these incremental gains can lead to exponential growth in task completion length, which may hold greater economic value in real-world applications [1][3]

Group 1: Scaling Law and Economic Value
- The scaling law indicates that while there may be diminishing returns in metrics like test loss, the real-world value of LLMs often comes from their ability to complete longer tasks. Larger models can compound small improvements in single-step accuracy, resulting in exponential increases in task length [3][6]
- The paper, titled "The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs", argues that the economic value of an AI agent derives from the length of tasks it can complete, rather than from short-task benchmarks that may suggest stagnating progress [5][19]

Group 2: Long-Horizon Execution Challenges
- Long-horizon task execution has historically been a significant weakness of deep learning models. The paper highlights that while LLMs have improved at complex reasoning tasks, they still struggle to execute longer tasks reliably [6][11]
- The authors propose that failures in long-horizon execution are often misattributed to reasoning or planning deficiencies, when in fact execution remains a critical and under-researched challenge [7][22]

Group 3: Self-Conditioning Effect
- The study identifies a self-conditioning effect in which the error rate on long tasks increases with each step, as mistakes compound. This contrasts with human performance, where practice typically leads to improvement (a toy simulation of this effect follows this summary) [9][30]
- The authors found that larger models do not necessarily mitigate the self-conditioning effect, which can lead to declining performance over extended tasks [29][32]

Group 4: Impact of Thinking Models
- Recent thinking models have shown the ability to correct for self-conditioning limitations, allowing significantly longer task execution in a single round. For instance, the thinking version of GPT-5 can execute over 1,000 steps, far surpassing competitors [10][36]
- The research emphasizes the importance of reasoning before acting, as models that use thinking chains execute longer tasks better than those that do not [36][37]

Group 5: Experimental Insights
- The experiments reveal that increasing model size significantly raises the number of rounds a model can execute successfully, demonstrating a clear scaling trend [27][28]
- The findings suggest that while larger models improve task execution, they still face challenges from self-conditioning, which remains a critical area for future research [29][37]
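To see why self-conditioning matters over long horizons, the toy Monte Carlo sketch below compares a constant per-step accuracy against one that degrades slightly after each earlier mistake, a crude stand-in for a model conditioning on its own errors. The dynamics and numbers are illustrative assumptions, not the paper's experimental setup.

```python
import random

def final_step_accuracy(steps: int, base_acc: float, penalty: float, trials: int = 3000) -> float:
    """Accuracy on the last step of a long task when each earlier error
    lowers the per-step accuracy by `penalty` (toy self-conditioning)."""
    correct_last = 0
    for _ in range(trials):
        errors = 0
        ok = False
        for _ in range(steps):
            acc = max(0.0, base_acc - penalty * errors)
            ok = random.random() < acc
            if not ok:
                errors += 1
        correct_last += ok
    return correct_last / trials

random.seed(0)
for penalty in (0.0, 0.05):  # 0.0 = no self-conditioning
    rates = [round(final_step_accuracy(n, 0.99, penalty), 2) for n in (10, 100, 400)]
    print(f"penalty={penalty}: last-step accuracy over 10/100/400 steps -> {rates}")
```

With no penalty the last-step accuracy stays near the base rate regardless of task length; with even a small penalty per prior error, accuracy collapses as the run gets longer, mirroring the compounding failure mode the paper attributes to self-conditioning.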
Why has GPT-5 stopped "making things up"? OpenAI's new paper explains it
腾讯研究院· 2025-09-12 08:58
Core Viewpoint
- The article discusses the advances and remaining challenges of OpenAI's GPT-5, focusing on the significant reduction in hallucination rates compared with previous models, while also examining the underlying mechanisms and implications of these changes [5][6][25]

Group 1: Hallucination Rates and Mechanisms
- GPT-5's hallucination rate is approximately 45% lower than GPT-4's and about 80% lower than that of OpenAI's earlier models [6]
- The reduction in hallucination rates is attributed to enhanced reinforcement learning techniques that allow models to refine their reasoning processes and recognize their own errors [8][9]
- The paper published by OpenAI argues that hallucinations are an inevitable byproduct of the statistical-learning nature of language models, since generating reliable information is harder than assessing its reliability [12][16]

Group 2: Theoretical Framework
- OpenAI introduces a theoretical "Is-It-Valid" (IIV) judgment mechanism that determines the validity of generated sentences based on their internal probabilities [13]
- The model's tendency to generate plausible-sounding but incorrect information is exacerbated by data sparsity, complexity, and noise in the training data [14][16]
- The paper's mathematical conclusion is that the error rate of generative models is at least double the IIV judgment error rate, indicating that judgment mistakes compound into hallucinations [15][16]

Group 3: Post-Training Challenges
- Post-training has not effectively mitigated hallucinations, because current evaluation metrics tend to reward models for giving confident but potentially incorrect answers [18][24]
- The article criticizes the binary scoring systems used in mainstream AI evaluations, which penalize uncertainty and discourage models from saying "I don't know" (a worked example of this incentive follows this summary) [21][24]
- Reinforcement learning processes that rely on binary reward signals may inadvertently promote overconfidence in models, increasing hallucination rates [27][29]

Group 4: Future Directions and Solutions
- Introducing a penalty-based scoring mechanism during post-training could help models better calibrate their confidence and reduce hallucinations [33]
- A shift from score optimization toward a truth-oriented approach is proposed as a potential solution to the hallucination problem [34]
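The incentive problem with binary scoring can be shown with a small worked example (my own numbers, not figures from the paper): when "I don't know" earns nothing and wrong answers cost nothing, an uncertain model maximizes its expected score by guessing confidently; adding a penalty for wrong answers flips that incentive.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of always answering, when a correct answer earns 1 point,
    a wrong answer loses `wrong_penalty`, and abstaining ("I don't know") earns 0."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

p = 0.3  # the model is unsure: only a 30% chance its best guess is right
print(expected_score(p, wrong_penalty=0.0))  # 0.30 > 0: guessing beats abstaining under 0/1 scoring
print(expected_score(p, wrong_penalty=1.0))  # -0.40 < 0: with a penalty, abstaining is optimal
```

This is the calibration argument in Group 4: once wrong answers are penalized relative to abstaining, the scoring-optimal policy starts to track the model's actual confidence instead of rewarding overconfident guesses.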
After Nano Banana's breakout success, a mysterious "Carrot" coding model has come online
机器之心· 2025-09-05 04:31
Core Viewpoint
- The article discusses the trend of naming AI models after fruits and vegetables, highlighting how creative names can enhance the visibility and appeal of these models, with examples from OpenAI, Google, and other companies [2][4][6]

Group 1: Naming Trends
- OpenAI initiated the trend by naming a model "Strawberry," which sparked widespread discussion among users [2]
- Other companies like Recraft and Google then adopted similar naming conventions, with models like "red_panda" and "Nano Banana" gaining popularity [4]
- The latest addition to this trend is a model named "Carrot," which is noted for its strong coding capabilities [5][6]

Group 2: Model Capabilities
- The "Carrot" model, developed by Anycoder, showcases impressive programming abilities, such as creating games where carrots act as projectiles [10]
- Other notable models mentioned include DeepSeek V3, Gemini 2.5 Pro, Grok-4, and GPT-5, indicating a competitive landscape in AI model development [8]
- The article highlights user-generated content demonstrating these models' capabilities, such as generating animations and interactive applications [14][18]

Group 3: Speculations and Community Engagement
- The community is actively speculating about the origins of the "Carrot" model, with guesses ranging from Google to Alibaba's Qwen3 series [19][21]
- The article encourages readers to engage in discussions about the identity of the "Carrot" model, reflecting vibrant community interest in AI developments [22]
Artificial intelligence industry special report: exploring the progress and boundaries of model capabilities and applications
Guoxin Securities· 2025-08-25 13:15
Investment Rating
- The report maintains an "Outperform" rating for the artificial intelligence industry [2]

Core Insights
- The report focuses on the progress and boundaries of model capabilities and applications, highlighting the differentiated development of overseas models and enterprises' cost-effectiveness considerations [4][5]
- Interest-based recommendation has emerged as the most significant application scenario for AI empowerment, particularly in the advertising and gaming industries [4][6]
- The competitive relationship between models and application enterprises is explored through five typical scenarios, indicating a shift in market dynamics [4][6]

Summary by Sections

Model Development and Market Share
- Overseas models, particularly those from Google and Anthropic, dominate the market with significant shares due to their competitive pricing and advanced capabilities [9][10]
- Domestic models are making steady progress, with no significant technological gaps observed among the various players [9][10]

Application Scenarios
- Interest-based recommendation in advertising has shown substantial growth, with companies like Meta, Reddit, Tencent, and Kuaishou leveraging AI technologies to enhance ad performance [4][6]
- The gaming sector, exemplified by platforms like Roblox, has also benefited from AI-driven recommendation algorithms, giving new games greater exposure [4][6]

Competitive Dynamics
- The report identifies five scenarios illustrating the competition between large models and traditional products, emphasizing AI's transformative impact on existing business models [4][6]
- The analysis suggests that AI products may replace traditional revenue streams, while also enhancing operational efficiency in areas like programming and customer service [4][6]

Investment Recommendations
- The report recommends investing in Tencent Holdings (0700.HK), Kuaishou (1024.HK), Alibaba (9988.HK), and Meitu (1357.HK) due to their potential for performance release driven by enhanced model capabilities [4]
Hands-on with DeepSeek V3.1: more than just a longer context
自动驾驶之心· 2025-08-21 23:34
Core Viewpoint
- The article discusses the differences between DeepSeek V3.1 and its predecessor V3, highlighting improvements in programming capabilities, creative writing, translation quality, and response tone.

Group 1: Model Comparison
- DeepSeek V3.1 has extended its context length to 128K tokens, compared with V3's 65K tokens, allowing for more comprehensive responses [10]
- The new version shows significant enhancements across various tasks, including programming, creative writing, translation, and knowledge application [3][4]

Group 2: Programming Capability
- In a programming test, V3.1 provided a more comprehensive solution for compressing GIF files, considering more factors and providing detailed usage instructions [12][13][14]
- V3.1 was notably faster in executing the task compared to V3 [18]

Group 3: Creative Writing
- For a creative writing task based on a high school essay prompt, V3.1 produced a more poetic and emotional response, contrasting with V3's more straightforward style [22]

Group 4: Translation Quality
- In translating a scientific abstract, V3.1 demonstrated a better understanding of complex sentences, although it missed translating a simple word, indicating room for improvement [30]

Group 5: Knowledge Application
- Both versions answered a niche question about a specific fruit type, with V3.1 showing some inconsistencies in terminology and relevance [31][37]

Group 6: Performance Metrics
- V3.1 achieved a score of 71.6% on the Aider benchmark, outperforming Claude Opus 4 while being significantly cheaper [43]
- On SVGBench, V3.1 was noted as the best variant among its peers, although it still did not surpass the best open models [44]

Group 7: User Feedback
- Users have reported various observations about the new features and performance of V3.1, including improvements in physical understanding and the introduction of new tokens [45][47]
Hands-on with DeepSeek V3.1, more than just a longer context
量子位· 2025-08-20 07:48
Core Insights
- The article discusses the differences between DeepSeek V3.1 and its predecessor V3, highlighting improvements in programming performance, creative writing, translation quality, and response tone [2][6][40]

Group 1: Model Features
- DeepSeek V3.1 has expanded the context length to 128K tokens, while V3 tops out at 65K tokens [8][7]
- The new version supports multiple tensor formats, enhancing its usability across different platforms [1][6]
- The API for V3 still operates with a maximum context length of 65K tokens, underscoring the significance of the upgrade in V3.1 [7][8]

Group 2: Performance Comparison
- In programming tasks, V3.1 demonstrated a more comprehensive approach, providing detailed code and usage instructions compared to V3 [12][13]
- For creative writing, V3.1 produced a more poetic and emotional response, contrasting with V3's straightforward style [20][18]
- Both versions successfully solved a mathematical problem, but their presentation styles differed, with V3.1 offering a clearer explanation [23][24]

Group 3: Translation Capabilities
- V3.1 showed improved understanding of complex sentences in translation tasks, although it missed translating some simple words [29][26]
- The translation of a biology paper's abstract revealed V3.1's enhanced handling of specialized terminology compared to V3 [28][27]

Group 4: Knowledge and Reasoning
- In a knowledge-based query about a specific fruit type, both versions identified it as a drupe, but V3.1's reasoning strayed off-topic [30][36]
- V3.1 achieved a score of 71.6% on the Aider benchmark, outperforming V3 and indicating its competitive edge in non-reasoning tasks [42][40]

Group 5: User Feedback and Market Response
- The release of V3.1 has generated significant interest, becoming a trending topic on social media platforms [40][41]
- Users have noted improvements in physical understanding and the introduction of new tokens, although some issues with the online API have been reported [45][49]
A 10,000-word breakdown of the DeepSeek MoE architecture!
自动驾驶之心· 2025-08-14 23:33
Core Viewpoint
- The article provides a comprehensive overview of the Mixture of Experts (MoE) architecture, focusing on the evolution and implementation of DeepSeek's MoE models (V1, V2, V3) and their optimizations for token distribution and load balancing [2][21][36]

Group 1: MoE Architecture Overview
- MoE, or Mixture of Experts, is a model architecture that uses multiple expert networks to enhance performance, particularly in sparse settings suited to cloud computing [2][3]
- Interest in the MoE architecture surged with the release of Mistral AI's Mixtral model, which highlighted the potential of sparse architectures in AI [2][3]
- The Switch Transformer model introduced a routing mechanism that lets tokens select the top-K experts, optimizing the processing of diverse knowledge [6][10]

Group 2: DeepSeek V1 Innovations
- DeepSeek V1 addresses two main issues in existing MoE practice, knowledge mixing and knowledge redundancy, both of which hinder expert specialization [22][24]
- The model introduces fine-grained expert segmentation and shared experts to enhance specialization and reduce redundancy, allowing knowledge to be captured more efficiently [25][26]
- The architecture includes a load-balancing mechanism to ensure an even distribution of tokens across experts, mitigating training inefficiencies [32]

Group 3: DeepSeek V2 Enhancements
- DeepSeek V2 builds on V1's design with three key optimizations focused on load balancing [36]
- The model limits the number of devices used for routed experts to reduce communication overhead, improving efficiency during training and inference [37]
- A new communication load-balancing loss is introduced to ensure an equitable token distribution across devices, further optimizing performance [38]

Group 4: DeepSeek V3 Developments
- DeepSeek V3 changes the MoE layer computation, replacing the softmax function with a sigmoid function to improve computational efficiency (a simplified routing sketch follows this summary) [44]
- The model eliminates the auxiliary load-balancing losses, instead using a learnable bias term to control routing, which improves load balancing during training [46]
- A sequence-level auxiliary loss is added to prevent extreme imbalances within individual sequences, ensuring a more stable training process [49]
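For a concrete picture of the V3-style routing described in Group 4, here is a simplified sketch: sigmoid affinity scores, a bias term that only influences which experts are selected (the auxiliary-loss-free balancing idea), and shared experts that every token passes through. The dimensions, expert counts, and dense expert computation are illustrative assumptions, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    """Toy MoE layer: shared experts used by every token, plus top-k routed
    experts chosen from sigmoid affinity scores with a bias that affects
    selection only (a simplified take on the V3-style balancing described above)."""

    def __init__(self, dim: int = 64, n_routed: int = 8, n_shared: int = 1, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_routed, bias=False)
        self.expert_bias = nn.Parameter(torch.zeros(n_routed))  # adjusted to balance expert load
        self.routed = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_routed))
        self.shared = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_shared))

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (tokens, dim)
        scores = torch.sigmoid(self.router(x))                # per-expert affinities
        _, idx = (scores + self.expert_bias).topk(self.top_k, dim=-1)  # bias steers selection only
        mask = torch.zeros_like(scores).scatter(-1, idx, 1.0)
        gates = scores * mask                                  # gate weights use the unbiased scores
        gates = gates / gates.sum(dim=-1, keepdim=True)
        # Dense compute over all experts for clarity; real systems dispatch tokens sparsely.
        routed_out = torch.stack([e(x) for e in self.routed], dim=1)   # (tokens, n_routed, dim)
        routed_out = (gates.unsqueeze(-1) * routed_out).sum(dim=1)
        shared_out = sum(e(x) for e in self.shared)            # shared experts are always active
        return shared_out + routed_out

print(SimpleMoE()(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```

Because the bias enters only the top-k selection and not the gate values, nudging it up or down changes how often an expert is chosen without distorting the weights applied to its output, which is the intuition behind replacing an auxiliary balancing loss with a bias adjustment.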