Qwen3
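Ads injected into code: are Tencent's Codebuddy and its peers taking the blame? After DeepSeek's "极你太美" incident, can other models escape either?
36Kr · 2025-08-27 07:44
Afterwards, the user who found the Codebuddy problem said in the comments: "It's a bug introduced by the DeepSeek model. Tencent has already reported the issue, and it will be fixed later." For both Codebuddy and Trae, the root cause points to DeepSeek's latest release, V3.1. In fact, a day earlier, developer notdba had posted on Reddit that, after running some tests with DeepSeek V3.1, they found the model generating the following tokens in completely unexpected places: "At first I thought it was caused by the extreme IQ1_S quantization I used, or by some edge case in the imatrix calibration dataset. But when I tested the FP8 full-precision model served by Fireworks, the same problem appeared." notdba added that these extreme tokens also keep turning up in other unexpected places as the second or third choice. Example 1: (local ik_llama.cpp, parameters top_k=1, temperature=1) Expected output: time.Second Yesterday, a user posted on social media that, while reviewing content rewritten by Tencent's Codebuddy during UI development, they found an ad string written into the code: a function had been assigned a value for the Jisu Esports (极速电竞) app. "I can't take it anymore, uninstalling right away," the user ...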
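The Reddit example above was run with top_k=1 and temperature=1. As a minimal sketch, assuming a generic top-k sampler (this is not ik_llama.cpp's actual code, and the logits are invented for illustration), the snippet below shows why that setting matters: with top_k=1 only the highest-ranked token survives, so the temperature no longer has any effect, and a stray token in the output means the model ranked it first rather than merely sampling it by chance.

```python
import math
import random

def sample_top_k(logits, top_k, temperature, rng):
    """Sample one token from the top_k highest logits after temperature scaling."""
    # Rank tokens by logit, highest first, and keep only the top_k candidates.
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the surviving candidates (shifted by the max for stability).
    scaled = [value / temperature for _, value in ranked]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [token for token, _ in ranked]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits at the position where `time.Second` is expected.
logits = {"Second": 5.0, "Minute": 4.2, "Hour": 3.9, "Millisecond": 1.0}
rng = random.Random(0)

# With top_k=1 the result is the top-ranked token at every temperature.
greedy = [sample_top_k(logits, top_k=1, temperature=t, rng=rng) for t in (0.5, 1.0, 2.0)]

# With top_k=3 the second and third choices can also be emitted.
varied = {sample_top_k(logits, top_k=3, temperature=1.0, rng=rng) for _ in range(200)}
```

Under this reading, a bad token that appears "as the second or third choice" would stay hidden at top_k=1 but could surface whenever a wider sampling setting is used.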
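AI industry special report: exploring the progress and boundaries of model capabilities and applications
Guosen Securities · 2025-08-25 13:15
August 25, 2025. Securities research report | AI industry special report (11): Exploring the progress and boundaries of model capabilities and applications. Industry research · Industry special report. Internet · Internet II. Investment rating: Outperform (maintained). Securities analysts: Zhang Lunke (张伦可), Chen Shuyuan (陈淑媛), Liu Zitan (刘子谭), Zhang Haochen (张昊晨). Phone: 0755-81982651, 021-60375431. Email: liuzitan@guosen.com.cn, zhanghaochen1@guosen.com.cn, zhanglunke@guosen.com.cn, chenshuyuan@guosen.com.cn. License numbers: S0980525060001, S0980525010001, S0980521120004, S0980524030003. Please be sure to read the disclaimer following the main text and everything under it. Report summary. Risk warning: macroeconomic volatility, advertising growth falling short of expectations, intensifying industry competition, AI technology progress falling short of expectations, and similar risks. This report focuses on model development in China and overseas and explores the progress and boundaries of model capabilities and applications. We believe overseas models are currently developing along differentiated paths, and enterprises weigh cost-effectiveness when deciding which models to call. OpenAI currently holds a relative lead in its technical path, focusing on strengthening reasoning and professional ...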
NVIDIA open-sources a 9B-parameter small model that is 6 times faster than Qwen3
QbitAI (量子位) · 2025-08-19 05:25
Core Insights
- The article covers the rise of small AI models, centered on NVIDIA's new small language model, Nemotron Nano v2, which is designed to handle complex reasoning tasks efficiently [1][3][7].
Group 1: Model Features and Performance
- Nemotron Nano v2 is a 9-billion-parameter model that matches or exceeds the accuracy of the leading open-source model Qwen3-8B on complex reasoning benchmarks while running 6 times faster [1][7].
- The model supports a "reasoning trace" feature: it generates its reasoning process before giving a final answer, which improves response quality, especially on complex tasks [8][11].
- Users can set a "thinking budget", the number of tokens the model may spend on reasoning, to trade answer quality against latency and cost [10][12].
Group 2: Training and Data
- The model was pre-trained on over 20 trillion tokens, using FP8 precision and a Warmup-Stable-Decay learning rate schedule [19].
- Post-training combined several techniques, including supervised fine-tuning and reinforcement learning from human feedback, with about 5% of the data containing intentionally truncated reasoning traces [21].
- NVIDIA has also released a significant portion of the training data, including a diverse pre-training dataset of 6.6 trillion tokens spanning multiple categories [26][23].
Group 3: Open Source Strategy
- NVIDIA's approach contrasts with other tech giants that are moving towards closed-source models: it emphasizes an open-source strategy built around the Nemotron ecosystem [30][32].
- The company has made significant strides in open-sourcing its models, which may influence the competitive landscape in AI development [29][33].
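The Warmup-Stable-Decay schedule named above can be sketched as a three-phase function of the training step. This is a minimal illustration, assuming linear warmup and linear decay; the summary does not say which decay shape or which step counts NVIDIA used, so every number and parameter name below is hypothetical.

```python
def wsd_learning_rate(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, constant plateau, then linear decay."""
    if step < warmup_steps:
        # Phase 1: ramp up linearly to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Phase 2: hold the peak learning rate for most of training.
        return peak_lr
    # Phase 3: decay linearly from peak_lr to min_lr (assumed shape).
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress

# Hypothetical run: 100 steps, 10 warmup steps, 20 decay steps.
schedule = [wsd_learning_rate(s, 100, 1.0, 10, 20) for s in range(101)]
```

The long constant plateau is the distinguishing feature: unlike a cosine schedule, the total step count only matters once the decay phase begins.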
From GPT-2 to gpt-oss: an in-depth look at the evolution of OpenAI's open models
Synced (机器之心) · 2025-08-18 05:15
Core Insights
- OpenAI has released gpt-oss-120b and gpt-oss-20b, its first open-weight models since GPT-2 in 2019; thanks to several optimizations they can run locally [4][5].
- The article analyzes the architectural advances from GPT-2 to gpt-oss in detail and compares gpt-oss with Qwen3 [4][5].
Model Architecture Overview
- gpt-oss-20b can run on consumer-grade GPUs with 16 GB of memory, while gpt-oss-120b needs a single H100 GPU with 80 GB of memory or more [10].
- The gpt-oss architecture looks conventional, since leading LLM developers tend to use similar foundational architectures with minor adjustments [10][11].
Changes Since GPT-2
- Significant changes from GPT-2 include the removal of Dropout, the adoption of RoPE for positional encoding, and the replacement of GELU with Swish/SwiGLU [20][22][29].
- Mixture of Experts (MoE) increases parameter capacity while keeping inference efficient, because only a subset of experts is activated for each token [39].
- Grouped Query Attention (GQA) is a more efficient alternative to Multi-Head Attention (MHA) [41].
- Sliding window attention is applied in gpt-oss to reduce memory usage and computational cost [47].
- RMSNorm replaces LayerNorm for better efficiency in large-scale LLMs [52].
Comparison with Qwen3
- gpt-oss-20b has a wider architecture with more attention heads, while Qwen3 has a deeper architecture with more transformer blocks [69][70].
- gpt-oss uses fewer but larger experts, whereas Qwen3 uses more, smaller experts [72].
- Both models use grouped query attention, but gpt-oss adds sliding window attention to limit the context each token attends to [82].
Additional Insights
- gpt-oss models are reasoning models, and users can easily control how much reasoning effort they spend at inference time [93].
- The training compute for gpt-oss is estimated at 2.1 million H100 GPU hours, comparable to other large models [92].
- MXFP4 quantization allows gpt-oss models to run on a single GPU, improving accessibility [98].
- Benchmark results indicate that gpt-oss performs comparably to proprietary models, although it has not yet been tested extensively [101][106].
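Two of the mechanisms listed above are easy to show in a few lines: sliding window attention restricts each token to a recent window of keys, and grouped query attention lets several query heads share one key/value head. The sketch below is illustrative only; the sequence length, window size, and head counts are made up and are not the gpt-oss configuration.

```python
def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    # Causal (j <= i) and limited to the most recent `window` positions.
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Grouped-query attention: consecutive query heads share one key/value head."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

mask = sliding_window_causal_mask(seq_len=6, window=3)
# Number of keys visible to each query position; it stops growing at the window size.
attended = [sum(row) for row in mask]

# 8 query heads sharing 2 key/value heads: 4 query heads per KV head.
mapping = [kv_head_for_query_head(h, num_query_heads=8, num_kv_heads=2) for h in range(8)]
```

Both ideas shrink the KV cache: the window caps how many past positions must be kept per layer, and head sharing cuts the number of key/value heads stored per position.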
The three major US stock indexes open mixed; most chip stocks fall
Group 1
- The US stock market opened mixed: the Dow Jones rose 0.57%, the S&P 500 rose 0.11%, and the Nasdaq fell 0.02% [1].
- Most chip stocks declined. Applied Materials dropped over 13% after its Q4 earnings outlook fell short of analyst expectations, while Intel rose over 4% on reports of a potential US government investment [1].
Group 2
- WeRide announced an equity investment from Grab worth tens of millions of dollars, aimed at large-scale deployment of L4 Robotaxis and other autonomous vehicles in Southeast Asia [2].
Group 3
- Apple officially joined Xiaohongshu, hinting at a possible iPhone 17 series launch event on September 10, with invitations likely to go out around September 2 [3].
Group 4
- Tongyi Qianwen announced multiple product upgrades, including the upcoming launch of the Qwen-Image editing model [4].
Group 5
- Ford is recalling 41,875 Lincoln Aviator vehicles in the US because of rearview camera display malfunctions [5].
Group 6
- SoftBank Group announced that its subsidiary PayPay has filed for an IPO in the US; the timing, size, and pricing are yet to be determined [6].
GPT-5, finally released, and the 982 days in which it changed the world
36Kr · 2025-08-08 04:15
Core Insights
- GPT-5 was officially released on August 8, 2025, and quickly topped the LMArena leaderboard, ranking first in every category [3][7].
- The release advances AI capabilities, particularly in reasoning and agentic AI, although it is not a leap in performance over its predecessor GPT-4 [8][34].
- OpenAI introduced four versions of GPT-5 for different user needs and scenarios, including a lightweight version and a chat-specific version [9][11].
Group 1: GPT-5 Release and Features
- GPT-5 integrates capabilities from both the GPT series and the o series, and automatically selects the most suitable model for a given task [11][12].
- GPT-5 is priced competitively, with API costs lower than those of GPT-4, making it accessible for a wide range of applications [14][17].
- OpenAI aims to simplify the user experience by removing the need to pick a model, addressing the "choice paralysis" users faced [11][12].
Group 2: Market Context and Competitive Landscape
- The AI landscape is increasingly competitive: many companies are releasing open-source models, and the gap between open-source and closed-source models is narrowing [54][55].
- OpenAI's revenue has surged, reaching an annualized $12 billion by July 2025, driven largely by consumer subscriptions [48][50].
- Major tech companies such as Microsoft, Google, and Meta have also seen significant growth in market value and revenue on the back of AI advances [52][53].
Group 3: User Engagement and Adoption
- ChatGPT has reached 700 million weekly active users, reflecting its deep integration into daily life [42][45].
- The application has kept up strong growth, becoming the fastest app to reach 1 billion downloads and 500 million monthly active users [47].
- OpenAI's focus on user-friendly applications and real-world use cases has broadened the appeal of GPT-5 across sectors including education and healthcare [25][28].
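Does DeepSeek's GRPO cause model collapse? A look at GSPO, Qwen3's new paradigm
Synced (机器之心) · 2025-08-07 09:42
Report by Synced, from the Synced editorial team. As is well known, training a large language model usually happens in two stages. The first stage is "pre-training": developers train the model on large-scale text datasets so that it learns to predict the next word in a sentence. The second stage is "post-training", which aims to teach the model to better understand and follow human instructions. Post-training of LLMs looks like a special form of reinforcement learning, and the reinforcement learning (RL) algorithms used to fine-tune large language models (LLMs) keep developing along a clear evolutionary path. At first, OpenAI pioneered a technique called reinforcement learning from human feedback (RLHF) to improve ChatGPT. The core of RLHF is to have human annotators score multiple responses generated by the model and pick the best one as a training reference. The process works, but it is slow, expensive, and labor-dependent, and it usually requires a small but specialized data annotation team. DeepSeek's key innovation was to automate this step with RL: instead of relying on humans to evaluate responses one by one, the algorithm lets the model learn correct behavior on its own from "reward signals" obtained during exploration, which significantly lowers cost, improves efficiency, and ultimately delivers high performance at a lower cost. OpenAI used Proximal Policy Optimization (PPO) to train ChatGPT. ...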
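The excerpt breaks off before it reaches the two algorithms in the headline. As a rough sketch of the distinction as it is commonly described, and not as a reconstruction of the article's own explanation: GRPO scores each response relative to the other responses sampled for the same prompt and applies a per-token importance ratio, while GSPO replaces the per-token ratios with a single length-normalized ratio for the whole sequence. The function names, the use of the population standard deviation, and all numbers below are assumptions for illustration.

```python
import math
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: each reward standardized within its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]

def token_level_ratios(new_logprobs, old_logprobs):
    """One importance ratio per token (the GRPO-style weighting)."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]

def sequence_level_ratio(new_logprobs, old_logprobs):
    """One length-normalized importance ratio per response (the GSPO-style weighting)."""
    mean_log_ratio = sum(n - o for n, o in zip(new_logprobs, old_logprobs)) / len(new_logprobs)
    return math.exp(mean_log_ratio)

# Hypothetical group of four responses to one prompt, rewarded 1 if correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical per-token log-probabilities under the new and old policies.
new_lp, old_lp = [-1.0, -2.0], [-1.5, -1.5]
per_token = token_level_ratios(new_lp, old_lp)
per_sequence = sequence_level_ratio(new_lp, old_lp)
```

In this toy case the two per-token ratios swing in opposite directions while the sequence-level ratio stays at 1, which illustrates why averaging over the sequence tends to give a less noisy weight.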
A hardcore teardown of large models, from DeepSeek-V3 to Kimi K2: mainstream LLM architectures in one read
Synced (机器之心) · 2025-08-07 09:42
Core Viewpoint
- The article traces the evolution of large language models (LLMs) over the past seven years, noting that capabilities have improved while the overall architecture has stayed largely the same. It asks whether there have been any disruptive innovations or only incremental advances within the existing framework [2][5].
Group 1: Architectural Innovations
- The article examines eight mainstream LLMs, including DeepSeek and Kimi, analyzing their architectural designs and innovations [5].
- DeepSeek V3, released in December 2024, introduced key architectural techniques that improved computational efficiency and set it apart from other LLMs [10][9].
- Multi-head latent attention (MLA) is a memory-saving strategy that compresses the key and value tensors into a lower-dimensional latent space, significantly reducing memory usage during inference [18][22].
Group 2: Mixture-of-Experts (MoE)
- The MoE layer in the DeepSeek architecture contains multiple parallel feedforward submodules, which greatly increases parameter capacity while sparse activation keeps inference cost low [23][30].
- Each MoE module in DeepSeek V3 has 256 experts and the model has 671 billion parameters in total, but only 9 experts are active per token during inference [30].
Group 3: OLMo 2 and Its Design Choices
- OLMo 2 is notable for the transparency of its training data and architecture, which makes it a useful reference for LLM development [32][34].
- Its architecture uses a distinctive normalization strategy, combining RMSNorm and QK-norm to improve training stability [38][46].
Group 4: Gemma 3 and Sliding Window Attention
- Gemma 3 uses sliding window attention to reduce the memory needed for the key-value (KV) cache, a shift towards local attention [53][60].
- Gemma 3 also uses a dual normalization strategy that combines Pre-Norm and Post-Norm [62][68].
Group 5: Mistral Small 3.1 and Performance
- Mistral Small 3.1, released in March 2025, outperforms Gemma 3 on several benchmarks, which the article attributes to its custom tokenizer and smaller KV cache [73][75].
- Mistral Small 3.1 adopts a standard architecture without the sliding window attention used in Gemma 3 [76].
Group 6: Llama 4 and MoE Adoption
- Llama 4 adopts an MoE architecture similar to DeepSeek V3, with notable differences in how experts are activated and in the overall design [80][84].
- MoE architectures have seen significant development and adoption in 2025, pointing to a trend towards more complex and capable models [85].
Group 7: Kimi K2 and Its Innovations
- Kimi K2, with 1 trillion parameters, is one of the largest LLMs and uses a variant of the Muon optimizer for better training performance [112][115].
- Its architecture is based on DeepSeek V3 and scales that design up, showing how LLM architectures continue to evolve [115].
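The sparse activation described for the MoE layer can be sketched as top-k routing: a router scores every expert, only the k best are evaluated, and their outputs are mixed by normalized weights. This is a generic toy version under stated assumptions (softmax over the selected scores, scalar "experts", 8 experts with k=2); it is not DeepSeek V3's actual gating, which the summary does not specify.

```python
import math

def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_scores)),
                 key=lambda e: router_scores[e], reverse=True)[:k]
    # Softmax over the selected scores only, so the weights sum to 1.
    peak = max(router_scores[e] for e in top)
    exp_scores = [math.exp(router_scores[e] - peak) for e in top]
    total = sum(exp_scores)
    return [(e, s / total) for e, s in zip(top, exp_scores)]

def moe_forward(x, experts, router_scores, k):
    """Weighted sum of the k selected experts' outputs; the others are never called."""
    return sum(weight * experts[e](x) for e, weight in route_top_k(router_scores, k))

# Toy setup: expert e simply multiplies its input by e + 1.
experts = [lambda x, scale=e + 1: scale * x for e in range(8)]
router_scores = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]

selected = route_top_k(router_scores, k=2)
output = moe_forward(3.0, experts, router_scores, k=2)

# With the figures quoted above, 9 of 256 experts run per token.
active_fraction = 9 / 256
```

The last line is the point of the design: capacity scales with the total number of experts, while per-token compute scales only with the few that are selected.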
The whole internet is testing GPT-oss, and its technical architecture has been picked apart too
QbitAI (量子位) · 2025-08-07 00:56
Core Insights
- The article highlights the strong performance of GPT-oss, which surpasses many existing open-source models and is poised to lead in the SaaS fast-fashion era [1][3][4].
Performance Testing
- GPT-oss has passed multiple performance tests and ranks at the top of several benchmarks, including GPQA Diamond, AIME 2024, AIME 2025, and Codeforces, ahead of models such as DeepSeek R1, Qwen3, and Llama 4 [5][6].
- On MMLU, the 120B model scored 85.9 at the low setting and 88 at the medium setting, while Qwen3-235B performed slightly better [6][7].
Model Architecture
- Compared with similar models, GPT-oss has a wider structure, more attention heads, and larger hidden dimensions, and it incorporates techniques such as attention bias units [22][24][26].
- The model keeps the core MoE Transformer architecture while optimizing performance and reducing complexity, which makes it well suited to open-source use [26][28].
Cost and Training
- Training GPT-oss-120B is estimated to have cost between $4.2 million and $23.1 million, and the 20B model between $420,000 and $2.3 million [30].
- There are signs that the model is weaker on non-English text, with a significant share of responses containing grammatical or spelling errors [30].
User Applications
- Users have begun exploring applications for GPT-oss, including integrating it into platforms for understanding academic papers and transforming data [17][19][20].
- The model is easy to access through platforms such as LM Studio and AWS, which speeds up the development of AI applications [33][34].
Community Engagement
- The article encourages readers to test GPT-oss and share their experiences, reflecting growing community interest in the model's capabilities [39].
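The quoted cost ranges are consistent with a simple GPU-hours-times-hourly-price estimate. The worked example below is a back-of-the-envelope sketch, not the article's stated method: it takes the 2.1 million H100 GPU-hour figure quoted for gpt-oss elsewhere on this page, assumes the 20B model used a tenth of that (inferred from the tenfold gap between the two ranges), and assumes hourly rates of $2 and $11, which are simply the values that reproduce the quoted bounds.

```python
def training_cost_usd(gpu_hours, usd_per_gpu_hour):
    """Estimated training cost: GPU hours multiplied by an hourly rental price."""
    return gpu_hours * usd_per_gpu_hour

H100_HOURS_120B = 2_100_000               # figure quoted for gpt-oss on this page
H100_HOURS_20B = H100_HOURS_120B // 10    # assumption: one tenth of the 120B compute
LOW_RATE, HIGH_RATE = 2, 11               # assumed USD per H100 hour

cost_120b = (training_cost_usd(H100_HOURS_120B, LOW_RATE),
             training_cost_usd(H100_HOURS_120B, HIGH_RATE))
cost_20b = (training_cost_usd(H100_HOURS_20B, LOW_RATE),
            training_cost_usd(H100_HOURS_20B, HIGH_RATE))
```

Under these assumptions the 120B range comes out at $4.2 million to $23.1 million and the 20B range at $420,000 to $2.31 million, matching the figures above after rounding.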
Supervised learning isn't dead: five hours of training on one problem and it takes off! Chinese scholars' new method unlocks LLM reasoning with 20x training efficiency
QbitAI (量子位) · 2025-08-04 07:00
Core Viewpoint
- The article presents One-Shot Critique Fine-Tuning (One-Shot CFT), which improves the reasoning of large language models (LLMs) with minimal data and compute, outperforming traditional reinforcement learning (RL) methods and small-scale supervised fine-tuning (SFT) [1][3][14].
Group 1: One-Shot CFT Methodology
- One-Shot CFT teaches models to reason by analyzing the quality of answers rather than merely imitating them, which provides a deeper learning signal [3][12].
- The process selects one representative task, generates multiple answers to it with various models, and has a more powerful model critique those answers; the critiques serve as the supervision signal for training [4][5].
- The whole training run needs only one question, its multiple answers, and their critiques, and takes roughly 5 GPU hours, far less than RL methods [5][14].
Group 2: Performance and Results
- In experiments, Qwen2.5-Math-7B gained 15% in accuracy after One-Shot CFT fine-tuning on a single question, surpassing both RL models and fully supervised fine-tuned models trained on tens of thousands of samples [9][10].
- The method performed strongly across mathematical and logical reasoning tasks, with accuracy gains of 10% to 16% on specific sub-tasks [10][11].
- One-Shot CFT was stable and reproducible across different tasks and model configurations, indicating robustness [11][13].
Group 3: Advantages of One-Shot CFT
- The method emphasizes critical learning: models learn why answers are correct or incorrect, which deepens learning compared with traditional SFT [12].
- Generating multiple answers and critiques for a single task provides multi-perspective input, closer to how humans learn [12].
- The training signal from critiques generalizes well, reducing the risk of overfitting and easing transfer to new tasks [12].
Group 4: Accessibility and Practical Implications
- The low computational cost makes One-Shot CFT accessible to individual researchers, resource-limited labs, and startups as a cost-effective way to improve reasoning [14][15].
- The whole pipeline is open-source, including training scripts, model parameters, and datasets, which lowers the barrier to replication and experimentation [17].
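The data-construction steps described above (one question, several candidate answers, one critique per answer) can be sketched as follows. This is a hypothetical illustration, not the released training scripts: the function name, the prompt template, and the stub solvers and critic are all invented here, and real solvers and the critic would be calls to language models.

```python
def build_one_shot_cft_examples(question, solvers, critic):
    """Build critique fine-tuning examples from a single question.

    solvers: callables mapping the question to a candidate answer.
    critic:  callable mapping (question, answer) to a written critique.
    """
    examples = []
    for solve in solvers:
        answer = solve(question)
        critique = critic(question, answer)
        examples.append({
            # The model is trained to produce the critique, not the answer.
            "prompt": (f"Question:\n{question}\n\n"
                       f"Candidate answer:\n{answer}\n\n"
                       "Critique this answer."),
            "target": critique,
        })
    return examples

# Stand-ins for real models (hypothetical): three fixed answers and a rule-based critic.
question = "What is 17 * 6?"
solvers = [lambda q: "102", lambda q: "96", lambda q: "112"]

def critic(q, answer):
    verdict = "correct" if answer == "102" else "incorrect"
    return f"The answer {answer} is {verdict}; 17 * 6 = 102."

dataset = build_one_shot_cft_examples(question, solvers, critic)
```

Because every example shares the same question, the variety in the training set comes entirely from the different answers and the reasons they succeed or fail.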