机器之心

A Hardcore Teardown of Large Models: From DeepSeek-V3 to Kimi K2, Mainstream LLM Architectures Explained in One Article
机器之心· 2025-08-07 09:42
Core Viewpoint - The article reviews the evolution of large language models (LLMs) over the past seven years, noting that while model capabilities have improved greatly, the overall architecture has remained largely unchanged. It asks whether there have been any disruptive innovations, or whether progress has been incremental refinement within the existing framework [2][5].

Group 1: Architectural Innovations
- The article examines eight mainstream LLMs, including DeepSeek and Kimi, analyzing their architectural designs and innovations [5].
- DeepSeek V3, released in December 2024, introduced key architectural techniques that improved computational efficiency and set it apart from other LLMs [10][9].
- Multi-head latent attention (MLA) is introduced as a memory-saving strategy that compresses key and value tensors into a lower-dimensional latent space, significantly reducing memory usage during inference [18][22].

Group 2: Mixture-of-Experts (MoE)
- The MoE layer in the DeepSeek architecture replaces a single feed-forward block with multiple parallel feed-forward experts, greatly increasing the model's parameter capacity while keeping inference cost low through sparse activation (a minimal routing sketch appears after this summary) [23][30].
- DeepSeek V3 features 256 experts in each MoE module, with a total parameter count of 671 billion, but activates only 9 experts per token during inference [30].

Group 3: OLMo 2 and Its Design Choices
- OLMo 2 is noted for its high transparency in training data and architecture, which makes it a useful reference for LLM development [32][34].
- The architecture of OLMo 2 includes a distinctive normalization strategy, using RMSNorm placement and QK-norm to improve training stability [38][46].

Group 4: Gemma 3 and Sliding Window Attention
- Gemma 3 employs a sliding window attention mechanism to reduce the memory required for key-value (KV) caching, reflecting a shift toward local attention mechanisms (a mask sketch also appears after this summary) [53][60].
- The architecture of Gemma 3 also features a dual normalization strategy, combining Pre-Norm and Post-Norm placements [62][68].

Group 5: Mistral Small 3.1 and Performance
- Mistral Small 3.1, released in March 2025, outperforms Gemma 3 on several benchmarks, which is attributed to its custom tokenizer and smaller KV cache [73][75].
- Mistral Small 3.1 adopts a standard architecture and drops the sliding window attention mechanism used in Gemma 3 [76].

Group 6: Llama 4 and MoE Adoption
- Llama 4 adopts an MoE architecture similar to DeepSeek V3, but with notable differences in how experts are activated and in the overall design [80][84].
- MoE architectures have seen rapid development and adoption in 2025, indicating a trend toward more complex and capable models [85].

Group 7: Kimi K2 and Its Innovations
- Kimi K2, with a parameter count of 1 trillion, is recognized as one of the largest LLMs and uses a variant of the Muon optimizer to improve training [112][115].
- The architecture of Kimi K2 is based on DeepSeek V3 but scales up its design, illustrating the ongoing evolution of LLM architectures [115].
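To make the sparse activation described in Group 2 concrete, below is a minimal top-k mixture-of-experts layer in PyTorch. This is an illustrative sketch, not DeepSeek V3's implementation: the expert count, hidden sizes, and router are placeholder values, and DeepSeek V3 additionally keeps a shared expert active for every token, which this toy omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy mixture-of-experts layer: a router picks top_k experts per token."""
    def __init__(self, d_model=64, d_hidden=128, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # one score per expert, per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, seq, d_model)
        scores = self.router(x)                        # (batch, seq, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; all others stay idle,
        # which is why total parameters can grow without growing inference cost.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(2, 5, 64)                              # (batch=2, seq=5, d_model=64)
print(SparseMoE()(x).shape)                            # torch.Size([2, 5, 64])
```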
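Likewise, a minimal sketch of the sliding-window attention mask mentioned in Group 4: each query position may attend only to itself and the previous few positions, which is what bounds the KV cache. The window size and sequence length below are arbitrary illustrative values, not Gemma 3's actual settings.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask that is True where attention is allowed: causal and local."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    return (k <= q) & (k > q - window)       # within the last `window` positions, no future

# Each row shows which keys one query can see, for a 6-token sequence with window 3.
print(sliding_window_mask(6, 3).int())
```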
RUC Gaoling School of AI and Huawei Noah's Ark Lab: A Series of Studies on Memory Mechanisms for LLM Agents
机器之心· 2025-08-07 02:41
The first author of this series of work is Zhang Zeyu, a PhD student at Renmin University of China whose research focuses on memory mechanisms and personalization for LLM-based agents; Tan Haoran is a master's student at Renmin University of China working on LLM-based agents; Chen Xu is a tenure-track associate professor at Renmin University of China whose research interests include large language models and information retrieval.

Recently, agents built on large language models (LLM-based agents) have drawn wide attention in both academia and industry. For an agent, memory is a core capability: it records past information and external knowledge and is essential for improving abilities such as personalization. The Gaoling School of Artificial Intelligence at Renmin University of China and Huawei's Noah's Ark Lab have focused on the memory capability of LLM-based agents and, early in this line of research, built a complete research system comprising a survey paper, datasets, and a toolkit, aiming to advance the field.

An early survey of agent memory mechanisms (TOIS'25)

In April 2024, the team completed an early survey on agent memory mechanisms. The survey discusses agent memory comprehensively from several angles: it examines "what agent memory is" and "why agents need memory," reviews "how to implement agent memory" and "how to evaluate an agent's memory capability," organizes "memory-augmented agent applications," and points out the limitations of current work and future directions. Through this survey, the team hopes to ...
Guess What? Grok 4 Reaches the Final, Gemini Wiped Out in the Large-Model Showdown, and Musk Starts Humble-Bragging
机器之心· 2025-08-07 02:41
机器之心 report, 机器之心 editorial team.

Tomorrow, Grok faces OpenAI's o3.

No one expected that in the Kaggle AI Chess tournament Google put together (the large-model chess showdown), Grok 4 would beat Gemini 2.5 Pro in the semifinal and advance to the grand final!

In yesterday's matches, Gemini 2.5 Pro, o4-mini, Grok 4, and o3 each went 4-0, beating Claude 4 Opus, DeepSeek R1, Gemini 2.5 Flash, and Kimi k2 respectively to reach the semifinals.

Today's matches were again hard to call, and Gemini 2.5 Pro lost.

Musk's line about yesterday's results still works today: "Chess is far too simple; for Grok it's just a side effect. We didn't spend much effort on chess optimization."

With Grok 4 now in the grand final, one wonders whether Musk looks down on this competition even more.

Back to this semifinal. Grok 4 and o3 beat Gemini 2.5 Pro and o4-mini respectively and advanced to the final. o3's win was widely expected, but the fierce duel between Grok and Gemini stunned everyone: the two sides were tied 2:2 after the regular games, and ...
Token Costs Are Falling but Subscription Fees Are Soaring: What's Wrong with AI Companies?
机器之心· 2025-08-06 04:31
Core Viewpoint - The article examines the challenge AI companies face in balancing subscription pricing against operational costs, describing a "prisoner's dilemma" in which companies are torn between unlimited-use subscriptions and usage-based pricing, leaving many with unsustainable business models [3][45][46].

Group 1
- DeepSeek's breakout moment was driven in part by its remarkably low reported training cost of just over $5 million [1].
- Training costs for AI models have continued to fall, with Deep Cogito reportedly building a competitive model for under $3.5 million [2].
- Despite falling training costs, operational costs, particularly for inference, are rising sharply, creating a dilemma for AI companies [3][15].

Group 2
- Companies adopt low-cost subscriptions, such as $20 per month, to attract users, betting that model costs will keep falling [7][12].
- Even the expectation that model costs will drop tenfold does not relieve the pressure on subscription services, because operational costs keep rising [5][13].
- In practice, profit margins are shrinking even with cheaper models, as the experiences of products like Windsurf and Claude Code show [14][15].

Group 3
- Users increasingly demand the latest and most capable models, so demand shifts rapidly to each new release regardless of how cheap older models become [17][21].
- The pricing history of leading models shows that although older models get cheaper, demand for the newest technology keeps effective prices stable [20][22].
- Token consumption has grown dramatically, with the number of tokens used per task roughly doubling every six months, driving unexpected cost increases (see the back-of-the-envelope sketch after this summary) [28][29].

Group 4
- Companies like Anthropic have tried to ease cost pressure with measures such as raising subscription prices and optimizing model usage based on load [38][40].
- Despite these efforts, token consumption keeps rising exponentially, making sustainable pricing hard to maintain [41][44].
- The article argues that a fixed-price subscription model is no longer viable in the current landscape, as companies face a fundamental shift in pricing dynamics [44][60].

Group 5
- The article outlines three strategies for navigating the cost pressure: adopting usage-based pricing from the start, targeting high-margin enterprise clients, and vertically integrating to capture value across the tech stack [51][52][57].
- Companies that keep relying on fixed-rate subscriptions are likely to face serious challenges and potential failure [60][62].
- Hopes that future model costs will fall sharply may not keep pace with users' rising expectations for performance and capability [61][64].
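A back-of-the-envelope sketch of the dynamic in Group 3, using made-up starting numbers: if demand stays pinned to the newest frontier model, whose per-token price holds roughly steady, while tokens consumed per task double every six months (the article's claim), then the cost of serving a task quadruples every year.

```python
# Illustrative arithmetic only: the starting token count and the flat frontier
# price are assumptions; the six-month doubling is the article's claim.
price_per_m_tokens = 15.0      # dollars per million tokens, held constant
tokens_per_task = 100_000      # assumed starting consumption for one agentic task

for half_year in range(5):     # two and a half years
    cost = tokens_per_task / 1_000_000 * price_per_m_tokens
    print(f"month {half_year * 6:>2}: {tokens_per_task:>9,} tokens -> ${cost:.2f} per task")
    tokens_per_task *= 2
```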
ICCV 2025 | SeaS: A Unified Framework for Industrial Anomaly Generation, Normal-Sample Synthesis, and Precise Masks That Sweeps SOTA Across Metrics
机器之心· 2025-08-06 04:31
Core Viewpoint - The article presents SeaS, a unified few-shot industrial anomaly generation method that addresses the difficulty of producing diverse anomaly samples and precise mask annotations for industrial quality inspection, significantly improving downstream anomaly detection [3][45].

Group 1: Model Overview
- SeaS uses a unified framework that needs only 1-3 training samples to simultaneously deliver diverse anomaly generation, consistent normal-product synthesis, and pixel-precise mask annotation, setting a new benchmark for the field [9][45].
- The model uses a separation-and-sharing fine-tuning mechanism to model the different variation patterns of normal products and anomalies, improving generation precision while preserving anomaly diversity and normal-product consistency [10][45].

Group 2: Technical Innovations
- SeaS introduces three main innovations: a unified few-shot generation framework, the separation-and-sharing fine-tuning mechanism, and a refined mask-prediction branch that fuses U-Net discriminative features with high-resolution VAE features for pixel-accurate anomaly labeling [8][10][45].
- The model employs an unbalanced anomaly text-prompt structure to capture the inherent differences between normal and abnormal products, giving precise control over changes in anomaly regions [15][45].

Group 3: Performance Metrics
- SeaS outperforms existing few-shot industrial anomaly generation methods on key metrics across mainstream industrial datasets such as MVTec AD and VisA, with an average improvement of 12.79% in anomaly segmentation IoU (the IoU metric is sketched after this summary) [7][32][41].
- Data generated by SeaS substantially improves supervised segmentation models, with notable gains in metrics such as AUROC and pixel-level accuracy across datasets [38][41][43].

Group 4: Practical Applications
- Anomaly samples generated by SeaS can be applied to synthetic-data-based detection methods, yielding clear gains in detection performance and fewer false negatives across multiple datasets [37][45].
- The model's ability to generate high-quality normal images also helps augment training sets for unsupervised detection methods, reducing false positives and improving overall metrics [37][41].
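For reference, the anomaly-segmentation IoU cited in Group 3 is the standard intersection-over-union between a predicted binary anomaly mask and the ground-truth mask. A minimal sketch (not SeaS code; the toy masks are invented):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks: |pred AND gt| / |pred OR gt|."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                            # both masks empty: count as a perfect match
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

# Toy 4x4 masks purely for illustration: the overlap is one column out of three covered.
pred = np.array([[0, 1, 1, 0]] * 4)
gt   = np.array([[0, 0, 1, 1]] * 4)
print(mask_iou(pred, gt))                     # 4 / 12 = 0.333...
```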
Are You Kidding? DeepSeek and Kimi Knocked Out in Round One of the First Large-Model Showdown
机器之心· 2025-08-06 04:31
Core Viewpoint - The article reports the results of the first large-model chess competition organized by Google, highlighting the performance of the participating AI models, in particular Grok 4, which emerged as a strong contender with a perfect record [2][30].

Group 1: Competition Overview
- The chess competition lasted three days and featured models such as Gemini 2.5 Pro, o4-mini, Grok 4, and o3, all of which won 4-0 in the first round [4].
- The competition was held on the Kaggle Game Arena platform, aiming to evaluate how large language models (LLMs) perform in dynamic, competitive environments [6].

Group 2: Match Results
- Kimi k2 lost to o3 0-4, failing to make legal moves in all four games [7][8].
- o4-mini defeated DeepSeek R1 4-0, though the quality of play declined after a few strong opening moves [18][21].
- Gemini 2.5 Pro beat Claude 4 Opus 4-0, although its true strength remains uncertain due to Claude's mistakes [23][24].
- Grok 4 scored a perfect 4-0 against Gemini 2.5 Flash, showing superior chess skill and an ability to capitalize on unprotected pieces [30][33].

Group 3: Key Observations
- The competition revealed three main weaknesses in current AI models: insufficient visualization of the whole board, limited understanding of piece interactions, and trouble executing legal moves [36].
- Grok 4's performance suggests it may have overcome these limitations, though it remains to be seen whether these advantages hold up in future matches [36].

Group 4: Audience Engagement
- A poll conducted before the competition showed 37% of participants picking Gemini 2.5 Pro as the likely winner, with Grok 4 receiving 7.04% of the votes [37][38].
Purely to Block OpenAI: Claude Releases Claude Opus 4.1 Just Tens of Minutes Ahead
机器之心· 2025-08-06 01:49
Core Viewpoint - The article discusses the competitive landscape of AI model releases, highlighting that Anthropic shipped Claude Opus 4.1 shortly before OpenAI's anticipated announcement, a strategic move to capture market attention [1][2].

Summary by Sections

Model Release and Features
- Anthropic has launched Claude Opus 4.1, built on the Claude Opus 4 model released in May. The new model shows significant improvements in agentic tasks, real-world programming, and reasoning, with a context window of approximately 200K tokens [7].
- Claude Opus 4.1 is available to Claude Pro, Max, Team, and Enterprise users [8].

Pricing and Cost Efficiency
- API pricing for Claude Opus 4.1 is $15 per million input tokens and $75 per million output tokens. Prompt caching can cut costs by up to 90%, and batch processing by up to 50% (a simple cost calculation is sketched after this summary) [10][11].

Performance Improvements
- According to GitHub's evaluation, Claude Opus 4.1 outperforms its predecessor in most capabilities, particularly multi-file code refactoring. Users at Rakuten Group noted its precision when working in large codebases without introducing new bugs [14].
- The performance jump of Claude Opus 4.1 is compared to the upgrade from Sonnet 3.7 to Sonnet 4, indicating a substantial advance [15].

Benchmark Comparisons
- Across benchmarks, Claude Opus 4.1 shows superior performance compared to other models, achieving 74.5% on the agentic-coding benchmark SWE-bench and 80.9% on graduate-level reasoning (GPQA Diamond) [16].

Use Cases
- Claude Opus 4.1 supports hybrid reasoning modes for both instant responses and detailed reasoning. It is particularly effective for advanced programming tasks and for intelligent search and research applications, capable of conducting extensive autonomous research across diverse data sources [17][18].

Additional Information
- Anthropic also released a system card alongside the new model, providing further insight into its behavior and capabilities [19].
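A simple cost calculation at the list prices quoted above ($15 per million input tokens, $75 per million output tokens). The request sizes are made-up examples, and the prompt-caching and batch discounts are upper bounds that depend on usage, so they are not modeled here.

```python
# List prices quoted for Claude Opus 4.1; the request sizes below are invented examples.
INPUT_PRICE = 15.0 / 1_000_000     # dollars per input token
OUTPUT_PRICE = 75.0 / 1_000_000    # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call at list price, with no caching or batch discount."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# e.g. a large code-refactoring prompt with a moderate-length reply
print(f"${request_cost(150_000, 4_000):.2f}")   # 150k in + 4k out -> $2.55
```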
Discrete Tokenization: A Key Cornerstone of Multimodal Large Models, and the First Systematic Survey Is Out
机器之心· 2025-08-05 18:56
Core Insights
- The article discusses advances in Discrete Tokenization for multimodal large language models (LLMs), emphasizing its role in converting inputs from diverse modalities into discrete representations that LLMs can process effectively [2][39].
- A comprehensive survey has been released that maps the technical landscape, open challenges, and future research directions for Discrete Tokenization in multimodal LLMs [2][39].

Multimodal LLMs and Discrete Tokenization
- Recent breakthroughs in LLMs have driven their application to a wide range of text tasks, prompting interest in extending their capabilities to non-text modalities such as images, audio, and video [2].
- Discrete Tokenization has emerged as a key solution, using techniques such as Vector Quantization (VQ) to compress high-dimensional continuous inputs into compact discrete tokens, enhancing cross-modal understanding and generation [2][39].

Systematic Review and Methodologies
- The article presents the first systematic review of Discrete Tokenization for multimodal LLMs, organizing the literature by input modality and modality combination, from early single-modal tokenizers to multimodal tokenization methods [2][39].
- Eight core families of vector quantization methods are identified, including VQ, RVQ, PQ, AQ, FSQ, LFQ, BSQ, and Graph Anchor-Relation Tokenization, each with characteristics suited to different modalities and tasks (a minimal vector-quantization sketch follows this summary) [8][9][14].

Challenges and Future Directions
- Key challenges in Discrete Tokenization include codebook collapse, information loss during quantization, difficulty propagating gradients through the discrete bottleneck, and problems with granularity and semantic alignment [12][36].
- Future research may focus on adaptive quantization, unified frameworks, biologically inspired codebooks, cross-modal generalization, and improved interpretability [37][36].

Applications in Single and Multimodal Tasks
- Discrete Tokenization is widely used in single-modal tasks such as image retrieval, audio encoding, and video representation, letting LLMs handle non-text modalities effectively [20][22].
- In multimodal tasks it serves as a semantic bridge, enabling models to handle complex inputs across modalities and supporting tasks such as cross-modal retrieval and generation [27][30].
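To ground the vector-quantization (VQ) family listed above: the core operation maps each continuous embedding to the index of its nearest entry in a learned codebook, and that index is the discrete token an LLM can consume. A minimal nearest-neighbor lookup sketch (codebook learning, residual/product variants, and gradient tricks such as straight-through estimation are all omitted):

```python
import numpy as np

def vector_quantize(z: np.ndarray, codebook: np.ndarray):
    """Map each row of z to the index of its nearest codebook entry (its discrete token).

    z:        (n, d) continuous encoder outputs
    codebook: (K, d) code vectors
    returns:  (indices, quantized) where quantized[i] == codebook[indices[i]]
    """
    # Squared Euclidean distance from every input vector to every code vector.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (n, K)
    indices = dists.argmin(-1)                                      # (n,) token ids
    return indices, codebook[indices]

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))           # four continuous feature vectors
codebook = rng.normal(size=(16, 8))   # 16-entry codebook -> token ids in [0, 16)
ids, zq = vector_quantize(z, codebook)
print(ids.shape, zq.shape)            # (4,) (4, 8): one discrete token per input vector
```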
E-commerce's "Dueling Magic": Sellers Use Fake AI Images to Win Orders, Buyers Use AI-Generated Rotten Fruit to Scam Refunds
机器之心· 2025-08-05 08:41
Core Viewpoint - The article discusses the increasing misuse of AI technology by both buyers and sellers in e-commerce, leading to a trust crisis and the need for better verification methods to combat fraud [2][10][21].

Group 1: Buyer Misuse of AI
- Some buyers are using AI-generated images to falsely claim product defects in order to obtain refunds, exploiting the difficulty of verifying the condition of perishable goods like fruits [2][6].
- This practice has evolved from earlier methods where buyers used basic photo editing tools, making it harder for sellers to detect fraud due to the sophistication of AI-generated images [8][10].
- The phenomenon reflects a "tit-for-tat" mentality among buyers who have previously been deceived by sellers using AI-enhanced product images [10][21].

Group 2: Seller Misuse of AI
- Sellers are also misusing AI to create misleading product images, over-enhancing ordinary items, and generating fake reviews, which contributes to the issue of "goods not matching the description" [10][24].
- The article highlights that sellers may use virtual models and AI-generated content to cut costs, further complicating the authenticity of product representations [10][24].

Group 3: Proposed Solutions
- Various proposed solutions to combat this issue include requiring buyers to submit videos of defective products, taking multiple photos from different angles, and using in-app cameras to prevent the upload of AI-generated images [11][15][24].
- However, these solutions have limitations, as advanced AI tools can still generate convincing content, making it challenging to establish foolproof verification methods [11][15][23].

Group 4: Technological Innovations
- The article suggests that implementing digital watermarking and content provenance technologies could help in identifying and tracing AI-generated content, thus enhancing trust in e-commerce [19][21].
- The development of standards like C2PA and tools such as Google's SynthID aims to embed invisible watermarks in AI-generated media, which could serve as a digital identity for content [19][21][26].

Group 5: Ongoing Challenges
- The ongoing "cat-and-mouse" game between AI generation and detection technologies poses a continuous challenge, as both sides evolve rapidly [23][24].
- E-commerce platforms are exploring various strategies, including strengthening evidence chains and utilizing big data analytics to monitor user behavior and detect anomalies [24][26].
A Research-Writing Power Tool: An Open-Sourced Scientific Formula Extraction Tool That Surpasses Mathpix
机器之心· 2025-08-05 08:41
Optical character recognition (OCR) of LaTeX formulas is a foundational step in digitizing and intelligently processing scientific literature. Despite progress in the field, existing methods still face several challenges on real scientific documents. First, mainstream methods and public datasets mostly focus on structurally simple formulas with limited symbol sets, and fail to cover complex, high-difficulty formulas across disciplines. Second, multi-line formulas, long formulas, piecewise formulas, and page-level complex layouts, all common in real documents, have not received sufficient attention or handling. Third, most methods rely on dedicated models that must be specially designed for particular tasks, which limits generality and scalability.

To address these challenges, the DocTron team proposes a systematic solution. First, to remedy the limited coverage and simple structure of existing datasets, they built CSFormula, a large-scale, high-difficulty dataset spanning multiple disciplines and structures, with complex layouts at the line, paragraph, and page level. Second, the proposed DocTron-Formula model breaks the dependence on modeling specific structures by using a general-purpose large model for complex formula recognition; simple fine-tuning is enough to adapt it to diverse application scenarios. Finally, compared with the best customized formula-recognition models, the method not only achieves strong performance on mainstream open-source benchmarks, but also shows clear advantages in the page-level and paragraph-level complex-layout scenarios common in practice, pushing the application boundary of formula recognition.

$$\sigma^{2}=\i ...