Core Insights
- The founders of Moonshot AI (月之暗面), Yang Zhilin, Zhou Xinyu, and Wu Yuxin, recently held a lengthy online Q&A session on Reddit to discuss their new Kimi K2 Thinking model, which has surpassed GPT-5 in several benchmark tests and drawn significant attention from the global AI community [1][3].

Model Performance
- The Kimi K2 Thinking model, launched on November 6, is described as the most powerful open-source thinking model to date, achieving state-of-the-art (SOTA) results on multiple authoritative benchmarks [3].
- On Humanity's Last Exam (HLE), K2 Thinking scored 44.9%, ahead of GPT-5's 41.7%. On the BrowseComp benchmark it reached 60.2% versus GPT-5's 54.9%, and on SEAL-0 it scored 56.3%, exceeding GPT-5's 51.4% [3][4].

Technical Features
- K2 Thinking can autonomously perform 200 to 300 tool calls to solve complex problems, maintaining task continuity through an interleaved "thinking - tool call - thinking - tool call" execution mode that is still relatively novel among large language models [4][5]; a minimal sketch of such a loop appears after this summary.
- The model is trained with end-to-end reinforcement learning so that its performance remains stable across hundreds of tool calls, including retrieval steps [5].

Engineering Optimization
- The team achieved strong engineering optimization despite limited computational resources, running an H800 GPU cluster connected over InfiniBand and extracting the most performance from each GPU [7][8].
- On training cost, the founders said the widely reported $4.6 million figure is not an official number; the true cost is hard to quantify because of the substantial research and experimentation involved [8].

Open Source Strategy
- Moonshot AI's commitment to an open-source strategy has earned Chinese AI models broader international recognition. After Chinese IP addresses were banned from accessing certain models, Kimi K2's usage surged, and its API is priced at roughly one-fifth of Claude Sonnet's, a significant cost advantage [10].
- Despite concerns about the risks associated with "Chinese LLMs," the founders believe the open-source model can ease some of these apprehensions and promote collaboration rather than division [10].

Market Position
- In a recent ranking of model usage, Chinese models took seven of the top twenty spots, with Kimi K2 and Grok 4 leading in growth, processing over 10 billion tokens daily [10][11].

Future Developments
- The company is planning the next-generation K3 model, which will introduce significant architectural changes, including the experimental Kimi Delta Attention (KDA) module, which has shown promising results across several evaluation dimensions [12]; an illustrative sketch of the underlying delta-rule attention idea also follows at the end of this summary.
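The interleaved "thinking - tool call - thinking - tool call" pattern described under Technical Features can be illustrated with a generic agent loop. The sketch below is not Moonshot's implementation; the `call_model` and `run_tool` functions and the `Step` message schema are hypothetical placeholders standing in for whatever inference API and tool runtime are actually used. It only shows the control flow that lets a model string together a few hundred tool calls on a single task.

```python
# Minimal sketch of an interleaved "thinking -> tool call -> thinking -> ..." agent loop.
# All names here (call_model, run_tool, Step) are hypothetical placeholders; this is NOT
# Moonshot's implementation, only an illustration of the control flow.

from dataclasses import dataclass, field


@dataclass
class Step:
    reasoning: str                 # the model's intermediate "thinking" text
    tool_name: str | None = None   # tool the model chose to call, if any
    tool_args: dict = field(default_factory=dict)
    tool_result: str | None = None


def call_model(history: list[Step], task: str) -> Step:
    """Hypothetical model call: returns the next reasoning step and, optionally,
    a tool invocation. A real system would query an LLM API here."""
    raise NotImplementedError


def run_tool(name: str, args: dict) -> str:
    """Hypothetical tool runtime (search, browser, code interpreter, ...)."""
    raise NotImplementedError


def solve(task: str, max_steps: int = 300) -> list[Step]:
    """Interleave thinking and tool use until the model stops requesting tools
    or the step budget (300 here, matching the 200-300 calls cited above) runs out."""
    history: list[Step] = []
    for _ in range(max_steps):
        step = call_model(history, task)   # model "thinks" given the full trajectory so far
        if step.tool_name is None:         # no tool requested -> model considers the task done
            history.append(step)
            break
        step.tool_result = run_tool(step.tool_name, step.tool_args)  # execute the chosen tool
        history.append(step)               # feed the result back into the next round
    return history
```

The article attributes the stability of such long trajectories to end-to-end reinforcement learning over the whole loop, rather than to any particular prompt scaffolding.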
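Few implementation details of Kimi Delta Attention are public. As a point of reference only, the snippet below sketches the classic delta-rule linear-attention recurrence that the "Delta" family of attention variants builds on: a fast-weight state is corrected toward each new key/value pair with a per-step gate. It is a generic illustration, not KDA itself, and the shapes, the scalar `beta` gate, and the function name are assumptions made for this example.

```python
# Generic delta-rule linear attention recurrence (illustration only, NOT Kimi Delta Attention).
# The state S is a fast-weight matrix mapping keys to values; each step corrects S toward the
# new (k_t, v_t) pair with a learning-rate-like gate beta_t, so memory is updated rather than
# merely accumulated as in vanilla linear attention.

import numpy as np


def delta_rule_attention(q, k, v, beta):
    """q, k: (T, d_k); v: (T, d_v); beta: (T,) gate in [0, 1].
    Returns outputs of shape (T, d_v). Shapes and the scalar gate are
    simplifying assumptions for illustration."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))                              # fast-weight state
    out = np.empty((T, d_v))
    for t in range(T):
        pred = S @ k[t]                                   # what the state currently recalls for k_t
        S = S + beta[t] * np.outer(v[t] - pred, k[t])     # delta-rule correction toward v_t
        out[t] = S @ q[t]                                 # read out with the query
    return out


# Tiny usage example with random data.
rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 4
o = delta_rule_attention(rng.normal(size=(T, d_k)),
                         rng.normal(size=(T, d_k)),
                         rng.normal(size=(T, d_v)),
                         rng.uniform(0.0, 1.0, size=T))
print(o.shape)  # (8, 4)
```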
Chinese-developed large model surpasses GPT-5 in multiple benchmark tests
21st Century Business Herald·2025-11-15 09:49