Talk about cut-throat competition! Right after DeepSeek's update, Zhipu updates too: GLM-4.6, the strongest domestic model for code
量子位· 2025-09-30 08:26
Jin Lei, reporting for QbitAI (公众号 QbitAI)

Well then, everyone is rushing to ship before the National Day holiday. Right after DeepSeek updated to V3.2, Zhipu has now followed with the official release of GLM-4.6, pushing its coding ability to the strongest among domestic models.

According to Zhipu's test results, across 74 real-world programming tasks run in the Claude Code environment, GLM-4.6 outperformed Claude Sonnet 4 and surpassed the other domestic models.

Similar results appear in other evaluations. On the eight general-capability leaderboards (AIME 25, GPQA, LCB v6, HLE, SWE-Bench Verified, BrowseComp, Terminal-Bench, and τ^2-Bench), GLM-4.6 matches Claude Sonnet 4 on most of them, ranking first among domestic models.

High scores are only part of the story: GLM-4.6 also cuts average token consumption by more than 30% compared with GLM-4.5, the lowest among comparable models.

Zhipu has also openly released the full set of test questions and agent trajectories so that anyone can reproduce and verify the results (a loading sketch follows below): https://huggingface.co/datasets/zai-org/CC-Bench-traj ...
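For readers who want to inspect the released trajectories, here is a minimal sketch of pulling the dataset with the Hugging Face `datasets` library. Only the dataset ID comes from the article; the split names, columns, and whether the repo loads as a standard dataset are assumptions, so check the dataset card first.

```python
# Minimal sketch: downloading Zhipu's released CC-Bench trajectories for inspection.
# Only the dataset ID is taken from the article; split and column names are
# assumptions -- consult the dataset card on Hugging Face for the real schema.
from datasets import load_dataset

ds = load_dataset("zai-org/CC-Bench-traj")      # downloads whatever splits exist
print(ds)                                       # inspect split names and columns

first_split = next(iter(ds.values()))           # take the first available split
for record in first_split.select(range(min(3, len(first_split)))):
    # Print a truncated preview of each field so long agent trajectories stay readable.
    print({k: str(v)[:80] for k, v in record.items()})
```

If the repository turns out to hold raw trajectory files rather than a loadable dataset, `huggingface_hub.snapshot_download` is the usual fallback for grabbing the files directly.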
Prediction: Wall Street's Most Valuable Public Company by 2030 Will Be This Dual-Industry Leader (No, Not Nvidia)
The Motley Fool· 2025-09-28 07:06
Core Insights
- A historically inexpensive trillion-dollar business is positioned to surpass Nvidia, Apple, and Microsoft by the end of the decade [1]
- Wall Street's trillion-dollar businesses, including Nvidia, Apple, Broadcom, and TSMC, are key drivers of ongoing market outperformance [2]

Company Analysis
- Only 11 publicly traded companies have reached a $1 trillion market cap, with 10 listed on U.S. exchanges, including the "Magnificent Seven" and Berkshire Hathaway [3]
- Nvidia currently holds a market cap exceeding $4.3 trillion and is projected to potentially surpass $6 trillion based on optimistic analyst targets [6]
- Nvidia's dominance in AI GPUs is supported by strong demand and significant order backlogs for its advanced AI chips [7]
- Despite Nvidia's competitive advantages, historical trends suggest that its position may not be secure due to potential market corrections and competition [9][10]
- Amazon is identified as a strong candidate to become Wall Street's most valuable company by 2030, leveraging its e-commerce and cloud services [14]
- Amazon's e-commerce segment holds a 37.6% share of U.S. online retail sales, while its AWS platform commands a 32% share of global cloud infrastructure spending [15][17]
- AWS is experiencing high-teens percentage growth year-over-year and is projected to generate over $123 billion in annual run-rate revenue [18][19]
- Amazon's advertising and subscription services contribute significantly to its revenue, enhancing its pricing power [20]
- Amazon is currently valued at only 8 times projected cash flow in 2029, indicating potential for substantial market value growth [22]
Seeing Far, Thinking Right: Jiqizhixin (机器之心) Officially Launches Its 2025 Annual AI Rankings
机器之心· 2025-09-26 03:31
Core Viewpoint
- The article emphasizes the ongoing advancements in artificial intelligence (AI) as of 2025, highlighting the rapid iteration of large models and the emergence of new applications, particularly in China, where domestic models are approaching or surpassing international standards [2][3][4].

Summary by Sections

AI Development Trends
- In 2025, AI continues to evolve with significant breakthroughs in large models, including GPT-4.5, GPT-5, and Genie 3, enhancing capabilities in understanding, generation, and reasoning [3][4].
- The advancements in model capabilities are leading to new application forms, such as automated code generation and multi-step task completion in intelligent agents [4].

Domestic AI Landscape
- China's AI development in 2025 is marked by domestic large models not only matching but also leading in performance compared to international counterparts, with a strong open-source ecosystem [4].
- Recent rankings show that all top 15 open-source AI models on the Design Arena leaderboard are from China [4].

Recognition of AI Leaders
- The article outlines a curated list of top companies and products in AI for 2025, recognizing those with significant technological strength and innovation [6][7][8][9][10][11][12][13].
- Categories include:
  - **Top 10 Companies with Strong Technical Strength**: Companies that have made long-term investments in AI technology and maintain a leading position in the field [7].
  - **Top 20 AI Leading Companies**: Firms that have established comprehensive operational capabilities and competitive advantages in AI technology and applications [8].
  - **Top 20 Best Large Models**: Recognizing representative and powerful foundational models in the domestic market [9].
  - **Top 20 Best Large Model Products**: Highlighting valuable new products and applications based on large models [10].
  - **Top 10 Leading Companies in Embodied Intelligence**: Companies with systematic technology layouts and continuous innovation in the field of embodied intelligence [12].
  - **Top 10 Leading Companies in ScienceAI**: Firms focusing on the intersection of AI and other scientific disciplines, driving industry development through innovative solutions [13].
Alibaba (09988) Officially Launches Qwen3-Max, Its Largest and Most Capable Model to Date
智通财经网· 2025-09-24 03:07
Core Insights
- Alibaba Cloud Tongyi Qwen has launched its largest and most powerful model to date, Qwen3-Max, following the release of the Qwen3-2507 series [1]
- The preview version of Qwen3-Max-Instruct ranks third on the LMArena text leaderboard, surpassing GPT-5-Chat [1]
- The official version of Qwen3-Max has enhanced capabilities in coding and agent functions, achieving industry-leading performance across various benchmarks [1]

Model Specifications
- Qwen3-Max has over 1 trillion parameters and was pre-trained on 36 trillion tokens [1]
- The model architecture follows the design paradigm of the Qwen3 series and uses a global-batch load balancing loss proposed by Tongyi (see the sketch below) [1]

Enhanced Version
- The reasoning-enhanced version, Qwen3-Max-Thinking, has demonstrated exceptional potential, achieving 100% accuracy on high-difficulty reasoning benchmarks such as AIME 25 and HMMT [1]
- This version integrates a code interpreter and employs parallel test-time compute techniques [1]
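The summary mentions a "global-batch load balancing loss" without giving its form. The sketch below shows a generic mixture-of-experts auxiliary balancing loss in which the dispatch statistics are pooled over the entire global batch (via an all-reduce across ranks) rather than computed per micro-batch. This is one plausible reading of the term, not Qwen3-Max's published recipe, and the function and variable names are hypothetical.

```python
import torch
import torch.distributed as dist

def global_batch_balance_loss(router_probs: torch.Tensor,
                              expert_assignments: torch.Tensor,
                              num_experts: int) -> torch.Tensor:
    """Illustrative MoE auxiliary balancing loss whose dispatch statistics are
    pooled over the whole global batch instead of each micro-batch.
    router_probs: (tokens, num_experts) softmax routing probabilities.
    expert_assignments: (tokens,) integer index of the expert chosen per token.
    Not Qwen3-Max's published formulation -- a generic sketch only."""
    # Tokens dispatched to each expert on this rank (non-differentiable counts).
    counts = torch.bincount(expert_assignments, minlength=num_experts).float()
    total = torch.tensor(float(router_probs.shape[0]), device=router_probs.device)

    # Key difference vs. a per-micro-batch loss: pool the counts across all
    # ranks/micro-batches so the balance target reflects the *global* batch.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(counts)
        dist.all_reduce(total)

    f = counts / total                       # global fraction routed to each expert
    p = router_probs.mean(dim=0)             # local mean router prob (differentiable)
    return num_experts * torch.sum(f * p)    # classic aux-loss form with global f
```

Run single-process, the distributed branch is skipped and the loss reduces to the usual per-batch auxiliary loss; the intended effect of the global variant is that experts only need to be balanced on average over the whole batch, not within every small slice of it.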
Trump Brings in Oracle to Manage the TikTok Algorithm in US
Youtube· 2025-09-22 17:03
Core Viewpoint
- The White House is eager to finalize a deal involving TikTok, with Oracle as the lead company alongside private investors, focusing on algorithm management and data control [1][3][10].

Group 1: Deal Structure and Participants
- Oracle is positioned to own TikTok in partnership with private investors, indicating a shift towards US ownership [1][3].
- The algorithm for TikTok will either be rewritten or licensed, addressing previous concerns about data management [1][10].
- The involvement of multiple private investors complicates strategic decision-making, especially in the context of AI advancements [2][10].

Group 2: Leadership Changes at Oracle
- Oracle has announced a leadership transition, with Safra Catz being succeeded by two co-CEOs, one of whom oversees Oracle Cloud Infrastructure, crucial for the TikTok deal [3][5].
- This change reflects a move towards younger leadership, potentially aligning with the company's focus on AI and cloud services [4][5].

Group 3: Competitive Landscape and Challenges
- Competitors like YouTube and Instagram are benefiting from TikTok's uncertainty, as creators explore alternative platforms [6][7].
- The focus in the industry has shifted from recommendation algorithms to leveraging AI capabilities based on available data [7][8].
- Smaller players, such as Snapchat, may struggle to compete due to limited infrastructure for developing large language models [9].

Group 4: Regulatory and Operational Considerations
- The transaction is complex due to US laws mandating TikTok's sale to US owners while prohibiting ByteDance from any operational role [10][11].
- China's laws restrict the export of sensitive technologies, complicating the disentanglement of TikTok from ByteDance [11].
- Oracle's hosting of TikTok has been ongoing, suggesting a level of operational control that may appease regulators [12].

Group 5: Future Leadership and Strategy
- Uncertainty remains regarding the future leadership of TikTok USA, with no confirmed CEO or CFO as the transaction is not finalized [12][13].
- The focus on algorithm development may overshadow opportunities in large language models, which could be pivotal for TikTok's future [14].
Ark's Cathie Wood on H-1B Visas, China Tech Sector, TikTok Takeover
Youtube· 2025-09-22 08:54
Group 1: H-1B Visa and Tech Industry Impact
- The new application fee for H-1B visas is part of President Trump's negotiation strategy with India, which may impact tech companies reliant on foreign workers [1][4]
- The administration aims to retain foreign students educated in the U.S., which could influence innovation in Silicon Valley [3][4]
- In the short term, tech companies may need to enhance efficiency due to potential restrictions on H-1B visas [4]

Group 2: AI and Coding Job Market
- The number of coding jobs has significantly decreased due to advancements in AI, which allows more individuals to engage in coding [5][6]
- Companies are experiencing productivity increases despite a reduction in new job openings, which is sustaining profit margins [12][13]

Group 3: Chinese Tech Market Dynamics
- Chinese tech valuations are approximately half of those in the U.S., indicating a potential for growth and competition [6][7]
- China's focus on open-source software is accelerating its tech development, particularly after U.S. companies halted sales to avoid IP theft [7][8]
- The electric vehicle sector in China is reassessing commoditization, which may lead to more strategic development [8]

Group 4: Investment Trends and Market Competition
- The competition in the large language model space is narrowing, with a few key players emerging [11][12]
- Companies are willing to invest significantly in AI talent, indicating a strong market interest despite recent tariff impacts [13]
- The digital asset space is seeing increased exposure, with Bitcoin leading the market, while other cryptocurrencies are also being monitored [24][25]
A big reversal in GPT-5's coding evaluation! A failing grade on the surface, but 63.1% of tasks went unanswered; factor that in and its score is about double Claude's
量子位· 2025-09-22 08:08
Core Insights
- The article discusses the performance of leading AI models on the new software engineering benchmark SWE-BENCH PRO, revealing that none of the top models achieved a solution rate above 25% [1][23].

Group 1: Benchmark Overview
- SWE-BENCH PRO is a new benchmark that presents more challenging tasks than its predecessor, SWE-Bench-Verified, on which average accuracy had reached 70% [5][6].
- The new benchmark aims to eliminate data-contamination risks by ensuring that models have not encountered the test content during training [9][12].
- SWE-BENCH PRO includes a diverse codebase of 1865 commercial applications, B2B services, and developer tools, structured into public, commercial, and reserved subsets [12][18].

Group 2: Model Performance
- The top-performing models on the public set were GPT-5 and Claude Opus 4.1, with solution rates of 23.3% and 22.7%, respectively [25][26].
- In the commercial set, even the best models scored below 20%, indicating limited capability on real-world business problems [27][28].
- Model performance varied significantly across programming languages, with Go and Python generally performing better than JavaScript and TypeScript [30].

Group 3: Failure Analysis
- The primary failure modes included semantic-understanding issues, syntax errors, and incorrect answers, highlighting challenges in problem comprehension and algorithmic correctness [34].
- GPT-5 exhibited a high unanswered rate of 63.1%, indicating that while it performs well on certain tasks, it struggles with more complex problems (see the arithmetic sketch below) [32].
- The analysis suggests that the difficulty of programming languages, the nature of codebases, and the types of models are key factors influencing performance [28][29].
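The headline's "reversal" rests on simple arithmetic: if GPT-5 left 63.1% of tasks unanswered yet still resolved 23.3% of all tasks, its resolve rate on the tasks it actually attempted is far higher. The snippet below works through the figures quoted in the article; how Claude's rate should be adjusted is not stated in this summary, so the closing comparison is only indicative.

```python
# Back-of-the-envelope check of the headline claim, using figures from the article.
gpt5_overall_resolve = 0.233    # 23.3% of public-set tasks resolved
gpt5_unanswered      = 0.631    # 63.1% of tasks left unanswered
claude_overall       = 0.227    # 22.7% for Claude Opus 4.1 on the public set

attempted_fraction = 1.0 - gpt5_unanswered           # 36.9% of tasks attempted
resolve_among_attempted = gpt5_overall_resolve / attempted_fraction

print(f"GPT-5 resolve rate among attempted tasks: {resolve_among_attempted:.1%}")
# -> roughly 63%, which is the basis for the headline's claim that, once the
#    unanswered tasks are accounted for, GPT-5's score is about double Claude's.
#    (The article's exact adjustment for Claude is not given in this summary.)
```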
The latest ScienceQA leaderboard is out! New models from multiple companies all post higher scores | xbench Monthly Report
红杉汇· 2025-09-22 00:27
Core Insights
- The latest xbench Leaderboard has been released, showcasing updates from six models that have entered the top 10, including GPT-5-high and Qwen3-235B-A22B-Thinking-2507, with scores improving by 3-5 points [1][9][10]
- The dual-track evaluation system continues to track advancements in AGI, with a new question bank for the xbench-DeepSearch set expected to be released soon [1][2]

Model Performance Summary
- GPT-5-high from OpenAI shows a significant average-score increase from 60.8 to 64.4, while maintaining a stable BoN (N=5) score (see the scoring sketch below) [9][12]
- Qwen3-235B-A22B-Thinking-2507 has improved its average score from 45.4 to 55, with BoN rising from 66 to 77, indicating substantial gains [9][35]
- Claude Opus 4.1-Extended Thinking has increased its average score from 46.6 to 53.2, with a slight BoN increase from 69 to 72 [9]
- Kimi K2 0905 achieved an average score of 51.6, demonstrating a balance between model capability and response speed [9][28]
- GLM-4.5 from ZHIPU scored 48.8 with a BoN of 74, while Hunyuan-T1-20250711 scored 44.4 with a BoN of 63 [9]
- Grok-4 has shown a remarkable improvement, achieving a score of 65, marking it as a state-of-the-art model [9][10]

Evaluation Insights
- The distribution of model scores indicates a narrowing gap among the top performers, with the top five models scoring between 76 and 78 [10]
- The overall performance suggests that advancements in model capability are reaching a plateau, with smaller incremental improvements across most models [10][12]
- The xbench evaluation mechanism continues to provide real-time updates on model performance, with future rankings expected [2][8]
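The leaderboard reports two numbers per model: an average score and a BoN (N=5) score. xbench's exact aggregation rules are not described in this summary, so the sketch below only illustrates the usual reading, where the average is the mean score over repeated runs and BoN keeps the best result per question across N runs; the function and the toy data are hypothetical.

```python
from statistics import mean

def average_and_bon(per_run_scores: list[list[float]], n: int = 5) -> tuple[float, float]:
    """per_run_scores[i] holds the per-question scores from run i.
    Average = mean score over the first n runs; BoN = score obtained if, for
    every question, we keep the best answer found across those n runs.
    Illustrative only -- xbench's exact aggregation is not spelled out here."""
    runs = per_run_scores[:n]
    avg = mean(mean(run) for run in runs)
    best_per_question = [max(scores) for scores in zip(*runs)]
    bon = mean(best_per_question)
    return avg, bon

# Toy usage: 5 runs over 4 questions, each question scored 0 or 1.
runs = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [1, 0, 1, 0]]
print(average_and_bon(runs))   # -> (0.4, 0.75): average 40%, BoN(5) 75%
```

Under this reading, a rising BoN with a flat average (or vice versa) says something different about a model: the former suggests better peak capability, the latter better consistency across runs.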
X @The Economist
The Economist· 2025-09-18 15:30
The Emiratis’ carefully calibrated large language model https://t.co/KYKJ4kgPct ...
DeepSeek-R1 Makes the Cover of Nature: A Welcome Step Toward AI Transparency
36Kr· 2025-09-18 02:02
The value of open-source artificial intelligence (AI) is gaining wider recognition.

The DeepSeek-R1 paper has just appeared as the cover article of the authoritative scientific journal Nature, with DeepSeek founder and CEO Liang Wenfeng as the paper's corresponding author.

Paper link: https://www.nature.com/articles/s41586-025-09422-z

The research team hypothesized that human-defined reasoning patterns may limit a model's exploration, and that unrestricted reinforcement learning (RL) training can better incentivize the emergence of new reasoning capabilities in large language models (LLMs).

Their experiments show that an LLM's reasoning ability can be improved through pure RL, reducing the amount of human input needed to boost performance, and yielding better results than LLMs trained with conventional methods on tasks such as mathematics, programming competitions, and graduate-level STEM problems (a minimal sketch of the kind of automatically verifiable reward this relies on follows below).

Since its release, DeepSeek-R1 has been widely praised by developers worldwide; as of publication, it has reached 91.1k stars on GitHub.

In a companion perspective article, Carnegie Mellon University assistant professor Daphne Ippolito and PhD student Yiming Zhang (now an LLM safety and alignment researcher at Anthropic) commented: "DeepSeek-R1 has gone from a powerful but opaque solution finder ...
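One reason pure-RL training can cut the human effort is that, for domains like mathematics and competitive programming, the reward can be checked automatically rather than labeled by people. Below is a minimal sketch of such a rule-based, verifiable reward, assuming the model wraps its final answer in a \boxed{} marker; this illustrates the general idea, not DeepSeek-R1's actual reward code, and the function name and format convention are assumptions.

```python
import re

def rule_based_reward(model_output: str, reference_answer: str) -> float:
    """Toy verifiable reward of the kind pure-RL reasoning training relies on:
    no human rater, just an automatic check of the final answer.
    Illustrative only -- not DeepSeek-R1's actual reward implementation."""
    # Expect the model to end with a boxed final answer, e.g. "... \boxed{42}".
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0                       # no parseable final answer
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

print(rule_based_reward(r"Reasoning steps ... so \boxed{42}", "42"))   # 1.0
print(rule_based_reward("The answer is 41, I think.", "42"))           # 0.0
```

Because rewards like this can be computed for every sampled response at scale, the RL loop needs no per-example human judgment, which is the property the paper's "pure RL" framing emphasizes.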