Claude Opus 4.1

Copilot users rejoice! Microsoft announces it is bringing in Claude models; OpenAI is no longer the "sole favorite"
AI前线· 2025-09-26 12:07
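Compiled by Hua Wei. Microsoft is deepening its new partnership with Anthropic, OpenAI's main competitor. Starting this Wednesday, the software giant will integrate Anthropic's AI models into its AI assistant Copilot, whose core technology has until now come mainly from OpenAI. On September 25, Microsoft CEO Satya Nadella personally announced the news on X. The agreement marks another significant step in Microsoft's "gradual unbundling" from its former exclusive partner, OpenAI. Just weeks earlier, Microsoft had signed another agreement to bring Anthropic's AI technology to the Office 365 applications (such as Word, Excel, and Outlook). After this integration, Copilot's commercial users can choose between two technology options for specific tasks (such as complex research, custom AI tool development, and building enterprise-grade agents): OpenAI's deep reasoning models, or Anthropic's Claude Opus 4.1 and Claude Sonnet 4 models. Claude Opus 4.1 focuses on complex reasoning, coding, and in-depth architecture planning, while Claude Sonnet 4 ...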
Ten Key Terms for the AI Industry in 2025
机器人圈· 2025-09-26 09:29
Core Insights
- The 2025 Artificial Intelligence Industry Conference highlighted ten key trends in AI, emphasizing the convergence of technology, applications, and ecosystems, leading to a clearer vision of a smart-native world [1].
Group 1: Foundation Super Models
- In 2025, foundation models and reasoning models are advancing in parallel, with overall capability rising by more than 30% from late 2024 to August 2025 [3][4].
- Key features of leading large models include the integration of thinking and non-thinking modes, stronger understanding and reasoning abilities, and built-in agent capabilities for real-world applications [4][6].
- The emergence of foundation super models simplifies user interaction, improves workflow precision, and raises new data supply requirements [6].
Group 2: Autonomous Intelligent Agents
- Highly encapsulated intelligent agent products are unlocking the potential of large models and outperform single models on complex tasks [9][10].
- Current intelligent agents still have significant room for improvement, particularly in long-duration task execution and interconnectivity [12].
Group 3: Embodied Intelligence
- Embodied intelligence is moving from laboratory settings to real-world applications, with models being deployed in practical scenarios [15][16].
- Challenges remain in data quality, model generalization, and software-hardware coordination for effective task execution [18].
Group 4: World Models
- World models are emerging as a core pathway to artificial general intelligence (AGI), focusing on capabilities such as data generation, action interpretation, environment interaction, and scene reconstruction [21][22].
- The development of world models faces challenges such as unclear definitions, diverse technical routes, and limited application scope [22].
Group 5: AI Reshaping Software
- AI is transforming the software development lifecycle, with significant increases in token usage for programming tasks and the introduction of advanced AI tools [25][28].
- Software developers are moving into more complex roles, leading to the emergence of "super individuals" [28].
Group 6: Open Intelligent Computing Ecosystem
- The intelligent computing landscape is shifting towards an open-source model, fostering collaboration and innovation across various sectors [30][32].
- The synergy between software and hardware is improving, with domestic hardware achieving performance parity with leading systems [30].
Group 7: High-Quality Industry Data Sets
- The focus of AI data set construction is shifting from general-purpose to high-quality industry-specific data sets, addressing critical quality issues [35][38].
- New data supply chains are needed to support advanced technologies such as reinforcement learning and world models [38].
Group 8: Open Source as Standard
- Open-source initiatives are reshaping the AI landscape, with significant adoption of domestic open-source models and a growing number of active developers [40][42].
- The business model is evolving towards "free open source plus paid high-level services," which promotes demand for cloud services and chips [42].
Group 9: Mitigating Model Hallucinations
- Hallucination in large models is becoming a significant barrier to application, with ongoing research into mitigation strategies [44][46].
- Various approaches are being explored to improve data quality, model training, and user-side testing in order to reduce hallucination rates [46].
Group 10: AI as an International Public Good
- Global AI development is uneven, necessitating international cooperation to promote equitable access to AI technologies [49][51].
- Strategies are being implemented to address challenges in cross-border compliance and data flow, aiming to make AI a truly shared international public good [51].
Which is the strongest "working AI"? OpenAI ran the test itself, and first place did not go to its own model
量子位· 2025-09-26 04:56
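By Xifeng, from Aofeisi. QbitAI | WeChat official account QbitAI. OpenAI has published new research, and in it gave Claude a round of praise. They propose a new benchmark called GDPval to measure how AI models perform on real-world tasks that have economic value. OpenAI has also open-sourced a high-quality subset of 220 tasks and provides a public automated grading service. Specifically, GDPval covers 44 occupations across the 9 industries that contribute most to US GDP; together these occupations account for $3 trillion in annual earnings. The tasks are designed from representative work by industry experts with an average of 14 years of experience. Professional graders compared the outputs of mainstream models with the work of human experts. In the final results, Claude Opus 4.1 was the best-performing model, with 47.6% of its outputs rated on par with the work of human experts. GPT-5, at 38.8%, still trails Claude by some margin and ranks second; GPT-4o won or tied against humans in only 12.4% of cases. Having missed the top spot, OpenAI offered its own explanation: different models have different strengths, with Claude Opus 4.1 standing out mainly on aesthetics while GPT-5 is better on accuracy. OpenAI also said the pace of model progress is worth noting: its frontier models nearly doubled their win rate in just one year. Netizens ...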
Large AI models can rival human experts; AI Artificial Intelligence ETF (512930) pulls back today to consolidate
Xin Lang Cai Jing· 2025-09-26 02:24
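On September 25, OpenAI released a new benchmark for comparing the work of its AI models with that of professionals across industries. OpenAI said on Thursday that its GPT-5 model and rival Anthropic's Claude Opus 4.1 are "already approaching the quality of work produced by industry experts." As of 09:58 on September 26, 2025, the CSI Artificial Intelligence Theme Index (930713) was down 1.58%. Constituent stocks were mixed: Amlogic (688099) led the gainers, up 3.75%, OmniVision Group (603501) rose 2.97%, and Fudan Microelectronics (688385) rose 2.07%; 37 Interactive Entertainment (002555) led the decliners, down 4.72%, VeriSilicon (688521) fell 4.61%, and Kunlun Tech (300418) fell 4.51%. The AI Artificial Intelligence ETF (512930) fell 1.65%, with a latest price of 2.2 yuan. Data show that as of August 29, 2025, the top ten weighted stocks in the CSI Artificial Intelligence Theme Index (930713) were Eoptolink (300502), Zhongji Innolight (300308), Cambricon (688256), Montage Technology (688008), Sugon (603019), iFlytek (002230), OmniVision Group (603501), Hikvision (002415), Kingsoft Office (688111), and Inspur Information (000977); the top ten weighted stocks together account for ...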
OpenAI test says GPT-5 rivals experts
36Kr · 2025-09-26 01:27
Core Insights
- OpenAI's GPT-5 model and Anthropic's Claude Opus 4.1 are reported to be approaching the quality of work produced by industry experts, according to a new benchmark test called GDPval [1][2]
- GDPval evaluates how AI systems perform on economically valuable work, which OpenAI regards as crucial for developing artificial general intelligence (AGI) [1]
- The test covers 44 occupations across the nine major industries contributing to US GDP, including healthcare, finance, manufacturing, and government [1]
Group 1
- In the initial version, GDPval-v0, senior professionals compared AI-generated reports with those written by human experts, and the average "win rate" of each AI model was calculated [2]
- GPT-5-high was rated as better than or on par with industry experts in 40.6% of cases, while Claude Opus 4.1 reached 49%, the stronger result [2]
- OpenAI acknowledges that the current GDPval test assesses only a limited part of professional work and plans to develop more comprehensive tests in the future [2]
Group 2
- OpenAI's Chief Economist, Aaron Chatterji, stated that the results suggest professionals can save time by using AI models, allowing them to focus on more meaningful tasks [3]
- Tejal Patwardhan, the evaluation lead, expressed optimism about the progress shown on GDPval, noting that GPT-4o scored only 13.7% about 15 months ago, and that GPT-5's score is nearly triple that [3]
- The trend of improving AI capabilities is expected to continue, increasing the potential for AI to assist with a range of professional tasks [3]
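The "win rate" figures quoted above (a model's output rated better than or on par with the expert's) can be illustrated with a small sketch. This is not OpenAI's grading code: the verdict labels `"win"`, `"tie"`, and `"loss"`, the function name, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def win_or_tie_rate(judgments):
    """Share of comparisons in which the model output was rated better
    than ("win") or as good as ("tie") the human expert's work.

    `judgments` is a list of hypothetical grader verdicts, one per task.
    """
    if not judgments:
        raise ValueError("no judgments to score")
    counts = Counter(judgments)
    return (counts["win"] + counts["tie"]) / len(judgments)


# Illustrative verdicts for five tasks (made-up data, not GDPval results).
sample = ["win", "loss", "tie", "loss", "loss"]
print(f"{win_or_tie_rate(sample):.1%}")  # prints 40.0%
```

Counting ties together with wins matches the "better than or on par with" wording in the summary; a stricter metric would count wins only.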
US Defense Secretary orders hundreds of generals to assemble urgently; OpenAI test says GPT-5 rivals experts | Global Markets
Sou Hu Cai Jing· 2025-09-26 00:09
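Overnight stock markets

| Index | Thursday change |
| --- | --- |
| Shanghai Composite | -0.01% |
| Shenzhen Component | 0.67% |
| Hang Seng Index | -0.13% |
| Nikkei 225 | 0.27% |
| KOSPI | -0.03% |
| Germany DAX 30 | -0.56% |
| France CAC 40 | -0.41% |
| UK FTSE 100 | -0.39% |
| BAND FRE 50 | -0.36% |
| Nasdaq Composite | -0.50% |
| S&P 500 | -0.50% |
| Dow Jones | -0.38% |

Major global indexes broadly fell on Thursday, and the main US stock indexes closed lower together for a third consecutive trading day.

Commodities

| Instrument | Thursday change |
| --- | --- |
| NYMEX WTI crude oil | 0.02% |
| ICE Brent crude oil | 0.16% |
| COMEX gold | 0.33% |
| COMEX silver | 2.89% |
| NYMEX palladium | 3.05% |
| NYMEX natural gas | 3.13% |
| LME copper | -0.59% |
| LME aluminum | 0.47% |
| LME zinc | ...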
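Three major US indexes fall for a third straight day; tech stocks under pressure as Oracle drops more than 5%; most Chinese concept stocks rise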
Feng Huang Wang· 2025-09-25 22:20
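On Thursday, Eastern Time, the three major stock indexes fell together for a third consecutive trading day, after a series of government data releases and corporate news sent mixed signals about the economic outlook. (Intraday chart of the three major indexes; source: TradingView) At the close, the Dow Jones index fell 0.38% to 45,947.32; the S&P 500 fell 0.50% to 6,604.72; and the Nasdaq fell 0.50% to 22,384.70. Data released before the open showed the final reading of US second-quarter GDP growth at 3.8%, above the earlier estimate of 3.3%. According to the latest initial jobless claims data, the number of new US unemployment benefit claims declined. August durable goods orders also rebounded, driven by a surge in aircraft orders. Together, these three data points support the view that the US economy remains solid and may heat up again, a view that has pushed US stocks to repeated record highs over the past few weeks. But traders noted that expectations of a strong economy have largely been priced in, and Thursday's session also showed some signals at odds with the bull-market narrative. Meanwhile, rising US Treasury yields put further pressure on technology stocks, prompting investors to cut positions and avoid risk. The 10-year Treasury yield touched 4.2%. Oracle was the main drag on the S&P 500, falling more than 5% and declining for a third consecutive trading day. The market remains doubtful about the sustainability of the artificial intelligence (AI) trading boom. As of Thursday's close, Oracle was down nearly 16% from its recent high. Popular stocks: large-cap tech ...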
Microsoft to use Anthropic's AI models to power its AI assistant
Gelonghui APP · 2025-09-24 15:35
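Gelonghui, September 24 | Microsoft is today bringing Anthropic's Claude Sonnet 4 and Claude Opus 4.1 AI models to Microsoft 365 Copilot. This is a significant move that not only expands the choice of models in Microsoft 365 Copilot beyond OpenAI, but also lets Microsoft's customers access Anthropic models in Researcher and Microsoft Copilot Studio. Charles Lamanna, president of Microsoft's Business & Industry Copilot team, explained: "Copilot will continue to be powered by OpenAI's latest models, and now our customers also have the flexibility to use Anthropic models, starting in Researcher or when building agents in Microsoft Copilot Studio." ...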
Microsoft adds Anthropic AI model to Copilot assistant, diversifying from OpenAI
CNBC· 2025-09-24 15:00
Microsoft is the lead investor in OpenAI and has long been the artificial intelligence startup's key cloud partner. But in the latest sign that AI relationships are getting complicated, Microsoft is beginning to use more technology from OpenAI rival Anthropic. The software giant said Wednesday that it's starting to draw on an AI model from Anthropic to answer some queries in the Microsoft 365 Copilot assistant for commercial clients. The effort represents another step toward diversification in generative AI f ...
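GPT-5 coding evaluation takes a sharp turn: a failing grade on the surface, but it actually left 63.1% of tasks unsubmitted; counting everything, its score is double Claude's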
36Kr · 2025-09-22 11:39
Core Insights
- Scale AI's new software engineering benchmark, SWE-Bench Pro, shows that leading models such as GPT-5, Claude Opus 4.1, and Gemini 2.5 have low resolution rates, with none exceeding 25% [1][11]
- The benchmark is significantly harder than its predecessor, SWE-Bench-Verified, on which average accuracy was 70% [4][11]
- The new benchmark aims to eliminate data contamination and better reflect real-world software engineering challenges by using previously unseen tasks [4][7]
Benchmark Details
- SWE-Bench Pro includes 1,865 tasks drawn from diverse code repositories, organized into three subsets: public, commercial, and held-out [7]
- The public subset consists of 731 problems from 11 public code repositories, while the commercial subset includes 276 problems from startup code repositories [7]
- The benchmark excludes trivial edits and focuses on complex tasks requiring multi-file modifications, which makes the assessment more rigorous [7][4]
Testing Methodology
- The evaluation process uses a "human in the loop" approach, augmenting problem statements with additional context and requirements [8][9]
- Each task is assessed in a containerized environment, so that models are tested under controlled, reproducible conditions [10]
- Testing includes fail2pass tests, which verify that the problem is resolved, and pass2pass tests, which verify that existing functionality still works [10]
Model Performance
- The resolution rates for the top models are: GPT-5 at 23.3%, Claude Opus 4.1 at 22.7%, and Gemini 2.5 at 13.5% [13][14]
- Even the best-performing models scored below 20% on the commercial subset, indicating limited ability to solve real-world business problems [13][11]
- The analysis shows that programming language difficulty and differences between code repositories significantly affect model performance [15]
Failure Analysis
- Common failure modes include semantic understanding issues, syntax errors, and incorrect solutions, with GPT-5 showing a high non-response rate of 63.1% [16][17]
- Claude Opus 4.1 struggles with semantic understanding, while Gemini 2.5 shows balanced failure rates across multiple dimensions [17][16]
- QWEN3 32B, an open-source model, has the highest tool error rate, underscoring the importance of integrated tool use for effective performance [17]
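The fail2pass and pass2pass checks mentioned under Testing Methodology above can be sketched as follows. This is a minimal illustration under stated assumptions, not Scale AI's harness: the function names, the test identifiers, and the dictionary of post-patch results are all hypothetical.

```python
def is_resolved(fail_to_pass, pass_to_pass, results_after_patch):
    """Decide whether a task counts as resolved.

    fail_to_pass: tests that failed before the patch and must pass after it.
    pass_to_pass: tests that passed before the patch and must keep passing.
    results_after_patch: hypothetical mapping of test name -> True if the
    test passed after the model's patch was applied. A test missing from
    the mapping is treated as a failure.
    """
    fixed = all(results_after_patch.get(t, False) for t in fail_to_pass)
    intact = all(results_after_patch.get(t, False) for t in pass_to_pass)
    return fixed and intact


def resolution_rate(outcomes):
    """Fraction of tasks resolved, given one boolean per task."""
    return sum(outcomes) / len(outcomes)


# Made-up example: the patch fixes the bug but breaks an existing test.
after = {"test_bug_fixed": True, "test_login": True, "test_export": False}
print(is_resolved(["test_bug_fixed"], ["test_login", "test_export"], after))  # False
```

Treating an unsubmitted or missing result as a failure is a design choice in this sketch; it shows how a model that leaves many tasks unanswered ends up with a low resolution rate even when its submitted patches are often correct.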