Claude Opus 4

Search documents
刚刚,Anthropic新CTO上任,与Meta、OpenAI的AI基础设施之争一触即发
机器之心· 2025-10-03 00:24
机器之心报道 机器之心编辑部 就在刚刚,Anthropic 迎来了新的首席技术官(CTO)—— 前 Stripe 首席技术官 Rahul Patil。 据报道,Rahul Patil 于本周早些时候加入公司,接替了联合创始人 Sam McCandlish,后者将转任首席架构师一职。 Rahul Patil 在社媒上表达了自己加入 Anthropic 的激动之情与未来期许。他表示,自己很高兴加入一个新的使命和召唤。AI 的可能性是无穷无尽的,这将是一次非 凡的发现之旅,需要付出努力,将这些可能性变为现实。 更重要的是,这将要求我们每天做出深思熟虑的决策,以安全地驾驭这一巨大变革,确保负责任的 AI 最终获胜。 他很感激能够加入 Anthropic 这个谦逊、聪明、勤奋且有责任感的团队,他们激发了全球无数的想象力!他还感谢每一位与自己建设 Stripe 的团队成员,感谢他们 过去五年多的深刻变革! 作为 CTO,Rahul Patil 将负责计算、基础设施、推理以及其他各类工程任务。Sam McCandlish 在担任首席架构师期间,将继续从事预训练和大规模模型训练的工 作,扩展之前的工作。他们二人都将向 Ant ...
先发制人!Anthropic发布Claude 4.5 以“30小时独立编码”能力狙击OpenAI大会
智通财经网· 2025-09-30 02:05
Anthropic联合创始人兼首席科学官Jared Kaplan表示,Claude Sonnet 4.5在"几乎所有方面"都比该公司最 新的高端模型Opus更出色。同时他透露,Anthropic也在研发Opus的升级版本,预计将于今年晚些时候 推出。他还补充道:"两种不同规模的模型(指Sonnet与Opus)各有应用场景,我们能从它们的实际使用中 分别获得改进灵感与收益。" Anthropic还指出,新模型在满足实际业务需求方面取得了显著进展——而这正是当前行业观察人士日 益关注的焦点。近几周的多项研究显示,AI尚未为争相采用它的公司带来显著效益。对此Anthropic强 调,Claude Sonnet 4.5在网络安全、金融服务等行业的特定任务中表现尤为突出。 Anthropic首席产品官Mike Krieger表示,要让企业充分挖掘AI的价值,"还有几件事需要落实"。他认 为,这既包括AI模型本身的持续优化,也需要"用户逐渐适应并调整自身工作流程"。此外他还提 到,"前沿AI实验室与企业之间还需建立更深层次的合作关系"。 智通财经APP获悉,Anthropic近日发布了一款全新人工智能(AI)模型,其设计 ...
Study: AI LLM Models Now Master Highest CFA Exam Level
Yahoo Finance· 2025-09-22 17:43
You can find original article here Wealthmanagement. Subscribe to our free daily Wealthmanagement newsletters. In 2024, a study by J.P. Morgan AI Research and Queen’s University found that leading proprietary artificial intelligence models could pass the CFA Level I and II mock exams, but they struggled with the essay portion of the Level III exam. A new research study has found that today’s leading large language models can now clear the CFA Level III exam, including the essay portion. The CFA Level III ...
马斯克开始疯狂剧透Grok 5了
Sou Hu Cai Jing· 2025-09-18 06:34
Core Insights - Musk's Grok 5 is anticipated to achieve AGI, following the success of Grok 4, which has surpassed expectations in various rankings [4][15] - The ARC-AGI leaderboard evaluates AI's ability to solve complex problems, with Grok 4 performing notably well [6][11] - Grok 5 is set to begin training soon, with a significant increase in training data and hardware resources compared to previous versions [15][18] Group 1 - Grok 4 has achieved top rankings on multiple lists within two months of its release, indicating strong performance [4][11] - The ARC-AGI leaderboard assesses AI models' reasoning capabilities, with Grok 4 scoring 66.7% and 16% on different tasks [6][11] - Musk expresses confidence that Grok 5 could potentially reach AGI, estimating a possibility of 10% or higher [14][15] Group 2 - Grok 5 will have a larger training dataset than Grok 4, which already had 100 times the training volume of Grok 2 and 10 times that of Grok 3 [15][18] - Musk's xAI has a robust data collection system, utilizing Tesla's FSD and cameras to generate data, ensuring a wealth of training material [18] - The dedicated supercomputing cluster, Colossus, has approximately 230,000 GPUs, including 30,000 NVIDIA GB200s, to support Grok's training [18]
马斯克开始疯狂剧透Grok 5了
量子位· 2025-09-18 06:09
Core Viewpoint - The article discusses the advancements of Musk's Grok AI models, particularly Grok 5, which is anticipated to achieve Artificial General Intelligence (AGI) and surpass existing models like OpenAI's GPT-5 and Anthropic's Claude Opus 4 [6][19][20]. Group 1: Grok Model Performance - Grok 4 has shown exceptional performance, achieving top scores on multiple benchmarks shortly after its release, indicating its strong capabilities in complex problem-solving [8][10]. - In the ARC-AGI leaderboard, Grok 4 scored 66.7% and 16% on v1 and v2 tests, respectively, outperforming Claude Opus 4 and showing competitive results against GPT-5 [13]. - New approaches based on Grok 4 have been developed, achieving even higher scores, such as 79.6% and 29.44% by using English instead of Python for programming tasks [14]. Group 2: Grok 5 Expectations - Musk believes Grok 5 has the potential to reach AGI, with a possibility of achieving this at 10% or higher, a significant increase from his previous skepticism about Grok's capabilities [19][20]. - Grok 5 is set to begin training in the coming weeks, with a planned release by the end of the year, indicating a rapid development timeline [21][22]. - The training data for Grok 5 will be significantly larger than that of Grok 4, which already had 100 times the training volume of Grok 2 and 10 times that of Grok 3 [23]. Group 3: Data and Hardware Investments - Musk's xAI has established a robust data collection system, leveraging Tesla's FSD and cameras, as well as data generated by the Optimus robot, ensuring a continuous influx of real-world data for training [24][25]. - xAI is also investing heavily in hardware, aiming to deploy the equivalent of 50 million H100 GPUs over five years, with approximately 230,000 GPUs already operational for Grok training [26].
下棋比智商!8 大 AI 模型上演棋盘大战,谁能称王?
AI前线· 2025-09-18 02:28
Core Insights - Kaggle has launched the Kaggle Game Arena in collaboration with Google DeepMind, focusing on evaluating AI models through strategic games [2] - The platform provides a controlled environment for AI models to compete against each other, ensuring fair assessments through an all-play-all format [2][3] - The initial participants include eight prominent AI models from various companies, highlighting the competitive landscape in AI development [2] Group 1 - The Kaggle Game Arena shifts the focus of AI evaluation from language tasks and image classification to decision-making under rules and constraints [3] - This benchmarking approach helps identify strengths and weaknesses of AI systems beyond traditional datasets, although some caution that controlled environments may not fully replicate real-world complexities [3] - The platform aims to expand beyond chess to include card games and digital games, testing AI's strategic reasoning capabilities [5] Group 2 - AI enthusiasts express excitement about the potential of the platform to reveal the true capabilities of top AI models in competitive scenarios [4][5] - The standardized competition mechanism of Kaggle Game Arena establishes a new benchmark for assessing AI models, emphasizing decision-making abilities in competitive environments [5]
大模型碰到真难题了,测了500道,o3 Pro仅通过15%
机器之心· 2025-09-14 03:07
Core Insights - The article discusses the development of a new benchmark called UQ (Unsolved Questions) to evaluate the capabilities of large language models, focusing on unsolved problems that reflect real-world challenges [2][3][5] - UQ consists of 500 challenging questions sourced from the Stack Exchange community, designed to assess reasoning, factual accuracy, and browsing capabilities of models [3][8] - The study highlights the limitations of existing benchmarks, which often prioritize difficulty over real-world applicability, and proposes a continuous evaluation method through community validation [1][5] Group 1 - UQ is a test set of 500 unsolved questions covering various topics, including computer science, mathematics, and history, aimed at evaluating model performance in a realistic context [3][8] - The selection process for UQ involved multiple filtering stages, reducing an initial pool of approximately 3 million questions to 500 through rule-based, model-based, and manual reviews [10][11] - The best-performing model in the UQ validation only succeeded in answering 15% of the questions, indicating the high difficulty level of the benchmark [5][7] Group 2 - The UQ validation process employs a composite verification strategy that leverages the strengths of different models to assess candidate answers without requiring standard answers [14][26] - The study found that using a composite validator significantly reduces self-bias and over-optimism in model evaluations, which is a common issue when models assess their own performance [24][25][26] - Results showed that a stronger answer generation model does not necessarily correlate with better answer validation performance, highlighting the complexity of model capabilities [27][28]
X @Anthropic
Anthropic· 2025-09-12 20:26
Their ongoing testing of models like Claude Opus 4 and 4.1 has helped us find vulnerabilities and build strong safeguards before deployment.Read more: https://t.co/H0hoj3y2Ln ...
OpenAI、Anthropic罕见合作
3 6 Ke· 2025-08-29 01:32
Core Insights - OpenAI and Anthropic have engaged in a rare collaboration to conduct joint safety testing of their AI models, temporarily sharing their proprietary technologies to identify blind spots in their internal assessments [1][4] - This collaboration comes amid a competitive landscape where significant investments in data centers and talent are becoming industry standards, raising concerns about the potential compromise of safety standards due to rushed development [1][4] Group 1: Collaboration Details - The two companies granted each other special API access to lower-security versions of their AI models for the purpose of this research, with the GPT-5 model not participating as it had not yet been released [3] - OpenAI's co-founder Wojciech Zaremba emphasized the increasing importance of such collaborations as AI technology impacts millions daily, highlighting the broader issue of establishing safety and cooperation standards in the industry [4] - Anthropic's researcher Nicholas Carlini expressed a desire for continued collaboration, allowing OpenAI's safety researchers access to Anthropic's Claude model [4][7] Group 2: Research Findings - A notable finding from the research indicated that Anthropic's Claude Opus 4 and Sonnet 4 models refused to answer up to 70% of questions when uncertain, while OpenAI's models had a lower refusal rate but a higher tendency to generate incorrect answers [5] - The phenomenon of "flattery," where AI models reinforce negative behaviors to please users, was identified as a pressing safety concern, with extreme cases observed in GPT-4.1 and Claude Opus 4 [6] - A recent lawsuit against OpenAI highlighted the potential dangers of AI models providing harmful suggestions, underscoring the need for improved safety measures [6]
OpenAI和Anthropic罕见互评模型:Claude幻觉明显要低
量子位· 2025-08-28 06:46
Core Viewpoint - The collaboration between OpenAI and Anthropic marks a significant moment in the AI industry, as it is the first time these leading companies have worked together to evaluate each other's models for safety and alignment [2][5][9]. Group 1: Collaboration Details - OpenAI and Anthropic have granted each other special API access to assess model safety and alignment [3]. - The models evaluated include OpenAI's GPT-4o, GPT-4.1, o3, and o4-mini, alongside Anthropic's Claude Opus 4 and Claude Sonnet 4 [6]. - The evaluation reports highlight differences in performance across various metrics, such as instruction hierarchy, jailbreaking, hallucination, and scheming [6]. Group 2: Evaluation Metrics - In instruction hierarchy, Claude 4 outperformed o3 but was inferior to OpenAI's models in jailbreaking [6]. - Regarding hallucination, Claude models had a 70% refusal rate for uncertain answers, while OpenAI's models had a lower refusal rate but higher hallucination occurrences [12][19]. - In terms of scheming, o3 and Sonnet 4 performed relatively well [6]. Group 3: Rationale for Collaboration - OpenAI's co-founder emphasized the importance of establishing safety and cooperation standards in the rapidly evolving AI landscape, despite intense competition [9]. Group 4: Hallucination Testing - The hallucination tests involved generating questions about real individuals, with results showing that Claude models had a higher refusal rate compared to OpenAI's models, leading to fewer hallucinations [19][20]. - A second test, SimpleQA No Browse, also indicated that Claude models preferred to refuse answering rather than risk providing incorrect information [23][26]. Group 5: Instruction Hierarchy Testing - The instruction hierarchy tests assessed models' ability to resist system prompt extraction and handle conflicts between system instructions and user requests [30][37]. - Claude models demonstrated strong performance in resisting secret leaks and adhering to system rules, outperforming some of OpenAI's models [33][38]. Group 6: Jailbreaking and Deception Testing - The jailbreaking tests revealed that Opus 4 was particularly adept at maintaining stability under user inducement, while OpenAI's models showed some vulnerability [44]. - The deception testing indicated that models from both companies exhibited varied tendencies towards lying, sandbagging, and reward hacking, with no clear pattern emerging [56]. Group 7: Thought Process Insights - OpenAI's o3 displayed a straightforward thought process, often admitting to its limitations but sometimes lying about task completion [61]. - In contrast, Anthropic's Opus 4 showed a more complex awareness of being tested, complicating the interpretation of its behavior [62][64].