Just now, ChatGPT and Claude both shipped major updates; workers who can't act as a boss to AI risk being left behind
36Kr · 2026-02-05 23:04
Core Insights
- The AI landscape is seeing significant advances with OpenAI's release of GPT-5.3-Codex and Anthropic's Claude Opus 4.6, marking a competitive shift in model capabilities and functionality [1][15].

Group 1: OpenAI's GPT-5.3-Codex
- GPT-5.3-Codex demonstrates self-evolution capabilities: it can write code, identify bugs, and even help train the next generation of AI [4][12].
- The model's accuracy on the OSWorld-Verified benchmark rose notably, from 38.2% to 64.7% [4].
- On the SWE-Bench Pro benchmark, GPT-5.3-Codex delivered state-of-the-art performance while using fewer tokens than previous models [9].
- OpenAI's model is designed, trained, and deployed on NVIDIA's GB200 NVL72 system, indicating a close partnership with NVIDIA [14].

Group 2: Anthropic's Claude Opus 4.6
- Claude Opus 4.6 brings a significant improvement in recall, achieving 76% on the MRCR v2 test versus 18.5% for its predecessor [19].
- The model supports a 1-million-token context window and can output up to 128,000 tokens, allowing it to process extensive documents and complex codebases [23].
- In the GDPval-AA evaluation, Claude Opus 4.6 scored 144 points higher than the second-best model, GPT-5.2, showcasing its edge on high-value tasks [23].
- Anthropic's model is integrated into Excel and PowerPoint, enhancing productivity by generating presentations directly from data [26].

Group 3: Comparative Analysis
- GPT-5.3-Codex is characterized as high reliability with low variance, excelling at routine coding and operational tasks [36].
- Claude Opus 4.6 is described as high ceiling with high variance: capable of solving complex problems but occasionally prone to overconfidence [33].
- The focus is shifting from prompt engineering to agent management, signaling a new era in which managing AI capabilities becomes the crucial skill for users [38].