Claude Opus

X @Anthropic
Anthropic· 2025-08-15 19:41
Model Capabilities - Claude Opus 4 and 4.1 were given the ability to end a rare subset of conversations on a specific platform [1] Research & Development - The company is conducting exploratory work on potential model welfare [1]
Claude Just Got a Big Update (Opus 4.1)
Matthew Berman· 2025-08-05 23:02
Model Release & Performance - Anthropic released Claude Opus 4.1, an upgrade to Claude Opus 4 with gains mainly in agentic tasks, real-world coding, and reasoning [1] - On SWE-bench Verified, Claude Opus 4.1 scores 74.5%, up 2 percentage points from Opus 4's 72.5% [3] - On Terminal-Bench, its terminal-use score rises from 39.2% to 43.3%, up 4.1 percentage points [4] - On GPQA Diamond (graduate-level reasoning), its score rises from 79.6% to 80.9%, up 1.3 percentage points [4] - On TAU-bench (agentic tool use), its retail score rises from 81.4% to 82.4%, up 1 percentage point, while its airline score falls from 59.6% to 56%, down 3.6 percentage points [5] - On the multilingual Q&A benchmark, its score rises from 88.8% to 89.5%, up 0.7 percentage points [5] - On AIME 2025, its score rises 2.5 percentage points to 78% [5] Competitive Positioning & Future Outlook - On SWE-bench and Terminal-Bench, Claude Opus 4.1 outperforms OpenAI's o3 and Gemini 2.5 Pro [5] - On GPQA Diamond and the agentic tool use benchmark, Claude Opus 4.1 trails OpenAI's o3 and Gemini 2.5 Pro [6] - On the high school math competition benchmark, Claude Opus 4.1 scores only 78%, below OpenAI's o3 (88.9%) and Gemini 2.5 Pro (88%) [6] - Claude is currently widely regarded as the best coding model on the market, particularly strong at agentic coding and agent-driven development [7]
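The percentage-point changes quoted in this summary can be re-derived from the before/after scores. The snippet below is a minimal sketch of that arithmetic; the dictionary keys and the `point_deltas` helper are illustrative names chosen here, and the figures are copied from the summary rather than from an independent source.

```python
# Scores in percent as quoted in the summary: (Claude Opus 4, Claude Opus 4.1).
# Keys are illustrative labels, not official benchmark identifiers.
SCORES = {
    "SWE-bench Verified": (72.5, 74.5),
    "Terminal use": (39.2, 43.3),
    "GPQA Diamond": (79.6, 80.9),
    "Agentic tool use (retail)": (81.4, 82.4),
    "Agentic tool use (airline)": (59.6, 56.0),
    "Multilingual Q&A": (88.8, 89.5),
}


def point_deltas(scores):
    """Return the change in percentage points per benchmark, rounded to one decimal."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


if __name__ == "__main__":
    for name, delta in point_deltas(SCORES).items():
        # A leading sign makes regressions (such as the airline score) easy to spot.
        print(f"{name}: {delta:+.1f} pts")
```

Rounding to one decimal avoids floating-point noise such as 4.1 being reported as 4.0999.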
Stop Guessing! The Creator of Redis Recommends: For Writing Code and Hunting Bugs, These 2 LLMs Reign Supreme!
程序员的那些事· 2025-07-21 06:50
Core Viewpoint - The article emphasizes that while large language models (LLMs) like Gemini 2.5 Pro can significantly enhance programming capabilities, human programmers still play a crucial role in ensuring code quality and collaborating effectively with LLMs [4][11][12]. Group 1: Advantages of LLMs in Programming - LLMs can help eliminate bugs before code reaches users, as demonstrated in the author's experience with Redis [4]. - They enable faster exploration of ideas by generating throwaway code for quick testing of solutions [4]. - LLMs can assist in design activities by combining human intuition and experience with the extensive knowledge embedded in LLMs [4]. - They can write specific code segments based on clear human instructions, thus accelerating work progress [5]. - LLMs can fill knowledge gaps, allowing programmers to tackle areas outside their expertise [5]. Group 2: Effective Collaboration with LLMs - Human programmers should avoid "vibe coding" and maintain oversight to ensure code quality, especially for complex tasks [6]. - Providing ample context and information to LLMs is essential for effective collaboration, including relevant documentation and brainstorming records [7][8]. - Choosing the right LLM is critical; Gemini 2.5 Pro is noted for its superior semantic understanding and bug detection capabilities [9]. - Programmers should avoid using integrated programming agents and maintain direct control over the coding process [10][16]. Group 3: Future of Programming with LLMs - The article suggests that while LLMs will eventually take on more programming tasks, human oversight will remain vital for decision-making and quality control [11][12]. - Maintaining control over the coding process allows programmers to learn and ensure that the final output aligns with their vision [12]. - The article warns against ideological resistance to using LLMs, as this could lead to a disadvantage in the evolving tech landscape [13].
X @Elon Musk
Elon Musk· 2025-07-18 18:10
AI Model Comparison - Grok 4 Heavy is considered superior to Claude Opus based on user experience [1] - The user canceled their Claude subscription, indicating a strong preference for Grok 4 Heavy [1]
Employees Spend $1,000 a Day Just to Use Claude Code! Founder: Too Expensive, Only for Big Companies, but It Beats Cursor!
AI前线· 2025-06-14 04:06
Core Viewpoint - Anthropic's Claude Code is a powerful coding assistant that excels in handling large codebases, but its high cost is a significant barrier to widespread adoption [1][2][3]. Pricing and User Experience - Claude Code's pricing can easily exceed $50 to $200 per month for regular developers, making it less accessible for casual users [1][9][10]. - Users have noted that while Claude Code is more capable than other tools like Cursor, its cost is a deterrent for many [1][2]. - The user experience is described as somewhat cumbersome, lacking multi-modal support, but it significantly outperforms other tools in terms of capability [2][3]. Development Philosophy and Future Vision - Anthropic aims to transform developers from mere code writers to decision-makers regarding code correctness, indicating a shift in the role of developers [4][9]. - The development of Claude Code was influenced by the diverse technology stacks used by engineers, leading to a terminal-based solution that integrates seamlessly into existing workflows [5][6]. Community Feedback and Adoption - The initial community feedback for Claude Code has been overwhelmingly positive, with rapid adoption among internal users at Anthropic [7][8]. - The tool was initially kept internal due to its effectiveness but was later released to the public, confirming its value in enhancing productivity [7][8]. Technical Integration and Functionality - Claude Code operates directly in the terminal, allowing for a flexible and efficient coding experience without the need for new tools or platforms [5][6]. - It can handle various tasks, from simple bug fixes to complex coding challenges, and is designed to work with multiple coding environments [11][19]. Evolution of Programming Paradigms - The introduction of Claude Code represents a significant evolution in programming, moving from manual coding to a more collaborative approach with AI [12][18]. 
- Developers are encouraged to adapt to this new paradigm where they coordinate AI agents to assist in coding tasks, shifting their focus from writing code to reviewing and managing AI-generated code [18][19]. Future Directions - Anthropic is exploring ways to enhance Claude Code's integration with various tools and platforms, aiming for a more seamless user experience [27][28]. - The company is also considering enabling Claude Code to handle smaller tasks through chat interfaces, further expanding its usability [27][28].
o3-pro Draws a Crowd by Solving a Hard Word Puzzle; Ex-OpenAI Employee Mocks Apple: If This Isn't Reasoning, What Is?
量子位· 2025-06-13 02:25
Core Viewpoint - OpenAI's latest reasoning model, o3-pro, demonstrates strong reasoning capabilities but has mixed performance in various evaluations, indicating a need for context and specific prompts to maximize its potential [1][2][3][4]. Evaluation Results - o3-pro achieved a correct answer in 4 minutes and 25 seconds during a reasoning test, showcasing its ability to process complex queries [2]. - In official evaluations, o3-pro surpassed previous models like o3 and o1-pro, becoming the best coding model from OpenAI [8]. - However, in the LiveBench ranking, o3-pro showed only a slight advantage over o3 with a score difference of 0.07, and it lagged behind o3 in agentic coding scores (31.67 vs 36.67) [11]. Contextual Performance - o3-pro excels in short context scenarios, showing improvement over o3, but struggles with long context processing, scoring 65.6 compared to Gemini 2.5 Pro's 90.6 in 192k context tests [15][16]. - The model's performance is highly dependent on the background information provided, as noted by user experiences [24][40]. User Insights - Bindu Reddy, a former executive at Amazon and Google, pointed out that o3-pro lacks proficiency in tool usage and agent capabilities [12]. - Ben Hylak, a former engineer at Apple and SpaceX, emphasized that o3-pro's effectiveness increases significantly when treated as a report generator rather than a chat model, requiring ample context for optimal results [22][24][26]. Comparison with Other Models - Ben Hylak found o3-pro's outputs to be superior to those of Claude Opus and Gemini 2.5 Pro, highlighting its unique value in practical applications [39]. - The model's ability to understand its environment and accurately describe tool usage has improved, making it a better coordinator in tasks [30][31]. 
Conclusion - The evaluation of o3-pro reveals that while it has advanced reasoning capabilities, its performance is contingent on the context and prompts provided, necessitating a strategic approach to maximize its utility in various applications [40][41].
Tencent Research Institute AI Express 20250528
腾讯研究院· 2025-05-27 15:44
Group 1 - UAE becomes the first country to offer free access to ChatGPT Plus for all citizens, part of a collaboration with OpenAI [1] - Abu Dhabi will establish the Stargate UAE high-performance AI data center, supporting a 1 GW computing cluster with an initial target of 200 MW capacity [1] - The collaboration is part of OpenAI's "nation-focused" initiative, with UAE committing to match US funding, potentially totaling up to $20 billion [1] Group 2 - OpenAI has enabled singing capabilities for GPT-4o, seen as a response to Google's Gemini 2.5 Pro and Veo3 releases [2] - Google's Gemini 2.5 Pro has outperformed OpenAI and Claude models in several benchmark tests [2] - Analysts believe that the singing feature of GPT-4o is insufficient to regain market leadership, emphasizing the need for OpenAI to launch GPT-5 soon [2] Group 3 - Claude Opus solved a stubborn bug that had troubled a veteran C++ engineer for four years, taking only a few hours [3] - The AI identified the root cause of the issue by analyzing the codebase and comparing architectures, a problem that had previously stumped other models [3] - Despite its debugging prowess, AI is still considered to be at a beginner level in writing new code [3] Group 4 - French non-profit AI research organization Kyutai launched Unmute, a modular voice AI system that can quickly add voice interaction capabilities to any text LLM [4] - Unmute features low latency (200-350 ms), streaming speech-to-text and text-to-speech, full-duplex interaction, and 10-second voice cloning, supporting over 70 emotional styles [5] - Kyutai plans to fully open-source Unmute in the coming weeks, including STT (1B parameters) and TTS (2B parameters) models and code [5] Group 5 - Alibaba Tongyi launched QwenLong-L1-32B, a large model addressing long-context reasoning issues, with a maximum context length of 130,000 tokens [6] - The team identified two core challenges, low training efficiency and instability, and proposed progressive context expansion techniques and a mixed reward mechanism [6] - QwenLong-L1-32B outperforms models like OpenAI-o3-mini and Qwen3-235B-A22B, showing significant advantages in long document analysis [6] Group 6 - Metaso AI Search introduced a new "Ultra" model, achieving a response speed of 400 tokens per second, with most queries answered within 2 seconds [7] - The new model utilizes kernel fusion on GPUs and dynamic compilation optimization on CPUs, achieving performance breakthroughs on a single H800 GPU [7] - Metaso offers both "Ultra" and "Ultra·Thinking" modes optimized for different types of questions, along with a temporary speed test site for user experience [7] Group 7 - RayNeo officially released the X3 Pro AI glasses, featuring a custom large model and full-color display, priced at 8,999 yuan [8] - The X3 Pro utilizes a 4nm Qualcomm Snapdragon AR1 platform and a proprietary Firefly light engine with RayNeo waveguide technology, achieving a brightness of 3,500 nits (peak 6,000 nits) and weighing only 76g [8] - The product is available for pre-order and will ship on June 15, supporting an AI Agent store and real-world navigation features [8] Group 8 - The core team of Meta's Llama faces significant talent loss, with 11 out of 14 core authors having left, leaving only 3 remaining [10] - Among the departed, 5 joined the French AI open-source startup Mistral, including two main architects of Llama [10] - Meta is under pressure from open-source models like DeepSeek and Qwen and, despite investing billions, still lacks a dedicated "reasoning" model [10] Group 9 - The Beihang University team proposed the "Flying-on-a-Word" (Flow) task, enabling drone control through language commands, filling a gap in low-level language interaction control research [11] - The team constructed the UAV-Flow benchmark dataset, containing 30,000 real-world flight trajectories across eight major movement types [11] - The research addressed drone computational limitations by performing model inference at the ground station and providing real-time feedback for control commands [11] Group 10 - NVIDIA experts recommend that students integrate multiple skills and enhance adaptability, not limited to computer science backgrounds, to stand out in the job market [12] - Job seekers should clarify their interests in the AI field, responsibly use AI tools, and build industry connections for career development opportunities [12] - Candidates can showcase their technical abilities, professional knowledge, and innovative thinking through project examples to excel in interviews [12]
Tencent Research Institute AI Express 20250516
腾讯研究院· 2025-05-15 14:38
Group 1: Regulatory Developments - A U.S. senator proposed a bill requiring companies like NVIDIA and AMD to embed geolocation tracking in high-end GPUs and AI chips, effective in six months [1] - The regulation covers AI processors, high-performance servers, and high-end graphics cards like the RTX 5090, aimed at preventing strategic hardware from flowing to unauthorized countries [1] - Chip manufacturers will be responsible for product tracking, and the bill mandates annual assessments for three years, potentially leading to more restrictions [1] Group 2: AI Model Updates - OpenAI officially launched the GPT-4.1 model in ChatGPT, available for Plus, Pro, and Team users, with enterprise and education users to gain access in the coming weeks [2] - GPT-4.1 shows excellent performance in coding tasks and instruction adherence, with significantly improved generation speed, serving as an ideal replacement for previous models [2] - The context window for ChatGPT's GPT-4.1 is limited to 128k tokens, falling short of the promised 1 million tokens in the API version, disappointing users [2] Group 3: New AI Models and Features - Anthropic plans to release new versions of Claude Sonnet and Opus, featuring "extreme reasoning" capabilities that establish a dynamic loop between reasoning and tool usage [3] - The new models can autonomously pause, reassess problems, and adjust strategies, with capabilities to automatically test and correct errors in code generation tasks [3] - A new model, codenamed Neptune, is reportedly in testing, supporting a maximum context length of 128k tokens [3] Group 4: Advancements in Voice Technology - MiniMax's new voice model, Speech-02, surpasses OpenAI and ElevenLabs in metrics like word error rate and speaker similarity, achieving state-of-the-art levels [4][5] - Speech-02 enables true zero-shot voice cloning and employs an innovative Flow-VAE architecture, requiring only a few seconds of audio to replicate speaker characteristics [5] - The model supports 32 languages and allows flexible control over voice tone and emotional modulation, at only a quarter of the cost of competitor ElevenLabs, marking a shift towards personalized AI voice technology [5] Group 5: Browser and Audio Innovations - Tencent launched the Yuanbao browser plugin for Chrome, offering features like word highlighting for questions, content summarization, foreign webpage translation, and one-click bookmarking [6] - The plugin includes a floating ball and sidebar for easy access to screenshot questions, file uploads, and content searches, enhancing web browsing efficiency [6] - Stability AI partnered with Arm to introduce the Stable Audio Open Small model, the fastest audio generation model for mobile, capable of generating 11 seconds of audio in 8 seconds [7] - The model, with 341 million parameters, is designed for short audio and sound effect generation, using data from copyright-free sources, but currently only supports English prompts [7] Group 6: Video Generation and Gaming AI - Alibaba released the open-source Wan2.1-VACE video generation model, supporting multiple tasks like text-to-video and image reference generation, usable on consumer-grade graphics cards [8] - The model comes in two versions: 1.3B (supporting 480P) and 14B (supporting 720P), utilizing an innovative video condition unit for various input types [8] - Tencent's Hunyuan model powers an intelligent NPC system for the game "BUD," enabling autonomous actions, personalized interactions, emotional expression, and memory reasoning [10] - The game achieved over 20 million AI dialogues within three months, and the upcoming release of Hunyuan Image 2.0 is aimed at enhancing the AI product matrix [10] Group 7: AI Opportunities and Challenges - Sequoia Capital detailed the "trillion-dollar AI opportunity," emphasizing that AI is disrupting both software and service profit pools, with the application layer being the most valuable [12] - The emerging economy of intelligent agents will not only convey information but also facilitate transactions, track relationships, and build trust, leading to a nested economic network of human-machine collaboration [12] - The industry faces three major technical challenges: persistent identity authentication for intelligent agents, seamless communication protocol development, and security assurance, entering a new era of "high leverage, low certainty" [12]
New Claude Versions Leaked: "Extreme Reasoning" Is the Biggest Highlight
量子位· 2025-05-15 04:26
Core Viewpoint - OpenAI has launched GPT-4.1 for free, while Anthropic is expected to release new models, Claude Sonnet and Claude Opus, focusing on "Extreme reasoning" capabilities [1][3]. Group 1: New Features of Claude Models - The new "Extreme reasoning" feature establishes a dynamic loop between reasoning and tool usage, allowing for smarter problem handling [2]. - The model pauses and reevaluates problems when faced with difficulties, adjusting its strategy as needed [7]. - It can automatically adjust its direction if it encounters challenges or provides inaccurate answers, mimicking human thought processes [8]. Group 2: Code Generation Capabilities - For code generation tasks, the model tests the generated code and corrects errors instead of merely outputting results [9]. - The architecture of the new model is designed to adapt to various tasks and scenarios, reducing reliance on human supervision [10]. Group 3: Human-like Reasoning - The model can engage in deep reflection based on context rather than just statistical language generation [11]. - This collaborative reasoning approach brings the new model closer to human-like thinking, allowing it to reason rather than function solely as a "calculator" [12]. Group 4: Community Reactions and Testing - Some users express skepticism about the claims, suggesting potential hype, while others defend the credibility of the source, The Information [13][14]. - There are reports of a model called Claude Neptune being tested, which is suspected to be Claude 3.8 with a maximum token count of 128k [17].
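The test-and-correct behavior described for code generation can be pictured as a simple loop: generate code, run tests, feed any failure report back in, and retry. The sketch below is a toy illustration under stated assumptions, not Anthropic's implementation; `generate_test_fix`, `toy_generate`, and `toy_run_tests` are hypothetical names, and the "model" is a stub that fixes its bug once it receives feedback.

```python
from typing import Callable, Optional, Tuple


def generate_test_fix(
    generate: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    task: str,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Loop: generate code, test it, and feed failures back until tests pass."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        code = generate(task, feedback)
        passed, report = run_tests(code)
        if passed:
            return code, round_no
        feedback = report  # The failure report drives the next attempt.
    return None, max_rounds


def toy_generate(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for a model: buggy first, corrected after feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def toy_run_tests(code: str) -> Tuple[bool, str]:
    """Hypothetical test harness: execute the code and check add(2, 3)."""
    namespace: dict = {}
    exec(code, namespace)
    result = namespace["add"](2, 3)
    if result == 5:
        return True, "all tests passed"
    return False, f"add(2, 3) returned {result}, expected 5"


if __name__ == "__main__":
    code, rounds = generate_test_fix(toy_generate, toy_run_tests, "write add(a, b)")
    print(f"passed after {rounds} round(s)")
```

The `max_rounds` cap is a design choice of this sketch: it keeps the loop from running indefinitely when no attempt passes.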