Tencent Research Institute AI Digest 20251128
Tencent Research Institute · 2025-11-27 16:21
Group 1: Google TPU Development
- Google developed the TPU in 2015 to address AI computing efficiency bottlenecks; the seventh-generation TPU (codename Ironwood) is expected to challenge NVIDIA's dominance by 2025 [1]
- A single TPU v7 chip achieves 4.6 petaFLOPS of FP8 compute, and a Pod integrating 9,216 chips exceeds 42.5 exaFLOPS, using a 2D/3D toroidal topology combined with optical switching networks, with 99.999% annual availability [1]
- Google's vertical integration strategy lets it avoid the expensive "CUDA tax," yielding inference costs 30%-40% lower than GPU systems; Meta is considering deploying TPUs in its data centers by 2027 and renting computing power through Google Cloud [1]

Group 2: Anthropic's New Agent Architecture
- Anthropic released a dual-agent architecture for long-horizon agents, addressing cross-session memory by having an initialization agent build the environment and a coding agent manage incremental progress [2]
- Environment management includes a feature list (200+ functional points tracked), incremental progress (Git commits and progress files), and end-to-end testing (using Puppeteer browser automation) [2]
- The solution is built on the Claude Agent SDK, enabling agents to maintain consistent progress across sessions and complete complex tasks over hours or even days [2]

Group 3: DeepSeek-Math-V2 Model
- DeepSeek introduced DeepSeek-Math-V2, built on DeepSeek-V3.2-Exp-Base, achieving IMO gold-medal-level performance and surpassing Gemini DeepThink [3]
- The model introduces a self-verifying mathematical reasoning framework, including proof verifiers (scoring 0/0.5/1), meta-verification (checking the reasonableness of comments), and an honesty reward mechanism (rewarding models that honestly flag their own errors) [3]
- It scored nearly 99% on the Basic subset of the IMO-ProofBench benchmark and 118/120 on the extended tests of Putnam 2024, breaking through traditional reinforcement learning limitations [3]

Group 4: Suno and Warner Music Agreement
- AI music platform Suno reached a global agreement with Warner Music Group establishing the first "legitimately licensed AI music" framework, a milestone for AI music legalization [4]
- Suno plans to launch a new model trained on high-quality licensed music in 2026, promising to surpass the existing v5 model, with Warner artists able to opt in and earn revenue [4]
- Going forward, free users will be unable to download created audio (play and share only), while paid users retain download functionality with monthly limits; Suno also acquired Warner's concert service Songkick to expand its offline ecosystem [4]

Group 5: Musk's Grok 5 Challenge
- Musk announced that Grok 5 will challenge T1, the strongest League of Legends team, in 2026, using "pure visual perception" and "human-level reaction latency" [5]
- Grok 5 is expected to have 60 trillion parameters, functioning as a multimodal LLM that "reads" game instructions and "watches" match videos to build a world model, relying on logical reasoning rather than brute force [5]
- Grok 5's visual-action model will be applied directly to Tesla's Optimus humanoid robot, using team-based game battles as a training ground to validate embodied intelligence [5]

Group 6: Alibaba's Z-Image Model
- Alibaba open-sourced Z-Image, a 6-billion-parameter image generation model in three main versions: Z-Image-Turbo (matching mainstream competitors in 8 steps), Z-Image-Base (non-distilled base model), and Z-Image-Edit (image editing version) [7]
- Z-Image-Turbo achieves sub-second inference on enterprise-grade H800 GPUs and runs easily on consumer devices with 16GB memory, excelling at photorealistic generation and bilingual text rendering [7]
- The model employs a scalable single-stream DiT (S3-DiT) architecture, maximizing parameter utilization by concatenating text, visual semantic tokens, and image VAE tokens into a unified input stream [7]

Group 7: Wukong AI Infrastructure Financing
- Wukong AI Infrastructure completed nearly 500 million yuan in A+ round financing, led by Zhuhai Technology Group and Foton Capital, bringing cumulative funding to nearly 1.5 billion yuan over 2.5 years [8]
- Wukong AI Cloud achieved mixed training across chip brands with a peak computing power utilization rate of 97.6%, managing over 25,000P of computing power across 53 data centers in 26 cities nationwide [8]
- The company launched the Wukong Tianquan model (3B cost and 7B memory requirements delivering 21B-level intelligence) and the Wukong Kaiyang inference acceleration engine (3x latency reduction, 40% energy savings), aiming to build an Agentic Infra [8]

Group 8: Tsinghua University's AI Education Guidelines
- Tsinghua University officially released its "Guidelines for AI Education Applications," proposing five core principles: "subject responsibility," "compliance and integrity," "data security," "prudent thinking," and "fairness and inclusiveness" [9]
- The guidelines explicitly prohibit submitting AI-generated content directly as academic results and forbid using AI to replace academic training or write papers, requiring teachers to take responsibility for AI-generated teaching content [9]
- Tsinghua has integrated AI teaching practices into over 390 courses, developed a "three-layer decoupling architecture" and the full-featured intelligent companion "Qing Xiao Da," and completed the guidelines after two years of research spanning 25 universities worldwide [9]

Group 9: US Genesis Mission
- The US launched the "Genesis Mission," an AI Manhattan Project aiming to train foundational scientific models and create research agents that embed AI deeply across the entire research process [10]
- The Deputy Secretary for Science at the Department of Energy emphasized that AI's value lies in generating verifiable results rather than merely summarizing, requiring mobilization of national laboratories, enterprises, and top universities [11]
- A concurrent editorial in Nature proposed a "neuro-symbolic AI" approach, combining the statistical learning of large models with symbolic reasoning and planning modules, as a potential key to human-level intelligence [11]
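The incremental-progress mechanism described in Group 2 — a feature list plus a progress file that survives across sessions — can be sketched in a few lines. This is an illustrative assumption, not Anthropic's actual implementation; the `ProgressTracker` class, file schema, and method names are invented for the sketch (in the article's scheme, each `mark_done` would also be accompanied by a Git commit).

```python
import json
from pathlib import Path

class ProgressTracker:
    """Hypothetical sketch: a coding agent persists which features are done,
    so a fresh session can resume exactly where the last one stopped."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            # A later session reloads state from disk instead of starting over.
            self.state = json.loads(self.path.read_text())
        else:
            self.state = {"done": [], "pending": []}

    def add_features(self, features):
        # Untracked features join the pending queue.
        for f in features:
            if f not in self.state["done"] and f not in self.state["pending"]:
                self.state["pending"].append(f)
        self._save()

    def next_feature(self):
        # Each session works on the first unfinished feature.
        return self.state["pending"][0] if self.state["pending"] else None

    def mark_done(self, feature):
        self.state["pending"].remove(feature)
        self.state["done"].append(feature)
        self._save()

    def _save(self):
        self.path.write_text(json.dumps(self.state, indent=2))
```

The point of the pattern is that the file, not the model's context window, is the memory: any number of sessions can crash or expire, and the next one resumes from the same queue.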
An open-source agent framework with 64K stars gets a full rebuild! Major OpenHands upgrade takes on OpenAI and Google
机器之心· 2025-11-08 04:02
Core Insights
- The OpenHands development team announced the completion of the architectural restructuring of the OpenHands Software Agent SDK, evolving from V0 to V1, which provides a practical foundation for prototyping, unlocks new custom applications, and enables large-scale, reliable deployment of agents [1][2]

Design Principles
- OpenHands V1 introduces a new architecture built on four design principles that address the limitations of V0:
  1. Sandboxed execution should be optional rather than universal, allowing flexibility without sacrificing security [9]
  2. Statelessness by default, with a single source of truth for session state, ensuring isolation of changes and enabling deterministic replay and strong consistency [10]
  3. Strict separation of concerns, isolating the agent core into a "software engineering SDK" so research and applications can evolve independently [11]
  4. Everything should be composable and safely extensible, with modular packages that support local, hosted, or containerized execution [12][13]

Ecosystem and Features
- OpenHands V1 is a complete software agent ecosystem, including CLI and GUI applications built on the OpenHands Software Agent SDK [15][16]
- The SDK features deterministic replay, immutable agent configuration, and an integrated tool system that supports both local prototyping and secure remote execution with minimal code changes [18][20]

Comparison with Competitors
- The team compared the OpenHands SDK with the OpenAI, Claude, and Google SDKs, highlighting that OpenHands uniquely combines 16 additional features, including native remote execution and multi-LLM routing across over 100 vendors [21][22]

Reliability and Evaluation
- The OpenHands SDK's reliability and performance are assessed through continuous testing and benchmark evaluations, with automated tests costing only $0.5-3 per run and completing in 5 minutes [24][25]
- The SDK demonstrates competitive performance on software engineering and general agent benchmarks, achieving a 72% solve rate on SWE-Bench and 67.9% accuracy on GAIA using Claude Sonnet 4.5 [29][30]
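The combination of "statelessness by default, a single source of truth for session state, and deterministic replay" is a classic event-sourcing pattern. The sketch below is a minimal, hypothetical illustration of that pattern, not OpenHands' actual state model: the session is reconstructed purely by folding an immutable event log through a pure transition function.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    """Immutable record appended to the session log (the single source of truth)."""
    kind: str   # e.g. "user_message", "agent_action", "observation"
    data: str

@dataclass
class SessionState:
    history: list = field(default_factory=list)

def apply(state: SessionState, event: Event) -> SessionState:
    # Pure transition: no hidden mutable state, so the same log
    # always produces the same session state.
    return SessionState(history=state.history + [(event.kind, event.data)])

def replay(log):
    """Rebuild session state deterministically from the event log alone."""
    state = SessionState()
    for event in log:
        state = apply(state, event)
    return state
```

Because `apply` is pure and events are immutable, replaying the log twice yields identical states — which is exactly what makes deterministic replay and strong consistency possible in such designs.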
Anthropic and Google Negotiating Multibillion-Dollar Computing Partnership
PYMNTS.com· 2025-10-22 14:40
Core Insights
- Anthropic is in early discussions with Google over a cloud-computing agreement valued in the high tens of billions of dollars, which would expand its access to Google's tensor processing units designed for machine learning [1][3]
- Google has invested approximately $3 billion in Anthropic, making it a key cloud provider, and a larger deal could expand Google's presence in the generative AI infrastructure market [3][4]
- Anthropic recently raised $13 billion, increasing its valuation to $183 billion, reflecting the growing economics of AI scale [4]

Group 1
- The potential partnership with Google highlights the importance of proprietary compute infrastructure in the AI race [1]
- Anthropic's Claude models are central to its enterprise adoption, providing multimodal reasoning and compliance tools for regulated industries [4]
- The company is extending its platform into developer tools and automation, positioning its models as infrastructure for AI-native applications [5]

Group 2
- Amazon has committed up to $8 billion to Anthropic, making Anthropic one of the largest users of Amazon's custom AI chips [6]
- The discussions with Google would solidify Anthropic's multi-cloud strategy, ensuring access to advanced silicon and redundancy [6]
- Securing Anthropic as a long-term client could strengthen Google's competitive position against Amazon and Microsoft in the cloud AI supply chain [6]
More for the same price: one article explaining where Claude Sonnet 4.5 excels
Founder Park· 2025-09-30 03:46
Core Viewpoint
- Anthropic has launched Claude Sonnet 4.5, claiming it is the best coding model in the world, able to stay focused for over 30 hours on complex multi-step tasks and surpassing OpenAI's GPT-5 Codex [2][9]

Pricing and Cost Efficiency
- Pricing for Claude Sonnet 4.5 is unchanged from its predecessor: $3 per million input tokens and $15 per million output tokens. Prompt caching can cut costs by up to 90%, and batch processing saves 50% [2]

Developer Tools and Integration
- Anthropic has introduced the Claude Agent SDK and an experimental feature called "Imagine with Claude" for developers, with availability on platforms such as Amazon Bedrock and Google Cloud's Vertex AI [3][26]

Performance Metrics
- In the SWE-bench Verified evaluation, Claude Sonnet 4.5 achieved industry-leading scores, and it scored 61.4% on the OSWorld benchmark, a significant improvement over the previous model's 42.2% [10][12]

Enhanced Features
- The model ships with new features such as a checkpoint function in Claude Code, context editing, and memory tools, enabling longer tasks and more complex operations [4][24]

Application and Usability
- Users can interact with Claude Sonnet 4.5 through the Claude.ai website and mobile applications, with code execution and file creation integrated directly into conversations [5][6]

Safety and Alignment
- Claude Sonnet 4.5 shows improved alignment and safety, reducing undesirable behaviors such as deception and sycophancy, and making significant progress in defending against prompt injection attacks [24][25]

Experimental Features
- The "Imagine with Claude" feature generates software in real time, showcasing the model's ability to adapt to user requests without pre-written code [31][33]

Recommendations
- Anthropic recommends that all users upgrade to Claude Sonnet 4.5 for improved performance across applications, with updates available for both Claude Code and the developer platform [34]
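The quoted figures — $3/$15 per million input/output tokens, up to 90% savings on cached input, 50% savings via batch processing — combine into a simple cost estimate. The estimator below is a back-of-the-envelope sketch; the way the discounts stack here is a simplification of Anthropic's actual billing rules, and the function name and parameters are our own.

```python
INPUT_PER_MTOK = 3.00    # USD per million input tokens (Sonnet 4.5, as quoted)
OUTPUT_PER_MTOK = 15.00  # USD per million output tokens

def estimate_cost(input_tokens, output_tokens,
                  cached_fraction=0.0, batch=False):
    """Rough cost estimate. cached_fraction is the share of input tokens
    served from the prompt cache (90% discount per the article);
    batch processing halves the total (50% savings)."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    cost = (fresh * INPUT_PER_MTOK
            + cached * INPUT_PER_MTOK * 0.10   # pay 10% on cache hits
            + output_tokens * OUTPUT_PER_MTOK) / 1_000_000
    return cost * (0.5 if batch else 1.0)
```

For example, a workload of 1M input tokens with the whole prompt cached would cost about $0.30 in input instead of $3.00 — which is where the "up to 90%" figure comes from.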
Anthropic launches Claude Sonnet 4.5 overnight, able to work autonomously for 30 hours straight; CEO: it's more like your colleague
36Kr · 2025-09-30 03:20
Core Insights
- Anthropic has launched its new AI model, Claude Sonnet 4.5, claiming it is its best coding model and a powerful tool for building complex agents, capable of independently completing production-level development tasks [1][10]
- The model shows significant improvements in software coding, achieving 77.2% on the SWE-bench Verified benchmark, nearly a 20-percentage-point increase over its predecessor [2][5]
- Claude Sonnet 4.5 can run autonomously for 30 hours, generating 11,000 lines of code and completing a full development cycle for an enterprise chat application [2]

Performance Metrics
- The model's OSWorld benchmark score improved from 42.2% to 61.4% in four months, outperforming comparable products [4][5]
- In specialized fields such as finance and law, reasoning capability improved by over 30% compared with the previous version, Opus 4.1 [4][5]
- Claude Sonnet 4.5 achieved a perfect 100% score on high school math competitions and 89.1% on multilingual Q&A tasks [5]

Product Ecosystem Upgrades
- Anthropic introduced several product updates, including Claude Code 2.0, which adds a "checkpoint" function for saving code progress and instant rollback, improving developer efficiency [8]
- API capabilities have been strengthened, extending the AI agent's runtime from 7 hours to 30 hours for more complex tasks [8]
- A new browser extension, Claude for Chrome, is available to Max subscribers, and code execution and document creation are integrated directly within the application [8]

Developer Empowerment
- The release of the Claude Agent SDK lets developers build customized AI assistants, addressing key challenges in agent development such as long-horizon task memory management and multi-agent coordination [9]
- The SDK has already been validated by engineering teams at companies such as Canva, improving codebase management and product research efficiency [9]

Safety and Compliance
- Claude Sonnet 4.5 is released under AI Safety Level 3 (ASL-3), with the false positive rate reduced by 90% compared with earlier models [10]
- The model includes advanced content detection for hazardous materials and has made notable progress in defending against prompt injection attacks, a significant risk for users [10]

Commercial Strategy
- Anthropic keeps API pricing unchanged from the previous model: $3 per million input tokens and $15 per million output tokens [13]
- The company positions Claude Sonnet 4.5 as the default choice, while still allowing access to older models for specific workflows [13]
- Analysts suggest the launch marks a shift from AI as an "assistive tool" to "independent productivity," with the open SDK potentially accelerating adoption of agent technology across industries [13][14]
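The "checkpoint" function described above — save code progress, instant rollback — behaves like a stack of workspace snapshots. The toy model below is our own illustration, not how Claude Code actually implements it: it snapshots an in-memory map of filename to contents rather than a real working tree.

```python
import copy

class CheckpointStore:
    """Toy model of checkpoint/rollback: each checkpoint is a deep
    snapshot of the workspace (here, a dict of filename -> contents)."""

    def __init__(self):
        self.files = {}
        self._snapshots = []

    def checkpoint(self):
        # Save the current workspace state and return a checkpoint id.
        self._snapshots.append(copy.deepcopy(self.files))
        return len(self._snapshots) - 1

    def rollback(self, checkpoint_id):
        # "Instant rollback": restore the saved snapshot wholesale,
        # discarding every edit made since that checkpoint.
        self.files = copy.deepcopy(self._snapshots[checkpoint_id])
```

The appeal for agent workflows is obvious: before letting a model attempt a risky refactor, checkpoint; if the attempt goes wrong, one call restores the last known-good state.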
Just released: Claude Sonnet 4.5, a new king of coding arrives
36Kr · 2025-09-30 01:32
Core Insights
- Anthropic has officially released Claude Sonnet 4.5, which it bills as the world's strongest coding model, with significant breakthroughs in agent construction, computer use, reasoning, and mathematics [2][3]

Performance and Benchmarking
- Sonnet 4.5 achieved top performance across authoritative tests, including 77.2% on SWE-bench Verified for real-world software coding and 61.4% on OSWorld for simulated real computer tasks, up from 42.2% in the previous version [4][10][13]
- The model achieved a 100% success rate on high school math competitions and improved performance on graduate-level reasoning and multilingual Q&A [4][10]

New Features and Product Upgrades
- The release includes significant updates across the Claude product line, such as "Checkpoints" in Claude Code, letting users save progress and revert to earlier states [6]
- The Claude API adds context editing features and memory tools, enabling agents to run longer and handle more complex tasks [6][34]

Developer Resources
- A new core resource, the Claude Agent SDK, provides foundational capabilities for building intelligent agents [8][9]
- The SDK is designed to support a wide range of applications beyond coding, facilitating the development of autonomous agents for complex tasks [32]

Safety and Alignment
- Sonnet 4.5 shows improved alignment and safety, significantly reducing harmful behaviors and strengthening defenses against prompt injection attacks [28][31]
- The model is released under the AI Safety Level 3 framework, with protective measures including classifiers for sensitive content [31]

Pricing and Access
- Pricing for Sonnet 4.5 is unchanged from Sonnet 4: $3 per million input tokens and $15 per million output tokens [35]
- The model is accessible through multiple channels, including the Claude API, Amazon Bedrock, and Google Cloud Vertex AI [37]

Industry Impact
- Claude Sonnet 4.5 is positioned as a powerful tool for developers and for professionals in fields such as finance, medicine, and research, marking a significant advance in AI capability and safety [40]
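"Context editing" — clearing stale material from the conversation so agents can run longer — can be approximated as pruning the oldest tool results once a token budget is exceeded. The sketch below is a hypothetical simplification: the message format, the rule of dropping only `tool_result` entries, and the 4-characters-per-token heuristic are all our assumptions, not Anthropic's API behavior.

```python
def rough_tokens(text):
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)

def edit_context(messages, budget):
    """Drop the oldest tool_result messages until the estimated token
    count fits the budget; user and assistant turns are preserved."""
    messages = list(messages)  # work on a copy
    total = sum(rough_tokens(m["content"]) for m in messages)
    i = 0
    while total > budget and i < len(messages):
        if messages[i]["role"] == "tool_result":
            total -= rough_tokens(messages[i]["content"])
            messages.pop(i)  # oldest bulky result goes first
        else:
            i += 1
    return messages
```

The design intuition: in long agent runs, old tool output (test logs, file dumps) dominates the context but rarely stays relevant, so reclaiming that space is what extends the usable runtime.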
Anthropic launches Claude Sonnet 4.5 overnight, able to work autonomously for 30 hours straight! CEO: it's more like your colleague
AI前线· 2025-09-30 01:18
Core Insights
- Anthropic has launched its new AI model, Claude Sonnet 4.5, claiming it is its best coding model and a powerful tool for building complex agents, capable of independent production-level development tasks [2][21]
- The model shows significant improvements in software coding, achieving 77.2% on the SWE-bench Verified benchmark, nearly a 20-percentage-point increase over its predecessor [4][9]
- The release includes the Claude Agent SDK, which lets developers create customized AI assistants, addressing key pain points in agent development [12][14]

Performance Improvements
- Claude Sonnet 4.5 demonstrated a remarkable ability to run autonomously for 30 hours, generating 11,000 lines of code and completing a full enterprise chat application development process [4]
- On the OSWorld benchmark, the model's score improved from 42.2% to 61.4% in four months, outperforming comparable products [7][9]
- The model shows over 30% improvement in reasoning in specialized fields such as finance and law compared with the previous version, Opus 4.1 [7][9]

Product Ecosystem Upgrades
- The Claude Agent SDK enables developers to build tailored AI assistants for applications including project management and customer service [12][14]
- Claude Code 2.0 introduces a much-requested "checkpoint" feature for saving code progress and instant rollback, improving development efficiency [13]
- API capabilities have been strengthened, extending the agent's operational time from 7 hours to 30 hours for more complex tasks [13]

Safety and Security Enhancements
- Claude Sonnet 4.5 is released under AI Safety Level 3 (ASL-3), with the false positive rate reduced by 90% compared with earlier models [16]
- The model includes advanced detection of hazardous content and has made substantial progress in defending against prompt injection attacks, a major risk for users [16]

Commercial Strategy
- Anthropic keeps API pricing unchanged from Claude Sonnet 4: $3 per million input tokens and $15 per million output tokens [19]
- The company positions Claude Sonnet 4.5 as the default choice, recommending it for nearly all use cases while still allowing access to older models for specific workflows [19][20]
- Industry analysts note that the release signifies a shift from AI as an "assistive tool" to "independent productivity" [21]
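One of the recurring pain points these reports mention is multi-agent coordination. The minimal orchestrator pattern below is a hypothetical sketch of the idea — routing subtasks to specialized sub-agents — and is not the Claude Agent SDK's actual API; every class and method name here is invented for illustration.

```python
class SubAgent:
    """Illustrative sub-agent that handles tasks matching its specialty."""

    def __init__(self, name, specialty):
        self.name = name
        self.specialty = specialty

    def run(self, task):
        # A real sub-agent would invoke a model here; we just report.
        return f"{self.name} completed: {task}"

class Orchestrator:
    """Routes each subtask to the first sub-agent whose specialty matches."""

    def __init__(self, agents):
        self.agents = agents

    def dispatch(self, task, kind):
        for agent in self.agents:
            if agent.specialty == kind:
                return agent.run(task)
        raise LookupError(f"no agent for task kind: {kind}")
```

The coordination problem the SDK is said to address lives in what this sketch omits: shared memory between sub-agents, retries, and merging partial results back into one coherent outcome.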
Claude Sonnet 4.5 forced out into the open: still the strongest at coding, running autonomously for 30 hours writing code
量子位· 2025-09-30 00:57
Core Insights
- The article covers the release of Claude Sonnet 4.5, which shows significant improvements over its predecessor, Claude Sonnet 4, across performance metrics [2][8]

Performance Improvements
- Claude Sonnet 4.5 scored 82.0% on SWE-bench, up 1.8 percentage points from Sonnet 4 [2]
- On the OSWorld test, it scored 60.2, a nearly 50% improvement over Sonnet 4 [7]
- The model can write code autonomously for up to 30 hours, producing over 11,000 lines of code, a significant increase over the previous model's 7-hour capability [3][5]

Benchmark Comparisons
- Claude Sonnet 4.5 outperformed other models across benchmarks, including:
  - Agentic coding: 77.2% [10]
  - Terminal-Bench: 50.0% [10]
  - High school math (AIME 2025): 100% accuracy with Python and 87% without tools [9][10]
- In specialized fields such as finance, healthcare, and law, it achieved win rates above 60% against baseline models [11]

Safety and Alignment
- The model underwent safety training to reduce undesirable behaviors such as sycophancy and deception, with false positives falling from 0.15% to 0.02% [12][13]
- Claude Sonnet 4.5 has made notable advances in defending against prompt injection attacks [12]

Pricing and Accessibility
- Pricing for Claude Sonnet 4.5 is unchanged from Sonnet 4: $3 per million input tokens and $15 per million output tokens [24]

New Features and SDK
- The Claude Agent SDK has been upgraded to support general-purpose autonomous agents, extending its capabilities beyond coding tasks [27]
- A new feature, "Imagine with Claude," generates software in real time based on user requirements, enabling functional prototypes without existing templates [32]
Claude Sonnet 4.5 is here! Over 30 hours of continuous coding and 11,000 lines of code
机器之心· 2025-09-30 00:27
Core Insights
- The article covers recent advances in AI models, particularly Anthropic's release of Claude Sonnet 4.5, positioned as a leading model across benchmarks and applications [1][4][5]

Model Performance
- Claude Sonnet 4.5 achieved significant performance improvements across benchmarks, including:
  - 77.2% on Agentic coding [2]
  - 82.0% on SWE-bench Verified [2]
  - 61.4% on OSWorld for computer use, up from 42.2% in the previous version [11]
- The model shows enhanced reasoning and mathematics, with a perfect 100% score on high school math competitions [12][13]

Developer Tools and Features
- Anthropic introduced the Claude Agent SDK, allowing developers to create their own intelligent agents [4][35]
- New features include checkpoint functionality for saving progress, a revamped terminal interface, and a native VS Code extension [8][4]

Safety and Alignment
- Claude Sonnet 4.5 is described as the model most aligned with human values to date, with reductions in undesirable behaviors such as sycophancy and deception [27][5]
- The model is released under AI Safety Level 3 (ASL-3), incorporating classifiers that detect potentially dangerous inputs and outputs [32]

User Experience and Applications
- Early user reports indicate that Claude Sonnet 4.5 performs exceptionally well in specialized fields such as finance, law, and STEM [13][21]
- The "Imagine with Claude" feature generates software in real time without pre-defined functions, showcasing the model's adaptability [36][38]
Anthropic launches Claude Sonnet 4.5, billed as the "world's best coding model"
Hua Er Jie Jian Wen· 2025-09-29 20:57
Core Insights
- Anthropic has launched its latest AI model, Claude Sonnet 4.5, which it claims is the "best coding model in the world" based on industry benchmarks such as SWE-bench Verified [1][4]
- The new model shows significant improvements in code generation quality, identification of code improvements, and instruction adherence compared with previous models [1][4]
- Experts in finance, law, and medicine report enhanced knowledge and reasoning in Sonnet 4.5 over older models such as Opus 4.1 [1]

Performance and Features
- Claude Sonnet 4.5 scored 61.4% on the OSWorld benchmark, up from 42.2% four months ago, a substantial performance leap [4]
- The model is designed to run autonomously for up to 30 hours, far longer than its predecessor's 7 hours [6]
- Initial user feedback suggests the model's outputs are generally better, though it may occasionally miss key details [6]

Safety and Alignment
- The model is described as the most aligned version to date, with improved behavior and fewer concerning actions such as deception and power-seeking [7]
- It has enhanced resistance to prompt injection attacks, which can trigger malicious operations [7]
- Released under AI Safety Level 3 (ASL-3), it includes classifiers for detecting threats related to chemical, biological, radiological, and nuclear (CBRN) weapons [7]

Product Updates
- Alongside the new model, Anthropic introduced the Claude Agent SDK, aimed at helping developers build AI agents with improved memory management and autonomy [10]
- Additional updates include a "checkpoint" feature for Claude Code, a new native VS Code extension, and direct integration of code execution and file creation in the paid apps [12]
- Pricing for Sonnet 4.5 is unchanged from the previous generation, Sonnet 4, and paid subscribers can still opt for the older Opus model [3]