Agentic Programming
Works 7 Hours Straight Without "Getting Tired": OpenAI's Most Powerful Coding Model, GPT-5-Codex, Arrives
36Kr · 2025-09-16 02:07
Core Insights
- OpenAI has released GPT-5-Codex, an optimized version of GPT-5 specifically for software engineering, enhancing its programming capabilities [1][2]
- The model can dynamically adjust its thinking time based on task complexity, allowing it to work independently on large tasks for over 7 hours [1][4]
- GPT-5-Codex has shown improved accuracy in benchmark tests compared to GPT-5, with a reported accuracy of 74.5% in software engineering tasks [4][5]
Group 1: Model Features and Performance
- GPT-5-Codex is designed for complex engineering tasks, including project construction, feature addition, debugging, and code review [4]
- The model's accuracy in code refactoring tasks is 51.3%, significantly higher than GPT-5's 33.9% [5]
- In code reviews, GPT-5-Codex has a lower error comment rate of 4.4% compared to GPT-5's 13.7%, and a higher rate of high-impact comments at 52.4% [9][10]
Group 2: Developer Tools and Integration
- GPT-5-Codex is integrated into various developer tools, including Codex CLI and IDE extensions, allowing seamless transitions between local and cloud environments [2][16]
- The Codex CLI has been updated to allow developers to share images and track progress on complex tasks, enhancing collaboration [14]
- The IDE extension enables developers to use Codex within popular code editors, streamlining the coding process and maintaining context [16][17]
Group 3: Competitive Landscape
- The AI programming tool market is becoming increasingly competitive, with products like OpenAI Codex, Claude Code, and GitHub Copilot vying for dominance [21]
- OpenAI's recent upgrades to Codex demonstrate its commitment to enhancing automation and collaboration in programming tasks, reflecting the intensifying competition in the sector [21]
Why Is Claude Code So Good? The Model Team Uses Its Own Product Every Day and Fixes Bugs the Moment They Find Them
36Kr · 2025-09-04 08:16
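How can you tell whether a model or product has really improved? Simple: spend a full day doing real work with it.
Anthropic recently announced a $13 billion funding round that values the company at $183 billion, a raise second only to OpenAI's historic $40 billion round in March 2025.
At the same time, the company is facing a new test: many users have found that its flagship product, Claude Code, appears to have gotten "dumber", and some developers have already switched to Codex CLI, the competing product from OpenAI.
Setting these recent controversies aside, Claude Code is a very successful product. It has taken a large number of users from Cursor and reached 115,000 users within 4 months of launch.
Why has this product succeeded? In a recent interview, Boris Cherny, the head of Claude Code, shared some details of how the team built it, including a product philosophy of extreme simplicity and high extensibility, an evaluation standard that puts real hands-on experience above benchmarks, and a mechanism for responding to user feedback as fast as possible.
Video link: https://www.youtube.com/watch?v=iF9iV4xponk
The details follow:
1. What has changed in programming over the past 12 months?
A year ago, if you ...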
Why Is Claude Code So Good? The Model Team Uses Its Own Product Every Day and Fixes Bugs the Moment They Find Them
机器之心· 2025-09-04 07:04
Core Insights
- Anthropic recently announced a $13 billion funding round, bringing its valuation to $183 billion, second only to OpenAI's historic $40 billion funding in March 2025 [1]
- Despite some user complaints regarding its flagship product, Claude Code, which has been reported to have "dumbing down" issues, the product has successfully captured a significant user base, reaching 115,000 users within four months of launch [3]
Group 1: Product Performance and User Experience
- Claude Code is designed with a philosophy of simplicity and high scalability, focusing on real user experience over benchmark evaluations [3]
- The transition in programming workflows has shifted from manual coding and copy-pasting to a more hands-off approach where developers instruct agents to execute code modifications [6][7]
- The evolution of models and tools, particularly Claude Code, has significantly improved programming capabilities, allowing for better integration of context management and tool usage [9]
Group 2: Feedback and Iteration
- Rapid feedback response is crucial for product improvement, with the team actively addressing bugs and user suggestions to foster a continuous feedback loop [15][17]
- The internal feedback mechanism for Claude Code remains highly active, contributing to the product's rapid iteration and enhancement [17]
Group 3: Future Developments and User Adaptation
- The next 6 to 12 months will see a deeper integration of manual and automated programming, with Claude Code evolving to handle more complex project management tasks [20][21]
- Developers are encouraged to adapt to these changes by focusing on core programming skills while also embracing creativity and innovation in project development [23]
- New users are advised to first understand existing codebases with Claude Code before attempting to generate new code, emphasizing a strategic approach to task complexity [24][29]
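Just In: Alibaba Open-Sources Its Most Powerful Coding Model, with 480 Billion Parameters, Agent Scores That Crush Kimi K2, and Training Details Made Public
36Kr · 2025-07-22 23:53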
Core Insights
- Alibaba's Qwen team has released its latest flagship programming model, Qwen3-Coder-480B-A35B-Instruct, which is claimed to be the most powerful open-source programming model to date, featuring 480 billion parameters and supporting up to 1 million tokens in context [1][2][16]
- The model has achieved state-of-the-art performance in various programming and agent tasks, surpassing other open-source models and even competing with proprietary models like GPT-4.1 [1][3][20]
- Qwen3-Coder is designed to significantly enhance productivity, allowing novice programmers to accomplish tasks in a fraction of the time it would take experienced developers [2][24]
Model Specifications
- Qwen3-Coder offers multiple sizes, with the current release being the most powerful variant at 480 billion parameters, which is greater than Alibaba's previous flagship model Qwen3 at 235 billion parameters but less than Kimi K2 at 1 trillion parameters [2][3]
- The model supports a native context of 256K tokens and can be extended to 1 million tokens, optimized for programming tasks [16][20]
Performance Metrics
- In benchmark tests, Qwen3-Coder has outperformed other models in categories such as Agentic Coding, Agentic Browser Use, and Agentic Tool Use, achieving the best performance among open-source models [1][3][20]
- Specific performance metrics include scores in various benchmarks, such as 69.6 in SWE-bench Verified and 77.5 in TAU-Bench Retail, showcasing its capabilities in real-world programming tasks [3][20]
Pricing Structure
- The API for Qwen3-Coder is available on Alibaba Cloud's platform with a tiered pricing model based on input token volume, ranging from $1 to $6 per million tokens for input and $5 to $60 for output, depending on the token range [4][5][24]
- The pricing is competitive compared to other models like Claude Sonnet 4, which has lower input and output costs [4][5]
User Experience and Applications
- Qwen3-Coder has been made available for free on the Qwen Chat web platform, allowing users to experience its capabilities firsthand [6][24]
- Users have reported impressive results in various tasks, including game development and UI design, with the model demonstrating high completion rates and aesthetic quality [9][11][12]
Future Developments
- The Qwen team is actively working on enhancing the model's performance and exploring self-improvement capabilities for coding agents [24]
- More model sizes are expected to be released, aiming to balance deployment costs and performance [24]
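As a rough illustration of the tiered pricing described under Pricing Structure, the sketch below estimates the cost of one request when the per-million-token rate depends on the input token volume. Only the endpoints come from the summary ($1 to $6 per million input tokens, $5 to $60 per million output tokens). The tier boundaries, the intermediate rates, and the names `Tier`, `EXAMPLE_TIERS`, and `request_cost` are hypothetical, as is the assumption that the whole request is billed at the rate of the tier its input falls into.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_input_tokens: int     # inclusive upper bound of the tier
    input_usd_per_m: float    # USD per million input tokens
    output_usd_per_m: float   # USD per million output tokens


# Hypothetical tier table: only the lowest and highest rates ($1/$5 and
# $6/$60) appear in the summary; boundaries and middle rates are made up.
EXAMPLE_TIERS = (
    Tier(32_000, 1.0, 5.0),
    Tier(128_000, 2.0, 10.0),
    Tier(256_000, 3.0, 15.0),
    Tier(1_000_000, 6.0, 60.0),
)


def request_cost(input_tokens: int, output_tokens: int, tiers=EXAMPLE_TIERS) -> float:
    """Return the estimated USD cost of a single request.

    Assumes the tier is selected by input token count and that both input
    and output tokens are billed at that tier's rates.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    for tier in tiers:  # tiers are ordered by ascending upper bound
        if input_tokens <= tier.max_input_tokens:
            total = (input_tokens * tier.input_usd_per_m
                     + output_tokens * tier.output_usd_per_m)
            return total / 1_000_000
    raise ValueError("input exceeds the largest tier")


if __name__ == "__main__":
    # A short prompt falls into the cheapest tier, a long one into the most expensive.
    print(f"{request_cost(10_000, 2_000):.4f}")
    print(f"{request_cost(500_000, 1_000):.4f}")
```

Passing the tier table as a parameter keeps the arithmetic separate from the prices, so the placeholder values can be swapped for the provider's published table.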