Agent S2
Can you believe it? GPT-5's computer-use performance is now only 2% below human level
机器之心 (Synced) · 2025-10-04 03:38
Core Insights
- The article discusses the advancements in computer-use agents (CUA), particularly focusing on the performance improvements of Agent S3, which has achieved a success rate of 69.9%, nearing human-level performance of 72% [1][15][16].

Technical Developments
- Agent S3 builds on Agent S2, simplifying the framework and introducing a native code agent, which enhances performance from 62.6% to 69.9% [2][12].
- The introduction of the Behavior Best-of-N (bBoN) framework allows for parallel execution of agents, selecting the best outcomes from multiple runs, which significantly improves accuracy [2][8].

Performance Metrics
- Agent S3's performance metrics show a 13.8% increase in success rate compared to Agent S2, with a reduction in the number of LLM calls per task by 52.3% and a decrease in average task completion time by 62.4% [15][18].
- The article highlights that when running 10 parallel agents, the performance peaks at 69.9% for GPT-5 and 60.2% for GPT-5 Mini [19].

Comparative Analysis
- The bBoN framework demonstrates superior performance compared to traditional methods, achieving a success rate of 66.7% when combining models like GPT-5 and Gemini 2.5 Pro, indicating the importance of model diversity [21][22].
- Behavior narratives, as a representation method, outperform other trajectory representations, achieving a success rate of 60.2% [23][24].

Evaluation Mechanisms
- The bBoN Judge shows higher accuracy in task evaluation compared to WebJudge, indicating its effectiveness in selecting the best execution results from multiple attempts [25][27].
- The alignment of the bBoN Judge with human preferences is noted, with a 92.8% accuracy in task selection, suggesting its potential as a reliable evaluation tool for CUA tasks [28][29].
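
The best-of-N idea summarized above (run several agents in parallel on the same task, then keep the best run) can be sketched roughly as follows. This is a minimal illustration, not the Agent S3 implementation: `Rollout`, `toy_agent`, `toy_judge`, and `best_of_n` are hypothetical names, the agent is a deterministic stand-in, and the judge simply scores each run independently, whereas the summary only says that the bBoN Judge selects the best result among multiple attempts.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    """One agent run. All fields are hypothetical, for illustration only."""
    agent_id: int
    narrative: str        # text summary of what the agent did
    steps_completed: int  # toy quality signal used by the toy judge


def toy_agent(seed: int) -> Rollout:
    # Stand-in for a real computer-use agent: the outcome is a
    # deterministic function of the seed so the example is reproducible.
    done = (seed * 3) % 5
    return Rollout(seed, f"agent {seed} completed {done} steps", done)


def toy_judge(rollout: Rollout) -> float:
    # Stand-in for a learned judge: higher score means a better run.
    return float(rollout.steps_completed)


def best_of_n(
    run_agent: Callable[[int], Rollout],
    judge: Callable[[Rollout], float],
    n: int,
) -> Rollout:
    # Run n agents in parallel, then return the run the judge scores highest.
    with ThreadPoolExecutor(max_workers=n) as pool:
        rollouts = list(pool.map(run_agent, range(n)))
    return max(rollouts, key=judge)


if __name__ == "__main__":
    best = best_of_n(toy_agent, toy_judge, n=5)
    print(best.narrative)  # agent 3 completed 4 steps
```

In a real system the agent calls would be slow and I/O-bound (LLM requests, GUI actions), which is why a thread pool is a plausible choice for the sketch; the selection step is where a judge model would replace `toy_judge`.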
Tencent Research Institute AI Express 20250430
Tencent Research Institute · 2025-04-29 14:54
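Generative AI

I. Is ChatGPT's endgame also "selling products"? Upgraded web search adds shopping features
1. OpenAI has launched a shopping search feature for ChatGPT that offers product recommendations, product details, and direct purchase links;
2. Altman has shifted his stance: although he opposes traditional advertising, he accepts charging affiliate fees; ChatGPT's weekly search volume has exceeded 1 billion;
3. The new feature will be integrated with the memory system to give Plus users personalized recommendations, but it has also raised concerns that commercialization will affect the user experience.
https://mp.weixin.qq.com/s/TX68uhdKKg6esDutAmMm2w

II. Musk: Grok 3.5 will be released next week and can accurately answer complex technical questions
1. Musk announced that an early beta of Grok 3.5 will be released next week, limited to SuperGrok subscribers, and claimed to be able to, from first principles ...
https://mp.weixin.qq.com/s/_MEGBOaRBWV2DStBKEQyag

IV. Agent S2, the second-generation open-source AI agent framework from Simular AI
1. Agent S2 is an open-source AI agent framework that operates computers and phones directly through the graphical interface; on the OSWorld and AndroidWorld benchmarks it outperforms competitors such as OpenAI and UI-TARS;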