Agent S2 - filings, earnings calls, financial reports, news

Agent S2


机器之心 (Synced) · 2025-10-04 03:38

Core Insights
- The article discusses advances in computer-use agents (CUA), focusing on Agent S3, which reaches a success rate of 69.9%, approaching the human-level performance of 72% [1][15][16].

Technical Developments
- Agent S3 builds on Agent S2, simplifying the framework and introducing a native code agent; performance rises from 62.6% to 69.9% [2][12].
- The Behavior Best-of-N (bBoN) framework runs multiple agents in parallel and selects the best outcome from the runs, which significantly improves accuracy [2][8].

Performance Metrics
- Agent S3 shows a 13.8% increase in success rate over Agent S2, with 52.3% fewer LLM calls per task and a 62.4% shorter average task completion time [15][18].
- With 10 parallel agents, performance peaks at 69.9% for GPT-5 and 60.2% for GPT-5 Mini [19].

Comparative Analysis
- The bBoN framework outperforms traditional methods, reaching a success rate of 66.7% when combining models such as GPT-5 and Gemini 2.5 Pro, which indicates the importance of model diversity [21][22].
- Behavior narratives outperform other trajectory representations, achieving a success rate of 60.2% [23][24].

Evaluation Mechanisms
- The bBoN Judge evaluates tasks more accurately than WebJudge, making it effective at selecting the best execution result from multiple attempts [25][27].
- The bBoN Judge aligns with human preferences, with 92.8% accuracy in task selection, suggesting it could serve as a reliable evaluation tool for CUA tasks [28][29].
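The best-of-N selection described in the summary (run several agents in parallel, then let a judge pick the best run) can be illustrated with a minimal Python sketch. This is not the Agent S3 or bBoN implementation: the function names `behavior_best_of_n`, `toy_agent`, and `toy_judge` are hypothetical, the "behavior narrative" is reduced to a plain string, and the judge is a keyword check standing in for a model-based comparison.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def behavior_best_of_n(
    task: str,
    run_agent: Callable[[str, int], str],
    judge: Callable[[str, Sequence[str]], int],
    n: int = 10,
) -> str:
    """Run n independent rollouts in parallel and return the one the judge picks.

    run_agent(task, seed) returns a text summary of one rollout (assumed format).
    judge(task, narratives) returns the index of the preferred rollout.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map preserves input order, so narratives[i] belongs to seed i.
        narratives = list(pool.map(lambda seed: run_agent(task, seed), range(n)))
    return narratives[judge(task, narratives)]


def toy_agent(task: str, seed: int) -> str:
    # Hypothetical stand-in for a real agent: only seed 3 completes the task.
    outcome = "saved file" if seed == 3 else "gave up"
    return f"run {seed}: opened editor, {outcome}"


def toy_judge(task: str, narratives: Sequence[str]) -> int:
    # Hypothetical stand-in for a judge: prefer runs that report success,
    # falling back to the first run when none do.
    scores = [1 if "saved file" in text else 0 for text in narratives]
    return scores.index(max(scores))


if __name__ == "__main__":
    print(behavior_best_of_n("save the document", toy_agent, toy_judge, n=5))
```

In this toy setup, more rollouts help only because one of them happens to succeed and the judge can recognize it. This matches the summary's point that the quality of the judge matters as much as the number of runs.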

Computer Use Agent (CUA)

Behavior Best-of-N (bBoN)

Artificial Intelligence

Agent S3

Agent S2


腾讯研究院 (Tencent Research Institute) AI Express 20250430

腾讯研究院 (Tencent Research Institute) · 2025-04-29 14:54

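Generative AI

I. Is the endgame of ChatGPT also "selling products"? Web search upgraded with shopping features
1. OpenAI has launched a shopping search feature for ChatGPT that offers product recommendations, product details, and direct purchase links;
2. Altman's stance has shifted: although he opposes traditional advertising, he accepts charging affiliate fees. ChatGPT now handles more than 1 billion searches per week;
3. The new feature will be integrated with the memory system to give Plus users personalized recommendations, but it has also raised concerns that commercialization will affect the user experience.
https://mp.weixin.qq.com/s/TX68uhdKKg6esDutAmMm2w

II. Musk: Grok 3.5 will be released next week and can accurately answer complex technical questions
1. Musk announced that an early beta of Grok 3.5 will be released next week, limited to SuperGrok subscribers, and is claimed to be able, from first principles, to ...
https://mp.weixin.qq.com/s/_MEGBOaRBWV2DStBKEQyag

IV. Agent S2, the second-generation open-source AI agent framework from Simular AI
1. Agent S2 is an open-source AI agent framework that operates computers and phones directly through the graphical interface, outperforming competitors such as OpenAI and UI-TARS on the OSWorld and AndroidWorld benchmarks;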

Generative AI

Artificial Intelligence
