Can You Believe It? GPT-5's Computer-Use Performance Is Now Only 2% Below Human Level
机器之心 (Jiqizhixin) · 2025-10-04 03:38
Core Insights
- The article covers advances in computer-use agents (CUA), focusing on Agent S3, which reaches a 69.9% success rate, close to the human-level 72% [1][15][16].

Technical Developments
- Agent S3 builds on Agent S2, simplifying the framework and adding a native code agent; the reported success rate rises from 62.6% to 69.9% [2][12].
- The Behavior Best-of-N (bBoN) framework runs multiple agents in parallel and selects the best result from those runs, which significantly improves accuracy [2][8].

Performance Metrics
- Compared with Agent S2, Agent S3 shows a 13.8% higher success rate, 52.3% fewer LLM calls per task, and a 62.4% shorter average task completion time [15][18].
- With 10 agents running in parallel, performance peaks at 69.9% for GPT-5 and 60.2% for GPT-5 Mini [19].

Comparative Analysis
- The bBoN framework outperforms traditional methods; combining models such as GPT-5 and Gemini 2.5 Pro reaches a 66.7% success rate, which points to the importance of model diversity [21][22].
- Behavior narratives outperform other trajectory representations, reaching a 60.2% success rate [23][24].

Evaluation Mechanisms
- The bBoN Judge evaluates tasks more accurately than WebJudge, which indicates that it is effective at selecting the best execution from multiple attempts [25][27].
- The bBoN Judge aligns with human preferences, with 92.8% accuracy in task selection, suggesting it could serve as a reliable evaluation tool for CUA tasks [28][29].
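
The bBoN idea described under Technical Developments (run several agents in parallel, describe each run as a behavior narrative, let a judge pick one) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Rollout` structure, `to_narrative`, `best_of_n`, and the toy `fake_agent` and `fake_judge` are hypothetical names, and the article does not specify how Agent S3 actually represents narratives or prompts its judge.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    """One complete attempt at a task (hypothetical structure)."""
    agent_id: int
    steps: List[str]  # action descriptions recorded during the run


def to_narrative(rollout: Rollout) -> str:
    """Condense a rollout into a short text description of what it did.

    Stand-in for a "behavior narrative"; the real format is an assumption.
    """
    return f"agent {rollout.agent_id}: " + "; ".join(rollout.steps)


def best_of_n(
    task: str,
    run_agent: Callable[[str, int], Rollout],
    judge: Callable[[str, Sequence[str]], int],
    n: int = 10,
) -> Rollout:
    """Run n agents in parallel and return the rollout the judge prefers."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map preserves input order, so index i belongs to agent i.
        rollouts = list(pool.map(lambda i: run_agent(task, i), range(n)))
    narratives = [to_narrative(r) for r in rollouts]
    winner = judge(task, narratives)  # index into narratives
    return rollouts[winner]


# Toy stand-ins so the sketch runs without any model or desktop environment.
def fake_agent(task: str, agent_id: int) -> Rollout:
    steps = ["open settings"]
    if agent_id % 3 == 2:  # only some agents finish the task
        steps.append("enable dark mode")
    return Rollout(agent_id, steps)


def fake_judge(task: str, narratives: Sequence[str]) -> int:
    # Prefer the first narrative that mentions the goal; fall back to index 0.
    for i, text in enumerate(narratives):
        if task in text:
            return i
    return 0


best = best_of_n("enable dark mode", fake_agent, fake_judge, n=4)
print(best.agent_id, best.steps)
```

In a real system, `run_agent` would drive a desktop environment and `judge` would be a model call comparing the narratives; the default `n=10` only mirrors the 10 parallel agents mentioned in the summary.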