HKU, Moonshot AI, and Collaborators Open-Source OpenCUA: Anyone Can Build Their Own Computer-Use Agent
机器之心 (Synced) · 2025-08-14 01:26
Core Viewpoint
- The article discusses the launch of an open-source framework called OpenCUA for developing computer-use agents (CUA), which includes a flagship model OpenCUA-32B that achieved a 34.8% success rate on the OSWorld-Verified benchmark, surpassing GPT-4o [1][37].

Group 1: OpenCUA Framework
- OpenCUA framework consists of tools for capturing human-computer interactions, a large-scale dataset called AgentNet, and a workflow for converting demonstrations into "state-action" pairs with reasoning [6][9].
- The framework aims to expand data collection across different computer environments and user scenarios, minimizing restrictions on user interactions to enhance scalability [11][12].

Group 2: AgentNet Tool and Dataset
- AgentNet Tool is a cross-platform application that records user interactions on Windows, macOS, and Ubuntu, capturing screen videos and metadata for real-world computer usage demonstrations [13][15].
- The AgentNet dataset includes 22,625 manually annotated computer usage tasks from over 140 applications and 190 websites, with an average of 18.6 steps per task, reflecting task complexity [23][20].

Group 3: OpenCUA Model
- The OpenCUA model integrates reflective long-chain reasoning and cross-domain data, enabling it to perform computer operation tasks in real desktop environments [29][30].
- The model variants, including OpenCUA-7B and OpenCUA-32B, were evaluated against multiple benchmarks, demonstrating superior performance compared to existing models [35][37].

Group 4: Experimental Results
- OpenCUA-32B achieved the highest performance among open-source models with a 34.8% average success rate on the OSWorld-Verified benchmark, significantly closing the gap with proprietary agents [37][38].
- The model's performance improved with the scale of training data, indicating strong potential for further enhancement during testing [45][49].

Group 5: Conclusion
- OpenCUA fills a critical gap in the development of computer-use agents by providing a comprehensive open-source framework, including annotation infrastructure, data processing pipelines, diverse datasets, efficient training strategies, and system evaluation benchmarks [50].
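
The idea of converting a recorded demonstration into "state-action" pairs, mentioned under Group 1, can be illustrated with a minimal Python sketch. This is not the OpenCUA pipeline: the `Frame`, `Action`, and `Step` types, their fields, and the `to_state_action_pairs` function are all hypothetical names chosen for illustration. The sketch assumes a recording consists of timestamped screenshots and input events, and it pairs each action with the most recent screenshot taken at or before it. The `reasoning` field is left empty as a placeholder for a later annotation pass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    """A screenshot captured at a point in time (hypothetical schema)."""
    t: float
    image_path: str


@dataclass
class Action:
    """A user input event such as a click or keystroke (hypothetical schema)."""
    t: float
    kind: str
    detail: str


@dataclass
class Step:
    """One state-action pair; `reasoning` is filled in by a later annotation pass."""
    state: Frame
    action: Action
    reasoning: Optional[str] = None


def to_state_action_pairs(frames, actions):
    """Pair each action with the latest frame captured at or before it."""
    frames = sorted(frames, key=lambda f: f.t)
    steps = []
    i = 0
    for action in sorted(actions, key=lambda a: a.t):
        # Advance to the last frame whose timestamp does not exceed the action's.
        while i + 1 < len(frames) and frames[i + 1].t <= action.t:
            i += 1
        if frames and frames[i].t <= action.t:
            steps.append(Step(state=frames[i], action=action))
        # Actions with no preceding frame are skipped: there is no observed state.
    return steps


demo_frames = [Frame(0.0, "f0.png"), Frame(1.0, "f1.png"), Frame(2.5, "f2.png")]
demo_actions = [Action(0.4, "click", "x=10,y=20"), Action(2.6, "type", "hello")]
pairs = to_state_action_pairs(demo_frames, demo_actions)
for step in pairs:
    print(step.state.image_path, step.action.kind)
```

In this toy recording, the click at 0.4 s is paired with the first screenshot and the typing at 2.6 s with the third. A real pipeline would also need to merge low-level events (for example, individual keystrokes) into meaningful actions, which this sketch does not attempt.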