Core Insights
- The article discusses the emergence of GUI Agents as a new paradigm in human-computer interaction, driven by the development of multimodal large models [1]
- It highlights the challenges of building high-availability, cross-platform GUI Agents in real-world environments, including difficulties in data collection and the need for long-term memory and multi-Agent collaboration [2]

Group 1: Technical Developments
- Alibaba's Tongyi Lab has open-sourced the Mobile-Agent-v3.5 framework and the underlying GUI-Owl-1.5 model family to lower the technical barriers to deploying native GUI models [2][6]
- The GUI-Owl-1.5 model family has achieved leading results on over 20 mainstream GUI benchmarks, enabling unified control across desktop, mobile, and browser platforms [6]

Group 2: Architectural Design
- The GUI-Owl-1.5 architecture decouples execution from reasoning and supports edge-cloud collaboration through two model variants: Instruct for rapid response and Thinking for complex tasks [9]
- The system allows external tool invocation and supports the Model Context Protocol (MCP), for example for complex calculations or database queries [9]

Group 3: Core Technical Principles
- The performance of GUI-Owl-1.5 is attributed to enhancements in data pipeline construction, restructured internal reasoning logic, and reinforcement learning algorithms [10]
- A hybrid data pipeline addresses the challenges of long-trajectory synthesis, using multimodal models to generate high-resolution UI screenshots [12][15]

Group 4: Reinforcement Learning Innovations
- The MRPO (Multi-platform Reinforcement Policy Optimization) algorithm tackles engineering bottlenecks in multi-platform reinforcement learning, such as gradient conflicts and training instability [20][22]
- The algorithm employs an online oversampling mechanism that preserves the on-policy assumption while increasing sample diversity [22][28]

Group 5: Evaluation and Performance
- The GUI-Owl-1.5 model family has been rigorously evaluated across multiple dimensions, setting new state-of-the-art (SOTA) results in the open-source domain [34]
- The model shows significant performance gains on mobile and PC platforms, with the MRPO approach outperforming mixed training [35]

Group 6: Grounding and Tool Collaboration
- The model excels at grounding, achieving high accuracy on visual localization tasks without cropping tools and improving further with a two-stage Zoom-In strategy [41]
- In complex scenarios, the model demonstrates cross-application collaboration and long-term memory management, reliably executing multi-step workflows [46][49]

Group 7: Conclusion
- The release of Mobile-Agent-v3.5 provides a comprehensive technical reference for building engineering-grade Agents that can execute long processes across multiple platforms [51]
- The project's model weights, Agent framework source code, and online demo are available on GitHub for community engagement and technical exchange [52]
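The two-stage Zoom-In grounding strategy mentioned above can be sketched as follows. This is a minimal illustration, not the actual GUI-Owl-1.5 implementation: `predict_region` and `predict_point` are hypothetical stand-ins for model calls, and the key idea shown is the coordinate mapping from the zoomed crop back to the full screenshot.

```python
# Hypothetical sketch of a two-stage "Zoom-In" grounding pass:
# stage 1 localizes a coarse region on the full screenshot,
# stage 2 refines the click point inside the zoomed crop.

def predict_region(image_size, query):
    # Placeholder for a coarse model pass: returns a bounding box
    # (left, top, right, bottom) in full-image pixel coordinates.
    w, h = image_size
    return (w // 4, h // 4, 3 * w // 4, 3 * h // 4)

def predict_point(crop_size, query):
    # Placeholder for a refined model pass on the zoomed crop:
    # returns an (x, y) point in crop-local coordinates.
    w, h = crop_size
    return (w // 2, h // 2)

def zoom_in_ground(image_size, query):
    """Two-stage grounding: coarse region, then a refined point mapped back."""
    left, top, right, bottom = predict_region(image_size, query)
    crop_size = (right - left, bottom - top)
    cx, cy = predict_point(crop_size, query)
    # Map the crop-local point back to global screenshot coordinates.
    return (left + cx, top + cy)

print(zoom_in_ground((1920, 1080), "click the Save button"))  # → (960, 540)
```

The mapping step is what makes the refinement usable: the second pass sees only the high-resolution crop, so its output must be translated back into full-screenshot coordinates before the click is issued.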
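The online oversampling idea attributed to MRPO can also be sketched. The sketch below is an assumption-laden illustration of the general pattern (grouped rollouts with a reward-diversity filter, as in GRPO-style training), not the published algorithm: `run_rollout`, the group size, and the retry budget are all invented for the example.

```python
import random

# Sketch of an online oversampling loop: rollouts are always drawn from the
# current policy (keeping the on-policy assumption), and whole groups are
# resampled until rewards are mixed, so the group-relative advantage signal
# is non-degenerate. `run_rollout` is a stand-in for executing the policy
# in a real GUI environment.

def run_rollout(task, policy_version):
    # Placeholder: returns a binary task reward from one on-policy rollout.
    return random.random() < 0.5

def sample_diverse_group(task, policy_version, group_size=8, max_tries=4):
    """Resample rollout groups until successes and failures coexist."""
    for _ in range(max_tries):
        rewards = [run_rollout(task, policy_version) for _ in range(group_size)]
        if 0 < sum(rewards) < group_size:  # both outcomes present
            return rewards
    return None  # degenerate task at this policy: skip it for this step

random.seed(0)
group = sample_diverse_group("open settings and enable dark mode", policy_version=1)
```

Discarding all-success or all-failure groups matters because a group with identical rewards carries zero learning signal under group-normalized advantages; resampling fresh groups (rather than reusing stale trajectories) is what keeps the data on-policy.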
Equipping GUI Agents with a "world model": Alibaba Tongyi uses hybrid data + a unified chain of thought to teach models to anticipate screen changes