vivo AI Lab proposes a self-evolving mobile GUI agent: UI-Genie continuously improves performance with no manual annotation
机器之心 · 2025-11-07 07:17

Core Insights
- The article discusses advances in multimodal large language models (MLLMs) and the development of mobile GUI agents that can autonomously understand and execute complex tasks on smartphones [2][3].

Group 1: Challenges in Mobile GUI Agents
- A significant challenge in training mobile GUI agents is the reliance on high-quality expert demonstration data, which is costly to obtain and limits the agents' generalization and robustness [2][7].
- Whether a GUI operation is correct depends heavily on the historical context, making it difficult to evaluate the effectiveness of each individual action within a task [6][7].

Group 2: UI-Genie Framework
- The UI-Genie framework enables agents to self-evolve through collaboration between an agent model and a reward model, synthesizing high-quality training data without manual annotation [3][27].
- UI-Genie-RM is introduced as the first reward model specialized for evaluating mobile GUI agent trajectories, designed to take the entire operation history into account [9][10]; a hypothetical sketch of such a scorer appears after this summary.

Group 3: Data Generation and Model Iteration
- UI-Genie employs a closed-loop mechanism for data generation and model iteration, comprising reward-guided trajectory exploration, dual expansion of the training data, and progressive enhancement of task complexity [14][19]; a schematic of the loop is sketched after this summary.
- Iterative training has demonstrated significant improvements in task success rate and evaluation accuracy, with the agent's success rate rising from 18.1% to 38.7% [24].

Group 4: Performance and Future Applications
- UI-Genie outperforms baseline methods on both offline and online operation tasks, with the 72B model achieving a 77.0% operation success rate and 86.3% element-localization accuracy [21][23].
- The framework is expected to extend to more complex multimodal interaction scenarios, including desktop agents, and aims to integrate the reward model with reinforcement learning for autonomous growth [27][29].
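To make the history-conditioned evaluation in Group 2 concrete, below is a minimal Python sketch of how a reward model in the spirit of UI-Genie-RM could score individual actions and whole trajectories. All names here (TrajectoryStep, HistoryAwareRewardModel, score_step) are hypothetical illustrations, not the paper's actual API.

```python
# Hypothetical sketch of history-conditioned trajectory scoring in the
# spirit of UI-Genie-RM. All class and method names are illustrative,
# not the paper's actual interface.
from dataclasses import dataclass
from typing import List


@dataclass
class TrajectoryStep:
    screenshot: bytes  # GUI screen captured before the action
    action: str        # e.g. 'tap(x=0.42, y=0.88)' or 'type("weather")'


class HistoryAwareRewardModel:
    """Scores each action conditioned on the full operation history,
    rather than on the current screen alone."""

    def score_step(self, task: str, history: List[TrajectoryStep],
                   current: TrajectoryStep) -> float:
        # A real implementation would feed the task instruction, the
        # prior (screenshot, action) pairs, and the candidate action
        # into a multimodal LLM and read out a scalar reward.
        raise NotImplementedError

    def score_trajectory(self, task: str,
                         steps: List[TrajectoryStep]) -> float:
        # Trajectory-level reward: did the whole sequence complete the
        # task? Approximated here as the mean of step-level rewards,
        # each conditioned on all preceding steps.
        rewards = [self.score_step(task, steps[:i], step)
                   for i, step in enumerate(steps)]
        return sum(rewards) / max(len(rewards), 1)
```

The key design point, per the article, is that each step is judged against the full (screenshot, action) history rather than the current screen in isolation, since the correctness of a GUI operation depends on what came before it.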
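The closed loop from Group 3 can likewise be sketched in pseudo-Python, under the assumption that the agent exposes rollout and fine-tuning interfaces and that the reward model provides trajectory scoring as above. Every function, parameter, and threshold below is a placeholder, not UI-Genie's actual implementation.

```python
# Schematic of the closed-loop self-evolution described in the article:
# reward-guided trajectory exploration, dual expansion of agent and
# reward-model training data, and progressively harder tasks.
# Every name here is a placeholder, not UI-Genie's actual code.

def harder_variant(task: str) -> str:
    # Placeholder: the article describes synthesizing longer-horizon
    # versions of already-solved tasks; here we merely tag the task.
    return task + " (harder)"


def self_evolve(agent, reward_model, seed_tasks, iterations=3,
                rollouts_per_task=8, threshold=0.5):
    tasks = list(seed_tasks)
    for _ in range(iterations):
        agent_data, rm_data = [], []
        for task in tasks:
            # 1) Reward-guided exploration: sample several rollouts and
            #    keep the one the reward model scores highest.
            rollouts = [agent.rollout(task) for _ in range(rollouts_per_task)]
            scored = [(reward_model.score_trajectory(task, r), r)
                      for r in rollouts]
            best_score, best = max(scored, key=lambda pair: pair[0])

            # 2) Dual data expansion: high-reward trajectories train the
            #    agent; all scored rollouts (positive and negative)
            #    train the reward model.
            if best_score >= threshold:
                agent_data.append((task, best))
            rm_data.extend((task, r, s) for s, r in scored)

        agent.finetune(agent_data)
        reward_model.finetune(rm_data)

        # 3) Progressive task complexity: extend solved tasks into
        #    harder variants for the next iteration.
        tasks = [harder_variant(t) for t in tasks]
    return agent, reward_model
```

The "dual expansion" is visible in step 2: high-reward trajectories grow the agent's training set, while all scored rollouts, positive and negative alike, grow the reward model's, so both models improve together without manual annotation.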