ICLR 2026 | LongHorizonUI: Keeping GUI Agents from "Giving Up Halfway" with a Unified, Robust Automation Framework for Long-Horizon Tasks
机器之心 (Jiqizhixin) · 2026-03-12 08:19

Core Viewpoint
- The article discusses LongHorizonUI, a unified framework for automating long-horizon tasks with GUI agents. It targets the sharp drop in success rates that occurs once a task exceeds 10 to 15 steps [2][5].

Group 1: Research Background
- Traditional GUI automation methods struggle with long operation sequences: their success rate falls from over 90% on 5-step sequences to below 75% on sequences longer than 10 steps, and to around 60% on sequences longer than 15 steps [5].
- The research team identified the need for a solution that maintains contextual consistency and decision accuracy across long operation sequences [5].

Group 2: Benchmark Development
- A new benchmark, LongGUIBench, was created to evaluate long-horizon tasks. Every task has at least 15 steps, and the average is 22.1 steps [7].
- The dataset has two categories: general application scenarios, with 147 end-to-end task chains averaging 19.5 steps, and gaming scenarios, with 207 high-complexity chains averaging 23.7 steps, the longest of which reaches 37 steps [7].

Group 3: Core Methodology
- LongHorizonUI consists of three core modules: Multi-modal Enhanced Perception (MEP), Deep Reflective Decision (DRD), and Compensatory Executor (CAE). Together they form a complete loop from perception to execution [9][19].
- MEP enhances perception by assigning a unique spatial index ID to each UI element and resolving ambiguities in composite controls through a semantic binding mechanism [12].
- DRD enforces a three-level reasoning process to ensure decision accuracy: historical validation, goal checking, and action interpretability [12].
- CAE maps decision outputs to physical screen coordinates and employs multiple strategies to ensure successful execution [13].

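The perception-to-execution loop described above can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: every name here (`UIElement`, `Decision`, `perceive`, `decide`, `execute`, `bbox_center`, `top_left`) is hypothetical, the three decision checks are simple stand-ins for historical validation, goal checking, and action interpretability, and the two coordinate strategies are assumed examples of "multiple strategies".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

BBox = Tuple[int, int, int, int]   # (left, top, right, bottom) in screen pixels
Point = Tuple[int, int]


@dataclass
class UIElement:
    index_id: int   # unique spatial index ID assigned during perception
    label: str      # semantic label bound to the element
    bbox: BBox


@dataclass
class Decision:
    target_id: int
    history_ok: bool   # check 1: the step does not repeat completed work
    goal_ok: bool      # check 2: the target matches the current sub-goal
    rationale: str     # check 3: a readable reason for the action


def perceive(raw: Sequence[Tuple[str, BBox]]) -> List[UIElement]:
    """Perception stage: give every detected element a unique index ID."""
    return [UIElement(i, label, bbox) for i, (label, bbox) in enumerate(raw)]


def decide(elements: List[UIElement], sub_goal: str,
           history: List[str]) -> Optional[Decision]:
    """Decision stage: emit an action only if all three checks pass."""
    target = next((e for e in elements if e.label == sub_goal), None)
    history_ok = sub_goal not in history
    goal_ok = target is not None
    rationale = f"tap '{sub_goal}' to advance the task" if goal_ok else ""
    if not (history_ok and goal_ok and rationale):
        return None
    return Decision(target.index_id, history_ok, goal_ok, rationale)


def bbox_center(el: UIElement) -> Optional[Point]:
    left, top, right, bottom = el.bbox
    if right <= left or bottom <= top:
        return None   # degenerate box, defer to the next strategy
    return ((left + right) // 2, (top + bottom) // 2)


def top_left(el: UIElement) -> Optional[Point]:
    return (el.bbox[0], el.bbox[1])


def execute(decision: Decision, elements: List[UIElement],
            strategies: Sequence[Callable[[UIElement], Optional[Point]]]
            ) -> Optional[Point]:
    """Execution stage: map the index ID to coordinates, trying strategies in order."""
    target = next(e for e in elements if e.index_id == decision.target_id)
    for strategy in strategies:
        point = strategy(target)
        if point is not None:
            return point
    return None


elements = perceive([("Settings", (0, 0, 100, 40)), ("Submit", (20, 200, 120, 240))])
decision = decide(elements, "Submit", history=["Settings"])
tap_point = execute(decision, elements, [bbox_center, top_left])
```

In this sketch the decision stage returns `None` rather than guessing when a check fails, and the executor only falls back to the second strategy when the first cannot produce a point. Both choices are assumptions made for illustration.
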
Group 4: Experimental Results
- LongHorizonUI showed clear advantages on long-horizon tasks. In general scenarios it achieved a success rate of 85.3% on low-level instructions and 52.3% on high-level instructions, outperforming previous methods [15].
- In gaming scenarios, the success rates were 83.9% on low-level instructions and 52.1% on high-level instructions, with an overall average of 77.3% [15].
- The framework also achieved 90.4% average accuracy on ScreenSpot, a cross-platform UI element localization benchmark, indicating robustness across platforms [15].
- In a 50-step long-chain setting, LongHorizonUI reached a success rate of 29.4%, surpassing prior baselines [16].

Group 5: Conclusion
- LongHorizonUI provides a comprehensive solution for long-horizon GUI automation. Its structured design mitigates error accumulation, and the LongGUIBench benchmark offers a standardized evaluation platform for future research [19].
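
To get a feel for why error accumulation dominates long chains, here is a small back-of-the-envelope sketch. It rests on a simplifying assumption that does not come from the article: every step succeeds independently with the same probability, so an n-step chain succeeds with probability p^n. The function names are hypothetical.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step ** steps


def per_step_needed(chain_rate: float, steps: int) -> float:
    """Per-step success rate implied by a given chain-level success rate."""
    return chain_rate ** (1.0 / steps)


# Per-step rate implied by a 90% success rate on 5-step chains,
# then compounded out to 15 steps (equals 0.9 ** 3 under this model).
p = per_step_needed(0.90, 5)
rate_at_15 = chain_success(p, 15)

# Per-step rate implied by the reported 29.4% success rate at 50 steps.
p_50 = per_step_needed(0.294, 50)
```

Under this toy model, 90% at 5 steps compounds to about 72.9% at 15 steps, and a 29.4% success rate over 50 steps corresponds to a per-step rate of roughly 97.6%. Real agents need not follow the independence assumption, so treat these numbers as intuition rather than as results from the paper.
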