Adaptive Exploration Policy Optimization Framework (AEPO)
AAAI 2026 Oral | InfiGUI-G1 Arrives, Setting a New GUI Grounding SOTA
机器之心· 2026-01-05 06:09
Core Insights

- The article discusses advances in multimodal large language models (MLLMs) and the challenges of GUI grounding, particularly the distinction between spatial alignment and semantic alignment [2][6][7]
- A new framework, Adaptive Exploration Policy Optimization (AEPO), is introduced; it drives the strong performance of the InfiGUI-G1 model on GUI grounding tasks [2][14]

Group 1: GUI Grounding Challenges

- GUI grounding maps natural language commands to specific screen elements; the task decomposes into spatial alignment (accurate positioning) and semantic alignment (correct element identification) [6][7]
- Existing methods, particularly those based on Reinforcement Learning with Verifiable Rewards (RLVR), excel at spatial alignment but struggle with semantic alignment because of the "confidence trap," in which a model repeatedly makes high-confidence but incorrect predictions (a minimal sketch of this failure mode follows the summary below) [8][10]

Group 2: InfiGUI-G1 Model and AEPO Framework

- The InfiGUI-G1 model, developed by a research team from Zhejiang University, Hong Kong Polytechnic University, and InfiX.ai, uses AEPO to overcome the exploration inefficiency of conventional RL methods [2][14]
- AEPO consists of three core components (illustrative sketches follow the summary below):
  1. Multi-Answer Generation, which lets the model emit multiple candidate coordinates in a single pass, raising the likelihood that the correct answer is among them [15]
  2. Adaptive Exploration Reward (AER), which scores the generated answers according to efficiency principles [16]
  3. Collinear Penalty, which discourages geometrically aligned candidate points so that exploration stays diverse in semantic space [16]

Group 3: Performance Evaluation

- InfiGUI-G1 was evaluated on challenging benchmarks such as MMBench-GUI, ScreenSpot-Pro, and UI-Vision, outperforming existing models, including ones with significantly larger parameter counts [19]
- Notably, InfiGUI-G1-7B beat models such as Qwen2.5-VL-72B and GPT-4o on several metrics, demonstrating its strength on semantic understanding tasks [19]
- The model improved by more than 60% on difficult samples, indicating that it recovers knowledge previously masked by limited exploration [20]

Group 4: Conclusion and Future Outlook

- The success of InfiGUI-G1 shows that the performance bottleneck of GUI agents lies not only in visual recognition but also in reinforcement learning strategies that effectively address semantic alignment [23]
- Its adaptive exploration mechanism lets InfiGUI-G1 achieve superior GUI grounding with a smaller model, laying a solid foundation for more intelligent GUI interaction assistants [23][24]
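
One way to see the "confidence trap" concretely: under a binary verifiable reward with group-normalized advantages (GRPO-style normalization is an assumption here; the article does not name the base RL algorithm), a model that confidently samples the same wrong coordinate in every rollout receives identical rewards, so every advantage, and with it the policy gradient, collapses to zero. A minimal sketch:

```python
import statistics

def group_advantages(rewards):
    """Group-relative advantages, GRPO-style (assumed; not named in the article).

    Each rollout's advantage is its reward minus the group mean, scaled by
    the group standard deviation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored the same: no learning signal at all.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Confidence trap: a confident-but-wrong model clicks the same incorrect
# point in every rollout, every binary reward is 0, the gradient vanishes.
print(group_advantages([0, 0, 0, 0]))    # -> [0.0, 0.0, 0.0, 0.0]

# Multi-answer generation raises the chance that some rollout contains the
# right element, restoring a usable non-zero signal.
print(group_advantages([0, 0, 0.5, 0]))
```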
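
The article describes the Adaptive Exploration Reward only as scoring answers "based on efficiency principles," so the formula below is an illustrative assumption, not the paper's definition: a correct candidate earns more reward the earlier it appears and the fewer candidates the rollout spends. The helper name `adaptive_exploration_reward` and the bounding-box hit test are likewise hypothetical:

```python
import math

def adaptive_exploration_reward(candidates, target_box):
    """Illustrative AER sketch (hypothetical formula, not the paper's).

    candidates: list of (x, y) click points from one multi-answer rollout.
    target_box: (x0, y0, x1, y1) ground-truth box of the target element.
    """
    x0, y0, x1, y1 = target_box
    n = len(candidates)
    for rank, (x, y) in enumerate(candidates, start=1):
        if x0 <= x <= x1 and y0 <= y <= y1:
            # Discount the hit by how late it appears (rank) and how many
            # candidates were spent overall (n): the "efficiency principle"
            # the article alludes to.
            return 1.0 / math.sqrt(rank * n)
    return 0.0  # no candidate landed on the target element
```

Under this shaping, a single precise answer scores 1.0, while a correct answer buried in a long candidate list still earns partial credit, which keeps exploration alive without rewarding indiscriminate guessing.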
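
The collinear penalty is likewise described only at the level of intent (avoid geometrically aligned points); the cross-product test and the flat penalty value below are assumptions chosen for illustration:

```python
def collinear_penalty(candidates, tol=1e-6, penalty=0.5):
    """Illustrative collinearity check (hypothetical penalty scheme).

    Returns `penalty` when all candidate points lie on (nearly) one line,
    i.e. the rollout explored along a single geometric axis instead of
    spreading across distinct screen elements.
    """
    if len(candidates) < 3:
        return 0.0
    (ax, ay), (bx, by) = candidates[0], candidates[1]
    for cx, cy in candidates[2:]:
        # Twice the area of the triangle spanned by three points (a cross
        # product); a non-zero value means this point breaks the line.
        if abs((bx - ax) * (cy - ay) - (cx - ax) * (by - ay)) > tol:
            return 0.0
    return penalty
```

A rollout's total score could then be, for example, `adaptive_exploration_reward(...) - collinear_penalty(...)`, though how AEPO actually combines the two terms is not specified in the article.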