Giving GUI Agents a "World Model": Alibaba Tongyi Uses Hybrid Data and a Unified Chain of Thought to Teach Models to Anticipate Screen Changes
QbitAI · 2026-03-04 02:44
Contributed by the Tongyi Qianwen team | QbitAI (public account QbitAI)

With the development of multimodal large models, GUI Agents are becoming a new paradigm for human-computer interaction. In real production environments, however, building a highly available, cross-platform GUI Agent faces numerous engineering and algorithmic challenges.

Real environments are full of CAPTCHAs and unexpected pop-ups, which makes long-trajectory data extremely difficult to collect. Action spaces differ significantly across platforms such as mobile, desktop, and browser, so mixed training can easily trigger gradient conflicts. At the same time, real tasks usually require the model to have long-horizon memory, tool invocation, and multi-agent collaboration capabilities.

To break through the technical barriers that keep native GUI models from end-to-end deployment, Alibaba's Tongyi Lab has open-sourced a new-generation multi-platform GUI Agent framework, Mobile-Agent-v3.5, and simultaneously released the family of native foundation models behind it, GUI-Owl-1.5.

Authors: Haiyang Xu*, Xi Zhang*, Haowei Liu*, Junyang Wang*, Zhaoqing Zhu*, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyu …
AI Playing Jigsaw Puzzles Sharply Boosts Visual Understanding: An Annotation-Free Post-Training Paradigm for Multimodal Large Models That Moves Beyond Text-Centric Training
36Kr · 2025-10-15 12:27
Core Insights
- The article discusses the significance of a new post-training paradigm for multimodal large language models (MLLMs) that emphasizes visual self-supervised learning, particularly through a method called Visual Jigsaw [1][12].

Group 1: Visual Jigsaw Methodology
- Visual Jigsaw is designed as a self-supervised task that focuses on reconstructing visual information by predicting the correct order of shuffled visual elements, applicable to images, videos, and 3D data [5][12].
- The training process utilizes a reinforcement learning algorithm called GRPO, incorporating a tiered reward mechanism based on the accuracy of the model's predictions [5][6].

Group 2: Experimental Results
- Image Jigsaw training led to consistent improvements across three vision-centric benchmarks, enhancing fine-grained perception, spatial understanding from monocular images, and compositional visual reasoning [7][8].
- Video Jigsaw training demonstrated stable enhancements in video understanding benchmarks, particularly in tasks requiring temporal reasoning and understanding [9][10].
- 3D Jigsaw training resulted in significant improvements in various 3D benchmark tasks, especially in depth estimation, indicating enhanced overall spatial perception and reasoning capabilities [11][12].

Group 3: Implications and Future Directions
- The introduction of Visual Jigsaw provides a lightweight, verifiable, and annotation-free self-supervised post-training paradigm, revitalizing visual perception in MLLMs [12].
- The research aims to inspire further development of self/weakly supervised tasks that focus on visual information, enabling better perception and understanding of various visual data [12].
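The core mechanics described above, shuffling visual elements and rewarding the model for recovering their original order, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the 0.5 partial-credit scale are hypothetical, and the article does not specify the exact tiering formula beyond "based on the accuracy of the model's predictions".

```python
import random

def make_jigsaw_sample(patches):
    """Shuffle an ordered sequence of visual elements (e.g. image patches)
    and return the shuffled inputs plus the permutation the model must
    predict to recover the original order."""
    indices = list(range(len(patches)))
    random.shuffle(indices)
    shuffled = [patches[i] for i in indices]
    # target[j] = original position of the element now at shuffled slot j
    return shuffled, indices

def tiered_reward(pred, target):
    """Tiered reward (assumed form): full credit for an exactly correct
    ordering, partial credit scaled by per-position accuracy otherwise."""
    if pred == target:
        return 1.0
    correct = sum(p == t for p, t in zip(pred, target))
    return 0.5 * correct / len(target)
```

In a GRPO-style setup, several candidate orderings would be sampled per puzzle and this reward would rank them within the group; the verifiability the article highlights comes from the fact that the reward needs no human labels, only the known shuffle.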