Giving GUI Agents a "World Model": Alibaba Tongyi Uses Hybrid Data and a Unified Chain of Thought to Teach Models to Anticipate Screen Changes
QbitAI · 2026-03-04 02:44
Contributed by the Tongyi Qianwen team | QbitAI (public account QbitAI)

With the development of multimodal large models, GUI Agents are becoming a new paradigm for human-computer interaction. In real production environments, however, building a highly available, cross-platform GUI Agent faces numerous engineering and algorithmic challenges.

Real environments are full of CAPTCHAs and unexpected pop-ups, which makes long-trajectory data extremely difficult to collect. Action spaces differ significantly across platforms such as mobile, desktop, and browser, so mixed training can easily trigger gradient conflicts. At the same time, real tasks usually require the model to have long-horizon memory, tool invocation, and multi-agent collaboration capabilities.

To break through the technical barriers that keep native GUI models from end-to-end deployment, Alibaba's Tongyi Lab has open-sourced a new-generation multi-platform GUI Agent framework, Mobile-Agent-v3.5, and simultaneously released the family of native foundation models behind it, GUI-Owl-1.5.

Authors: Haiyang Xu*, Xi Zhang*, Haowei Liu*, Junyang Wang*, Zhaoqing Zhu*, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyu …
AI Playing Jigsaw Puzzles Sharply Boosts Visual Understanding: An Annotation-Free Post-Training Paradigm for Multimodal Large Models That Moves Beyond Text-Centric Training
36Kr · 2025-10-15 12:27
Core Insights
- The article discusses the significance of a new post-training paradigm for multimodal large language models (MLLMs) that emphasizes visual self-supervised learning, particularly through a method called Visual Jigsaw [1][12].

Group 1: Visual Jigsaw Methodology
- Visual Jigsaw is designed as a self-supervised task that focuses on reconstructing visual information by predicting the correct order of shuffled visual elements, applicable to images, videos, and 3D data [5][12].
- The training process utilizes a reinforcement learning algorithm called GRPO, incorporating a tiered reward mechanism based on the accuracy of the model's predictions [5][6].

Group 2: Experimental Results
- Image Jigsaw training led to consistent improvements across three vision-centric benchmarks, enhancing fine-grained perception, spatial understanding from monocular images, and compositional visual reasoning [7][8].
- Video Jigsaw training demonstrated stable enhancements in video understanding benchmarks, particularly in tasks requiring temporal reasoning and understanding [9][10].
- 3D Jigsaw training resulted in significant improvements in various 3D benchmark tasks, especially in depth estimation, indicating enhanced overall spatial perception and reasoning capabilities [11][12].

Group 3: Implications and Future Directions
- The introduction of Visual Jigsaw provides a lightweight, verifiable, and annotation-free self-supervised post-training paradigm, revitalizing visual perception in MLLMs [12].
- The research aims to inspire further development of self/weakly supervised tasks that focus on visual information, enabling better perception and understanding of various visual data [12].
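The core mechanics described above, shuffling visual elements and rewarding the model for recovering their original order, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the 0.5 partial-credit scale are hypothetical, and the article does not specify the exact tiering formula beyond "based on the accuracy of the model's predictions".

```python
import random

def make_jigsaw_sample(patches):
    """Shuffle an ordered sequence of visual elements (e.g. image patches)
    and return the shuffled inputs plus the permutation the model must
    predict to recover the original order."""
    indices = list(range(len(patches)))
    random.shuffle(indices)
    shuffled = [patches[i] for i in indices]
    # target[j] = original position of the element now at shuffled slot j
    return shuffled, indices

def tiered_reward(pred, target):
    """Tiered reward (assumed form): full credit for an exactly correct
    ordering, partial credit scaled by per-position accuracy otherwise."""
    if pred == target:
        return 1.0
    correct = sum(p == t for p, t in zip(pred, target))
    return 0.5 * correct / len(target)
```

In a GRPO-style setup, several candidate orderings would be sampled per puzzle and this reward would rank them within the group; the verifiability the article highlights comes from the fact that the reward needs no human labels, only the known shuffle.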