GUI Agent
Giving GUI Agents a "World Model": Alibaba Tongyi Uses Mixed Data Plus a Unified Chain of Thought to Teach Models to Anticipate Screen Changes
量子位· 2026-03-04 02:44
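Contributed by the Tongyi Qianwen team | 量子位 (WeChat official account QbitAI)

With the development of multimodal large models, GUI Agents are becoming a new paradigm for human-computer interaction. In real production environments, however, building a highly available, cross-platform GUI Agent faces many engineering and algorithmic challenges. Real environments are full of CAPTCHAs and unexpected pop-ups, which makes long-trajectory data extremely hard to collect. Action spaces differ significantly across platforms such as phone, desktop, and browser, so mixed training easily causes gradient conflicts. Real tasks also usually require long-horizon memory, tool calling, and multi-agent collaboration.

To address the technical barriers that native GUI models face in end-to-end deployment, Alibaba's Tongyi Lab has open-sourced Mobile-Agent-v3.5, a new-generation multi-platform GUI Agent framework, and simultaneously released GUI-Owl-1.5, the family of native foundation models behind it.

Listed authors (truncated in the source): Haiyang Xu*, Xi Zhang*, Haowei Liu*, Junyang Wang*, Zhaoqing Zhu*, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyu ...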
Agent 2.0 Explained: The New "Interconnection" Battleground Inside the Phone
21世纪经济报道 · 2026-02-26 23:12
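Over the past two years, agents have been the AI industry's most important narrative; now the spotlight is narrowing onto a more specific direction: on-device agents. Overseas, an agent called OpenClaw has taken off in Silicon Valley's tech circles, taking over many developers' computers; in China, ByteDance has embedded Doubao into a phone, and prices for the engineering units remain high on the second-hand market. These agents run on phones, computers, and cars and can operate the local environment and all of its tools: ordering takeout, playing games, trading stocks, pushing execution to the extreme.

Phone agents: is the experience regressing?

More and more agents are moving from the cloud onto personal devices. In China, the Doubao Phone Assistant marked the point at which on-device agents broke into the mainstream, but the path did not begin there.

Agents will take over more personal devices. After the engineering edition of the "Doubao Phone Assistant" went on sale, ByteDance, according to media reports, started a project for an official phone at the end of last year, and a new agent-equipped handset is expected to launch in Q2 of this year.

We have also recently learned from multiple sources that several apps, including those in the Alibaba family, have reached a truce with ByteDance: the apps allow manual login on Nubia devices, Doubao voluntarily restricts the scenarios in which the AI operates, and the two sides have gone back to staying out of each other's way.

An industry consensus is forming: the moat for future agents lies in how many personal devices they can connect and how many services they can interlink. Agents aim to become a new capability layer that reorganizes how we connect with devices and with apps.

But this technical trend toward interconnection has also run into compliance boundaries. To operate a phone, an agent needs to use highly sensitive permissions to ...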
ByteDance's Open-Source GUI Agent Tops GitHub Trending; Core Technology Behind the Doubao Phone Passes 26k Stars
量子位· 2026-02-08 07:11
Core Insights
- The article highlights the success of ByteDance's self-developed GUI Agent model UI-TARS, which has topped GitHub's trending list and surpassed 26k stars, outperforming OpenAI's official Skills [1][3].

Group 1: Technology Overview
- UI-TARS is a multimodal AI agent that carries out complex operations in a range of software from natural-language commands, mimicking how a human interacts with a screen [5][9].
- Its core logic is "purely vision-driven": the AI observes the screen as a human eye would, so it can operate whether or not APIs are available and however complex the interface is [11][12].
- The technology comprises two main projects: Agent TARS, which operates in both web UI and server environments, and UI-TARS-desktop, a desktop application for operating the local computer and browser [6][8].

Group 2: Development and Evolution
- UI-TARS aims to equip agents with four key capabilities: perception, action, reasoning, and memory [21].
- The project began a year ago and has evolved significantly; the initial version used 6 million high-quality tutorial examples to strengthen its deep-thinking capabilities [20][24].
- Subsequent iterations, UI-TARS-1.5 and UI-TARS-2, improved the agent's performance, addressing data bottlenecks and strengthening its ability to integrate different functions [26][28].

Group 3: Market Impact and Future Prospects
- The article notes that UI-TARS has become one of the most popular open-source multimodal agents and has drawn significant attention from industry leaders [30].
- The technology is positioned to change how AI interacts with users; industry figures cited in the article predict that products like UI-TARS will significantly impact the market by 2025 [32][34].
- The article concludes by emphasizing the potential of GUI agents to bridge the gap between AI capabilities and human tasks, suggesting a transformative effect on productivity and efficiency [37][38].
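To make the "purely vision-driven" loop and the four capabilities (perception, action, reasoning, memory) more concrete, here is a minimal sketch. It is not UI-TARS code: the `Agent`, `Action`, `ToyApp`, and `toy_policy` names are hypothetical, a plain string stands in for the screenshot, and a hand-written rule stands in for the vision-language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str          # "click", "type", or "done" (hypothetical action set)
    target: str = ""   # description of the on-screen element to act on
    text: str = ""     # text to enter for "type" actions


@dataclass
class Agent:
    # policy(instruction, screen, memory) -> next Action; stands in for the model
    policy: Callable[[str, str, List[Action]], Action]
    memory: List[Action] = field(default_factory=list)

    def run(self, instruction: str, observe: Callable[[], str],
            act: Callable[[Action], None], max_steps: int = 10) -> List[Action]:
        for _ in range(max_steps):
            screen = observe()                                      # perception
            action = self.policy(instruction, screen, self.memory)  # reasoning
            self.memory.append(action)                              # memory
            if action.kind == "done":
                break
            act(action)                                             # action
        return self.memory


class ToyApp:
    """A fake app with three screens; no API is exposed, only the 'screen'."""

    def __init__(self) -> None:
        self.screen = "home: [search box]"

    def observe(self) -> str:
        return self.screen

    def act(self, action: Action) -> None:
        if action.kind == "click" and action.target == "search box":
            self.screen = "search: [text field]"
        elif action.kind == "type":
            self.screen = f"results for '{action.text}'"


def toy_policy(instruction: str, screen: str, memory: List[Action]) -> Action:
    # Decide the next step only from what is visible on screen.
    if screen.startswith("home"):
        return Action("click", target="search box")
    if screen.startswith("search"):
        return Action("type", text=instruction)
    return Action("done")


def run_demo() -> Tuple[List[Action], str]:
    app = ToyApp()
    agent = Agent(policy=toy_policy)
    trace = agent.run("weather today", app.observe, app.act)
    return trace, app.screen
```

The point of the sketch is that the agent never calls an app API: every decision is derived from the observed screen, which is why this approach works even where no interface is exposed.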
How Do Phone Makers and App Providers View the AI Phone Controversy? A2A Collaboration May Break the Deadlock
第一财经· 2026-01-12 13:37
Core Viewpoint
- The development of intelligent agents should focus on both "wisdom" and "execution," ensuring they understand user intent and act accordingly while remaining safe and controllable within the existing governance and commercial order [3][4].

Group 1: Current Trends in AI Agents
- Over the past year, various paths have emerged for exploring whether AI can effectively handle tasks traditionally performed by humans; products such as GUI Agents attempt to automate tasks like video editing and ticket booking by simulating human operations [3][4].
- Experts suggest that while the current GUI-based approach allows quicker integration into real-world applications, it has inherent limitations in stability, efficiency, and governance, making it more of a temporary solution [4][5].

Group 2: Industry Perspectives
- Mobile manufacturers are exploring how to implement intelligent agents; OPPO points out that the current GUI-based solutions are not the final form of AI phones but rather a method of operating existing interfaces [5].
- The core value for mobile manufacturers lies not in model parameters but in their long-term understanding of users, emphasizing that "memory" is the true essence of a smartphone [5][6].

Group 3: Challenges and Future Directions
- The real challenge for intelligent agents is not only task completion but also defining operational boundaries and handling the resulting management challenges [5][6].
- Current GUI models can stimulate the industry, but the Chinese AI sector should not be limited to this route; it should explore safer and more advanced evolutionary paths, taking cues from Apple's collaborative mechanisms between intelligent agents and applications [6][7].
- The introduction of A2A (Agent-to-Agent) mechanisms is suggested to improve governance and market competition, addressing potential risks associated with disruptive innovation [6][7].
How Do Phone Makers and App Providers View the AI Phone Controversy? A2A Collaboration May Break the Deadlock
第一财经 · 2026-01-12 12:24
Core Insights
- The true challenge of intelligent agents is not just their ability to perform tasks but also their wisdom and execution capabilities, which require a deep understanding of user intent and actions [1]
- The development of intelligent agents should not disrupt existing governance frameworks and commercial orders but should promote industry evolution through deep collaboration while ensuring safety and control [1]

Group 1: Current Developments in AI Agents
- Various exploration paths have emerged in the past year regarding whether AI can replace human tasks, with products attempting to operate through screen understanding and simulated actions, categorized as GUI Agents [3]
- These products face significant challenges, including permission granting, accountability for errors, service invocation, and regulatory constraints [3]
- Experts suggest that the authorization of intelligent agents should be scene-specific, with critical operations requiring secondary confirmation, and that not all scenarios should be authorized by third-party platforms [3]

Group 2: Industry Perspectives on AI Implementation
- OPPO's perspective indicates that while products like the Doubao phone positively impact the industry, they are not the final form of AI phones but rather a method to operate existing GUI interfaces [4]
- The focus for phone manufacturers is on engineering and scalability, as any instability in system capabilities can lead to significant quality issues [4]
- The future of intelligent agents is expected to shift towards A2A (Agent-to-Agent) collaboration models, with the core value for manufacturers lying in their long-term understanding of users rather than just model parameters [4]

Group 3: Regulatory and Safety Considerations
- The current GUI approach can activate the industry but should not be the sole focus; a more optimal evolution path that balances safety and development should be explored [5]
- Apple's model is highlighted as a reference, establishing a collaborative mechanism between intelligent agents and apps through open APIs while ensuring safety boundaries [5]
- The introduction of A2A mechanisms and market-based credit systems is suggested to improve authorization processes and manage potential risks associated with disruptive innovations [5]
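The experts' suggestion of scene-specific authorization with secondary confirmation for critical operations can be sketched roughly as follows. This is an illustrative assumption, not a mechanism described in the article: the scene names, operation names, and the `SCENE_POLICY` table are all hypothetical, and anything unlisted is denied by default.

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    ALLOW = "allow"      # routine step, no prompt needed
    CONFIRM = "confirm"  # critical step, needs a second confirmation
    DENY = "deny"        # never delegated to the agent


# Hypothetical policy table keyed by (scene, operation).
SCENE_POLICY = {
    ("ticket_booking", "search_flights"): Decision.ALLOW,
    ("ticket_booking", "submit_payment"): Decision.CONFIRM,
    ("video_editing", "export_clip"): Decision.ALLOW,
}


def decide(scene: str, operation: str) -> Decision:
    """Look up the scene-specific rule; unlisted pairs are denied."""
    return SCENE_POLICY.get((scene, operation), Decision.DENY)


def authorize(scene: str, operation: str,
              confirm: Optional[Callable[[str], bool]] = None) -> bool:
    """Return True only if the agent may perform the operation right now."""
    decision = decide(scene, operation)
    if decision is Decision.ALLOW:
        return True
    if decision is Decision.CONFIRM:
        # Secondary confirmation: with no way to ask the user, fail closed.
        return bool(confirm and confirm(f"Allow '{operation}' in '{scene}'?"))
    return False
```

In use, `confirm` would be whatever prompt the device shows the user; passing `None` models an agent running unattended, in which case critical operations are refused.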
Starting from the Doubao Phone: A Vision and Roadmap for On-Device Intelligence
AI前线· 2025-12-22 05:01
Core Viewpoint
- ByteDance's launch of the Doubao Mobile Assistant marks a significant shift in the application paradigm of large models, from "Chat" to "Action," and establishes it as the first system-level GUI Agent in the industry [2][3].

Technical Analysis and Evaluation
- The core technology of the Doubao Mobile Assistant is the GUI Agent, which evolved from an "external framework" into a "model-native intelligent agent" between 2023 and 2025. The early stage (2023-2024) relied on external frameworks, which limited the agent's capabilities because of their dependence on prompt engineering and external tools [4].
- The introduction in 2024 of visual language models driven by imitation learning marked a shift to model-native capabilities, allowing the agent to understand interfaces directly from pixel inputs and significantly improving adaptability to unstructured GUIs [5].
- By 2024-2025, visual language models driven by reinforcement learning had become mainstream, enabling agents to execute tasks autonomously in dynamic environments. The Doubao Mobile Assistant embodies this technological evolution [5][7].

Development History of GUI Agent
- Previous GUI Agents often stayed at the demo stage because they relied on Android accessibility services, which had significant drawbacks. The Doubao Mobile Assistant overcomes these issues through a customized OS that allows non-intrusive, system-level control [7][8].
- The Doubao Mobile Assistant's model architecture is a collaborative device-cloud design, indicating a shift from experimental to practical applications of GUI Agents [8].

Limitations and Future Outlook
- The Doubao Mobile Assistant faces three major challenges: security risks from its reliance on cloud-side models, insufficient autonomous task-completion capability, and limited ecosystem coverage [9][10][11].
- The assistant currently operates as a passive tool and lacks personalized, proactive service capabilities. Future development must focus on privacy, environmental perception, complex decision-making, and personalized service [12][13].

Evolution of On-Device Intelligence
- System-level GUI Agents create a fundamental tension between the need for comprehensive visibility into operations and users' privacy concerns. A balance must be struck that preserves user data sovereignty while still providing intelligent services [13][14].
- The future AI mobile ecosystem should follow the principle of "on-device native, cloud collaboration": sensitive user data stays on the device, while cloud capabilities are used for complex tasks [14][15].

Autonomous Intelligence and User Interaction
- The Doubao Mobile Assistant's current capabilities come from training on extensive data, but future autonomous intelligence must let agents learn and adapt in dynamic environments, overcoming challenges in generalization, autonomy, and long-term interaction [22][24][25].
- Moving from passive execution to proactive service is essential if personal assistants are to reduce users' cognitive load and improve the user experience [29][30][31].

Industry Trends and Future Predictions
- In the short term (within one year), more mobile assistants are expected to launch, intensifying competition between application developers and hardware manufacturers [35].
- In the medium term (2-3 years), the concept of a "personal exclusive assistant" will solidify, with on-device models evolving to provide personalized experiences based on user data [36].
- In the long term (3-5 years), a new type of on-device hardware will emerge that handles high-privacy operations and lightweight tasks locally, ensuring data sovereignty and rapid response times [38].
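The "on-device native, cloud collaboration" principle can be illustrated with a small routing sketch. Everything here is an assumption made for illustration: the `SENSITIVE_FIELDS` set, the `needs_cloud` flag, and the `route` function are hypothetical, and a real system would also have to check whether the task description itself leaks private data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical field names treated as sensitive user data.
SENSITIVE_FIELDS = {"contacts", "messages", "location", "payment_info"}


@dataclass
class Task:
    description: str
    data: Dict[str, Any] = field(default_factory=dict)
    needs_cloud: bool = False  # assumed flag: too complex for the on-device model


def route(task: Task) -> Dict[str, Any]:
    """Split a task so that sensitive fields never leave the device."""
    kept = {k: v for k, v in task.data.items() if k in SENSITIVE_FIELDS}
    shareable = {k: v for k, v in task.data.items() if k not in SENSITIVE_FIELDS}

    if not task.needs_cloud:
        # Simple tasks run fully on-device; nothing is uploaded.
        return {"target": "device", "cloud_payload": None,
                "kept_on_device": dict(task.data)}

    # Complex tasks send only the non-sensitive part to the cloud model.
    return {
        "target": "device+cloud",
        "cloud_payload": {"description": task.description, "data": shareable},
        "kept_on_device": kept,
    }
```

For example, a trip-planning task carrying both a `location` and a `destination` field would upload only the destination, while the location is used locally to finish the plan.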
Reflections Prompted by the Doubao Phone: Agents vs. Super Apps, AI Companies vs. Phone Makers
新财富· 2025-12-16 08:22
Core Viewpoint
- The launch of the Doubao mobile assistant by ByteDance represents a significant step towards the realization of system-level AI agents, challenging the dominance of super apps like WeChat and Alipay in the mobile ecosystem [2][14][27]

Group 1: Doubao Mobile Assistant Launch
- On December 1, ByteDance's Doubao team released a technical preview of the Doubao mobile assistant, which collaborates deeply with phone manufacturers at the operating system level to enable cross-application automation [2]
- The initial batch of 30,000 units sold out instantly, but within two days, major super apps like WeChat, Alipay, Taobao, and Meituan blocked the Doubao mobile assistant [3]

Group 2: AI Agent Development
- The Doubao mobile assistant demonstrates the feasibility of GUI agents, completing a closed-loop attempt for AI phones, but raises questions about its practical utility in real-world scenarios [5]
- The evolution of AI agents has transitioned from fixed scripts and rule engines to a stage where GUI intelligent agents can understand and operate across applications, as seen with advancements from companies like Microsoft and Anthropic [6][7]

Group 3: System-Level Agent vs. Super Apps
- The system-level agent can understand user intent and orchestrate multiple apps, moving the focus from an app-centric model to a user-intent-centric model [8][10]
- The core advantages of system-level agents include the ability to organize tasks across multiple apps and theoretical platform neutrality, alleviating long-standing issues like fragmented cross-app processes [11][12]

Group 4: Industry Dynamics and Conflicts
- The emergence of the Doubao mobile assistant has highlighted the conflict between system-level agents and super apps, with super apps responding defensively to protect their user entry points [14][15]
- The long-term outcome may not be the elimination of one model over the other, but rather a redefinition of power boundaries and responsibilities between system-level agents and super apps [17]

Group 5: Manufacturer Strategies
- Different manufacturers are adopting varied strategies regarding AI agents, with Huawei integrating agents into its operating system, Xiaomi focusing on ecosystem integration, and Apple maintaining a single official agent [19][23][24]
- The competitive landscape suggests a future where multiple agents coexist in the Android ecosystem, while iOS maintains a clearer structure with one official agent and several plugins [24][25]
The Doubao Phone Has Touched a Raw Nerve of the Big-Tech Apps
36氪 · 2025-12-15 23:28
Core Viewpoint
- The emergence of the Doubao AI phone has sparked significant interest in AI agents as a potential new entry point to the internet ecosystem, but it has faced immediate backlash from major internet companies over security and operational concerns [1][2][3].

Group 1: Doubao AI Phone and Its Features
- The Doubao AI phone lets users perform complex cross-application operations through an integrated AI agent, enhancing the user experience [1].
- Shortly after the phone's launch, users reported problems accessing major applications such as WeChat and Alipay, indicating a significant reduction in functionality [2].
- The phone's initial appeal diminished once its AI capabilities could no longer be used effectively with popular apps, and the user experience declined [2].

Group 2: Industry Response and Competition
- Industry insiders were not surprised by the backlash against Doubao; Tencent attributed the issues to existing security measures [3].
- Competition among internet giants for the next generation of traffic entry points is intensifying, with companies such as Alibaba and Tencent scrambling to establish their own AI applications [4][5].
- Doubao's rapid rise in daily active users (DAU) highlights its initial success, but subsequent declines in user engagement raise questions about its sustainability [6].

Group 3: The Shift in User Engagement and Advertising
- The dominance of major apps such as Taobao and WeChat has concentrated user traffic heavily, creating "traffic anxiety" among internet companies [4][5].
- GUI agents, which can operate apps without user interaction, threaten traditional advertising revenue models by bypassing direct app usage [13][15].
- The growth of AI assistants among smartphone manufacturers indicates a shift in the value chain from internet companies to hardware manufacturers [16].

Group 4: Future Implications and Developments
- The release of the Doubao AI phone has prompted other companies to accelerate their development of AI agents, with a focus on creating competitive products [19][20].
- Open-sourcing AI agent models could democratize access to this technology, potentially leading to a proliferation of personalized agents that challenge established players [21].
- It is becoming increasingly urgent for internet giants to adapt and innovate in response to the evolving landscape of AI applications [22].
Doubao "Tears Apart" the AI Phone
投中网· 2025-12-13 06:49
Core Viewpoint
- The emergence of the Doubao phone, a collaboration between Doubao and Nubia, has disrupted the AI smartphone market, showcasing advanced capabilities while sparking significant controversy and debate within the industry [6][7].

Group 1: Product Overview
- The Doubao phone is technically a preview version of the Nubia M153, featuring deep integration of the Doubao assistant into its operating system, allowing for continuous operations beyond traditional voice assistants [6][7].
- The phone's market price surged from the original 3499 yuan to as high as 36,000 yuan, reflecting a split sentiment between excitement and skepticism among consumers [6].
- It has been praised for its ability to perform complex tasks across multiple applications, such as answering questions on Bilibili and making purchases, leading to comparisons with human-like interaction [6][9].

Group 2: AI Smartphone Landscape
- The concept of "AI smartphones" gained traction in the second half of 2023, with major manufacturers like Samsung, Google, and Xiaomi emphasizing AI integration, but many offerings were criticized for lacking true innovation [8][9].
- Doubao's approach is more aggressive, enabling extensive cross-application operations that surpass the capabilities of existing AI smartphones, which are often limited to predefined tasks [9][11].

Group 3: Technical Pathways
- The industry is divided into two main technical pathways for AI smartphones: traditional methods relying on SDK interfaces and the newer GUI Agent approach, which allows direct interaction with the screen [9][10].
- Doubao's implementation of GUI Agent technology enables it to autonomously handle tasks without relying on app-specific interfaces, a significant advancement over earlier AI assistants [10][11].

Group 4: Industry Conflict
- The Doubao phone's capabilities have led to pushback from major apps like WeChat and Alipay, which have implemented restrictions to prevent automated operations, highlighting a clash between traditional app security measures and new AI functionalities [14][15].
- The core of the conflict lies in differing interpretations of user operation permissions, with apps prioritizing human-like interactions while AI systems advocate for user-authorized automation [15][16].

Group 5: Market Dynamics and Future Outlook
- The AI smartphone sector is becoming a battleground for tech companies vying for dominance in the AI era, with the potential to redefine user interaction and data flow [22][23].
- The emergence of the Doubao phone has prompted major tech firms to reassess their AI strategies, leading to a competitive landscape categorized into three tiers based on integration capabilities and market responsiveness [24][25][26].
An "Internet Protocol" for AI Arrives: Will Doubao-Style Phones No Longer Fear Being "Banned"?
36氪 · 2025-12-12 08:36
Core Viewpoint
- The article discusses the growing restrictions that applications are placing on the "Doubao Phone" (Nubia M153), highlighting a significant conflict between AI-driven tools and established app ecosystems, particularly over user access and operational permissions [1][13].

Group 1: Doubao Phone and GUI Agent
- The Doubao Phone is being blocked by a growing number of major applications, including WeChat, Alipay, and various e-commerce platforms, which limits user access [1].
- The Doubao Phone Assistant takes a GUI Agent approach, letting the AI interact with mobile interfaces without relying on official APIs, which raises concerns among major app providers [2][15].
- The conflict is not new; platforms such as WeChat have previously opposed GUI-based AI interactions, indicating broader resistance within the industry [13][15].

Group 2: MCP Protocol and Industry Standards
- The Model Context Protocol (MCP) has emerged as a potential solution to the challenges posed by GUI Agents, aiming to establish a standardized interface for AI interactions across platforms [4][5].
- MCP is gaining traction as a de facto standard, with major tech companies such as OpenAI and Google integrating it into their systems, indicating a shift towards a more interoperable AI ecosystem [7][8].
- The donation of MCP to the Linux Foundation signals a move towards a neutral governance structure, enhancing its credibility and adoption across the industry [8][9].

Group 3: Future of AI Interaction
- The article suggests that the future of AI will rely on a combination of the GUI and MCP approaches: GUI serves as a fallback in the current ecosystem, while MCP establishes clearer operational boundaries for AI interactions [20][21].
- The transition to MCP will require significant changes in the internet ecosystem, but it promises a more structured and secure way for AI to interact with various platforms [19][20].
- Ultimately, the goal is to create a unified system in which AI can operate seamlessly across different services while adhering to established rules and permissions [20][21].
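The "MCP where available, GUI as fallback" combination can be sketched as a small dispatcher. MCP messages follow JSON-RPC 2.0 and tools are invoked with the `tools/call` method; everything else below is an assumption for illustration, including the tool name `order_coffee`, the `plan_action` function, and the textual GUI steps, none of which come from the article or from any real app.

```python
import json
from typing import Any, Dict, List, Set


def mcp_tool_call(request_id: int, name: str,
                  arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def gui_fallback(intent: str, arguments: Dict[str, Any]) -> List[str]:
    """Hypothetical screen-level plan used when no structured tool exists."""
    return [
        f"open the app that handles '{intent}'",
        "locate the relevant controls on screen",
        f"enter {json.dumps(arguments, sort_keys=True)}",
        "ask the user to confirm before submitting",
    ]


def plan_action(intent: str, arguments: Dict[str, Any],
                available_tools: Set[str], request_id: int = 1) -> Dict[str, Any]:
    """Prefer the structured MCP path; fall back to GUI operation otherwise."""
    if intent in available_tools:
        return {"path": "mcp",
                "request": mcp_tool_call(request_id, intent, arguments)}
    return {"path": "gui", "steps": gui_fallback(intent, arguments)}
```

The design choice mirrors the article's argument: the structured path gives the service an explicit, auditable request it can accept or refuse, while the GUI path remains available for services that expose no such interface.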