Workflow
Multi-modal Models and Agents Become Big Tech's New AI Battleground
创业邦·2025-05-01 02:54

Core Viewpoint
- The article examines how large models are evolving in consumer-facing applications, focusing on richer user interaction and complex task execution through multi-modal capabilities and agent product ecosystems [4][6].

Multi-modal Capabilities
- Major companies including ByteDance, Baidu, Google, and OpenAI have recently released advanced multi-modal models, enabling new kinds of applications [4].
- Alibaba's AI product Quark introduced a feature called "Photo Ask Quark," which uses multi-modal capabilities for richer user interaction [4][10].
- Multi-modal reasoning is advancing in products such as ByteDance's Doubao 1.5 and OpenAI's o3 and o4-mini, which can analyze images and generate content [9][10].

Agent Execution Capabilities
- General-purpose agent products aim to execute complex tasks from natural-language commands, with recent launches from companies including ByteDance and Baidu [4][5].
- The article argues that agents need three key capabilities: integration with third-party data and tools, coding ability, and strong task understanding [20][23] (see the sketch at the end of this summary).
- Manus has set a direction for agent products, demonstrating a framework that combines user task understanding with tool integration [17].

Future of Agents
- The end state for agents remains uncertain, and their development and application are still being explored [7].
- Combining multi-modal capabilities with agent execution is seen as crucial to building a foundational entry point for future applications [25].
- OpenAI expects AI agents to surpass ChatGPT in revenue by the end of 2025, projecting $3 billion from agents, with further growth expected through 2029 [25].
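To make the "three key capabilities" concrete, here is a minimal, hypothetical sketch of an agent loop: the model interprets the user's task, selects a registered third-party tool, and executes it (including a code-execution tool). All names in this sketch (plan_task, TOOLS, run_agent) are illustrative assumptions, not any vendor's actual API.

```python
# Hypothetical sketch of the agent loop described above:
# (1) understand the task, (2) choose a registered tool, (3) execute and collect results.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Step:
    tool: str       # which registered tool to invoke
    argument: str   # plain-text argument passed to the tool

# Tool registry standing in for "integration with third-party data and tools".
# Both entries are stubs; a real agent would call external services here.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"[stub] top results for: {q}",
    "python": lambda src: f"[stub] would execute code:\n{src}",  # the "coding ability"
}

def plan_task(task: str) -> List[Step]:
    """Stand-in for the model's task understanding: split a task into tool calls.
    A real agent would ask an LLM to produce this plan; here it is hard-coded."""
    return [
        Step("search", task),
        Step("python", f"print('summary of: {task}')"),
    ]

def run_agent(task: str) -> List[str]:
    """Run each planned step with its tool and return the observations."""
    return [TOOLS[step.tool](step.argument) for step in plan_task(task)]

if __name__ == "__main__":
    for observation in run_agent("compare Doubao 1.5 and o4-mini on image reasoning"):
        print(observation)
```

The design point the sketch illustrates is that the three capabilities are layered: task understanding produces a plan, the tool registry supplies external data and actions, and code execution is simply one more tool in that registry.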