Workflow
聚焦多模态:ChatGPT时刻未到,2025大模型“变慢”了吗
Bei Jing Shang Bao·2025-06-08 13:27

Core Insights - The emergence of multi-modal models, such as Emu3, signifies a shift in content generation, with the potential to understand and generate text, images, and videos through a single model [1][3] - The rapid development of AI has led to a competitive landscape where new and existing products coexist, but the core capabilities of video generation are still lagging behind expectations [1][5] - The commercial application of large models faces challenges, particularly in integrating visual generation with existing models, which limits scalability and effectiveness [7][8] Multi-Modal Model Development - Emu3, released by Zhiyuan Research Institute, is a native multi-modal model that incorporates various data types from the beginning of its training process, unlike traditional models that focus on language first [3][4] - The current learning path for multi-modal models often leads to a decline in performance as they transition from strong language capabilities to integrating other modalities [3][4] - The development of multi-modal models is still in its early stages, with significant technical challenges remaining, particularly in filtering effective information from diverse data types [3][4] Video Generation Challenges - Video generation technology is currently at a transitional phase, comparable to the evolution from GPT-2 to GPT-3, indicating that there is substantial room for improvement [5][6] - Key issues in video generation include narrative coherence, stability, and controllability, which are essential for producing high-quality content [6] - The industry is awaiting a breakthrough moment akin to the "ChatGPT moment" to enhance video generation capabilities [6] Commercialization and Market Growth - The multi-modal AI market is projected to reach $2.4 billion in 2024, with a compound annual growth rate (CAGR) exceeding 28%, and is expected to grow to $128 billion by 2025, reflecting a CAGR of 62.3% from 2023 to 2025 [8] - The integration of traditional computer vision models with large models is seen as a potential pathway for commercial applications, contingent on achieving a favorable cost-benefit ratio [7][8] - Companies are evolving their service models from providing platforms (PaaS) to offering tools (SaaS) and ultimately delivering direct results to users by 2025 [8]