Core Insights
- The era of multimodal agent AI has only just begun, and significant technical challenges remain on the path to Artificial General Intelligence (AGI) [1]
- Multimodal large models integrate inputs and outputs across text, speech, images, and video, enhancing processing capability and the user interaction experience [3]

Multimodal Technology Development
- Multimodal technology is essential for achieving AGI because it provides richer contextual understanding and improves model performance and accuracy [3]
- The technology can be divided into understanding tasks and generation tasks, with open challenges in modality encoding and alignment and in high-quality content generation [3][4]

Technical Evolution
- Current multimodal understanding models are largely built on pre-trained large-model technology and differ mainly in connector design and modality alignment methods (see the connector sketch after this list) [3]
- Multimodal understanding models focus primarily on vision and language, with the goal of handling more modalities in the future [3]

Future Directions
- Future multimodal large models are expected to unify understanding and generation, although key technologies such as backbone network design and modality alignment still require further research [4]
- The industry remains in its early stages, but there is confidence in the application prospects of multimodal technology in fields such as search, content creation, and robotics [4]
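To make the "connector" idea concrete, below is a minimal sketch of one common design, assuming a LLaVA-style setup in which a frozen vision encoder produces patch features and a small trainable projection maps them into the language model's token-embedding space. The class name, dimensions, and MLP structure are illustrative assumptions, not the specific architecture discussed in the article.

```python
# Illustrative sketch (assumption): a simple MLP connector that projects
# vision-encoder features into an LLM's embedding space. Multimodal
# understanding models differ mainly in how this block is designed.

import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Projects vision-encoder features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A two-layer MLP is one common connector choice; alternatives
        # (e.g. cross-attention resamplers) replace this block.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        # returns:      (batch, num_patches, llm_dim), ready to be
        # concatenated with text token embeddings before the LLM.
        return self.proj(vision_feats)


if __name__ == "__main__":
    connector = VisionLanguageConnector()
    fake_patches = torch.randn(2, 256, 1024)  # stand-in for encoder output
    visual_tokens = connector(fake_patches)
    print(visual_tokens.shape)  # torch.Size([2, 256, 4096])
```

Modality alignment then amounts to training this projection (and optionally the encoder or LLM) so that the projected visual tokens occupy the same representation space as text tokens.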
Alibaba Group Vice President Xu Zhuhong (许主洪): Multimodal Large Models Are the Key Path to AGI | On-Site at MWC Shanghai 2025