Harbin Institute of Technology (Shenzhen) Team Releases Uni-MoE-2.0-Omni: New SOTA in Omnimodal Understanding, Reasoning, and Generation
机器之心·2025-11-25 09:37

Core Insights
- The article traces the evolution of artificial intelligence toward omnimodal large models (OLMs), which can understand, generate, and process many data types, marking a shift from specialized tools to versatile AI partners [2]
- It highlights the release of the second-generation "LiZhi" omnimodal large model, Uni-MoE-2.0-Omni, covering advances in both model architecture and training strategy [3][11]

Model Architecture
- Uni-MoE-2.0-Omni is built around a large language model (LLM) with unified perception and generation modules, enabling end-to-end processing of text, images, video, and audio [7]
- The model uses a unified tokenization strategy for multimodal representation, with a SigLIP encoder for images and video and Whisper-Large-v3 for audio, substantially improving understanding efficiency (see the tokenization sketch after this summary) [7]
- The architecture includes a dynamic-capacity MoE that adapts the number of active experts to token difficulty, improving training stability and memory management (see the routing sketch below) [8]
- A full-modality generator merges understanding and generation into a single seamless flow, strengthening speech and visual generation [8]

Training Strategies
- A progressive training strategy addresses the instability of mixture-of-experts architectures, advancing through cross-modal alignment, expert warming, MoE fine-tuning, and generative training (see the staging sketch below) [11]
- The team proposes a joint training method that anchors both multimodal understanding and generation tasks to language generation, breaking down the barrier between the two (see the loss sketch below) [11]

Performance Evaluation
- Uni-MoE-2.0-Omni was evaluated on 85 benchmarks, reaching state-of-the-art performance on 35 tasks and surpassing Qwen2.5-Omni on 50 tasks, demonstrating high data-utilization efficiency [13]
- On video benchmarks the model improves on Qwen2.5-Omni by 7%, indicating a significant advance in multimodal understanding [13]

Use Cases
- Demonstrated applications include visual mathematical reasoning, season-aware image generation, image-quality restoration, and open-ended conversation [18][20][28][30]

Conclusion and Outlook
- Uni-MoE-2.0-Omni represents a significant step for multimodal AI and provides a solid foundation for future research toward general-purpose multimodal artificial intelligence [33]
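To make the unified-tokenization idea concrete, below is a minimal sketch of mapping every modality into one token-embedding space before the LLM. The article names only the encoders (SigLIP for images/video, Whisper-Large-v3 for audio); the specific checkpoint IDs, the linear adapters `vision_proj`/`audio_proj`, and the concatenation logic are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
from transformers import SiglipVisionModel, WhisperModel

class UnifiedTokenizer(nn.Module):
    """Sketch: project visual and audio features into a shared LLM space."""

    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        # Encoder families follow the article; exact checkpoints are assumptions.
        self.vision = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
        self.audio = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder
        # Hypothetical linear adapters into the LLM embedding dimension.
        self.vision_proj = nn.Linear(self.vision.config.hidden_size, llm_dim)
        self.audio_proj = nn.Linear(self.audio.config.d_model, llm_dim)

    def forward(self, pixel_values=None, audio_features=None):
        tokens = []
        if pixel_values is not None:  # (B, 3, H, W) image or sampled video frames
            v = self.vision(pixel_values=pixel_values).last_hidden_state
            tokens.append(self.vision_proj(v))
        if audio_features is not None:  # (B, 128, T) log-mel spectrogram
            a = self.audio(input_features=audio_features).last_hidden_state
            tokens.append(self.audio_proj(a))
        # Concatenate along the sequence axis; in practice text embeddings
        # from the LLM's own tokenizer would be interleaved here.
        return torch.cat(tokens, dim=1)
```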
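The "dynamic-capacity" routing can be pictured with the toy sketch below: tokens where the router is confident use one expert, while tokens with a flatter router distribution activate more experts, up to `max_k`. The cumulative-probability threshold `tau` and the dispatch loop are assumptions for illustration; the paper's actual routing policy may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCapacityMoE(nn.Module):
    """Sketch: per-token variable expert count based on router confidence."""

    def __init__(self, dim=512, n_experts=8, max_k=4, tau=0.5):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])
        self.max_k, self.tau = max_k, tau

    def forward(self, x):  # x: (tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)      # (T, n_experts)
        topv, topi = probs.topk(self.max_k, dim=-1)    # candidate experts per token
        # Keep the smallest prefix of experts whose probability mass reaches
        # tau: confident tokens stop at one expert, "hard" tokens use more.
        keep = (topv.cumsum(-1) - topv) < self.tau     # (T, max_k), slot 0 always kept
        w = (topv * keep) / (topv * keep).sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.max_k):                 # naive dispatch for clarity
            for e, expert in enumerate(self.experts):
                mask = keep[:, slot] & (topi[:, slot] == e)
                if mask.any():
                    out[mask] += w[mask, slot, None] * expert(x[mask])
        return out
```

A real implementation would batch the dispatch rather than loop over slots and experts; the loop is kept here only to make the routing rule easy to read.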
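The four-stage progressive recipe can be summarized as a staging schedule. The stage names follow the article; which parameter groups are unfrozen at each stage, and the HF-style `.loss` output, are assumptions made for illustration.

```python
# Schematic of the progressive training recipe:
# cross-modal alignment -> expert warming -> MoE fine-tuning -> generative training.
STAGES = [
    # (stage name,             trainable parameter groups -- assumed split)
    ("cross_modal_alignment", ["vision_proj", "audio_proj"]),
    ("expert_warming",        ["experts"]),
    ("moe_finetuning",        ["experts", "router", "llm"]),
    ("generative_training",   ["generator", "llm"]),
]

def set_trainable(model, groups):
    # Freeze everything except the named parameter groups.
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(g) for g in groups)

def train_progressively(model, loaders, make_optimizer, steps_per_stage):
    for stage, groups in STAGES:
        set_trainable(model, groups)
        opt = make_optimizer(p for p in model.parameters() if p.requires_grad)
        for _, batch in zip(range(steps_per_stage), loaders[stage]):
            loss = model(**batch).loss  # assumes an HF-style output object
            loss.backward()
            opt.step()
            opt.zero_grad()
```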
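Finally, a sketch of the joint objective that "anchors" both task families to language generation: understanding and generation samples share the same next-token cross-entropy, with generation heads contributing auxiliary terms. The function name, signature, and weighting scheme are assumptions; only the anchoring idea comes from the article.

```python
import torch.nn.functional as F

def joint_loss(lm_logits, lm_targets, gen_loss=None, gen_weight=0.5):
    # Shared next-token cross-entropy: both understanding and generation
    # samples are supervised as language generation.
    loss = F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)), lm_targets.view(-1)
    )
    # Optional auxiliary term from a speech/image generation head.
    if gen_loss is not None:
        loss = loss + gen_weight * gen_loss
    return loss
```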
