2024 Report on the Research Status and Outlook of Lightweight Methods for Multimodal Large Language Models (MLLMs)
Zhong Yi Zhi Ku·2024-12-20 08:25

Investment Rating

- The report does not explicitly provide an investment rating for the industry

Core Insights

- The report discusses the innovative nature of multimodal large language models (MLLMs), which integrate language processing with multimodal capabilities, enabling them to handle data types such as text, images, and video [2][4]
- It highlights the challenges posed by the large scale and high training and inference costs of MLLMs, which limit their widespread adoption in academia and industry [4][29]
- The focus is on the development of efficient, lightweight MLLMs, particularly for edge-computing scenarios, which hold significant potential for future advances [4][29]

Summary by Sections

Overview of Multimodal Large Language Models

- MLLMs owe their success to the scaling law, under which increased resource investment yields better performance; those same resource demands, however, restrict their development and deployment [29]
- The report emphasizes the need for lightweight MLLMs that reduce resource consumption while maintaining performance [29][54]

Lightweight Optimization Methods

- The report identifies three core modules of an MLLM: the visual encoder, the pretrained large language model, and the visual-language projector; optimization efforts focus on these areas (a minimal architectural sketch follows this summary) [30][54]
- Lightweight optimization techniques include model-compression methods such as quantization, pruning, and knowledge distillation, which have long been explored for traditional deep networks (see the quantization sketch below) [7][29]

Visual Token Compression

- Visual token compression is crucial for reducing the computational load caused by long visual token sequences, and is therefore essential for efficient MLLMs (see the pooling sketch below) [8][57]
- The report discusses methods for multi-scale information fusion that enhance visual feature extraction, allowing models to capture both fine-grained details and broader context [40]

Efficient Structural Design

- The report outlines the importance of optimizing model structures and algorithm designs to achieve high performance with fewer resources, focusing on mixture-of-experts models and inference acceleration (see the routing sketch below) [9][41]
- It notes that deploying lightweight MLLMs on edge devices could significantly enhance the capabilities of intelligent devices and robots [61]
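
To make the three-module decomposition concrete, here is a minimal sketch of how visual tokens flow through an encoder, a projector, and an LLM. All class names, dimensions, and the toy encoder/LLM below are hypothetical stand-ins, not the architecture of any model the report surveys.

```python
# A minimal sketch of the three core modules named in the report: a visual
# encoder, a visual-language projector, and a pretrained LLM. Everything
# here is an illustrative placeholder.
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, vocab=32000):
        super().__init__()
        # 1) Visual encoder: stands in for a ViT mapping an image to patch tokens.
        self.visual_encoder = nn.Linear(vis_dim, vis_dim)
        # 2) Visual-language projector: maps visual tokens into the LLM embedding space.
        self.projector = nn.Linear(vis_dim, llm_dim)
        # 3) Pretrained LLM: stands in for a decoder-only transformer.
        self.llm = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, patch_tokens, text_embeds):
        vis = self.projector(self.visual_encoder(patch_tokens))  # (B, N_vis, llm_dim)
        seq = torch.cat([vis, text_embeds], dim=1)               # prepend visual tokens
        return self.lm_head(self.llm(seq))                       # next-token logits

model = ToyMLLM()
logits = model(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(logits.shape)  # torch.Size([1, 608, 32000])
```

Note how the 576 visual tokens dominate the LLM's input sequence; this is why the projector and the visual token count are natural targets for the optimizations summarized above.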
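Among the compression methods the report lists (quantization, pruning, knowledge distillation), quantization is the simplest to illustrate. The sketch below shows naive post-training symmetric int8 weight quantization with a per-row scale; it is an illustrative round-to-nearest scheme, not the method of any paper the report cites.

```python
# A minimal sketch of post-training symmetric int8 weight quantization,
# one of the model-compression techniques named in the report.
import torch

def quantize_int8(w: torch.Tensor):
    """Map float weights to int8 with one scale per output row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0    # per-row scale factor
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)                 # a hypothetical LLM weight matrix
q, s = quantize_int8(w)
err = (dequantize(q, s) - w).abs().mean()
print(f"int8 storage: {q.numel()} bytes (4x smaller), mean abs error: {err:.5f}")
```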
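Visual token compression can be as simple as merging neighboring patch tokens before they reach the LLM, shrinking the sequence the model must attend over. The 2x2 average pooling below is one illustrative choice among the many reduction strategies the report covers, not a specific surveyed method.

```python
# A minimal sketch of visual token compression by spatially pooling
# adjacent patch tokens, cutting the LLM's visual sequence length.
import torch
import torch.nn.functional as F

def pool_visual_tokens(tokens: torch.Tensor, grid: int, stride: int = 2):
    """Merge each stride x stride block of patch tokens into one token.

    tokens: (B, grid*grid, D) patch tokens from the visual encoder.
    Returns (B, (grid//stride)**2, D), reducing attention cost by stride**2.
    """
    B, N, D = tokens.shape
    x = tokens.transpose(1, 2).reshape(B, D, grid, grid)  # restore the 2-D grid
    x = F.avg_pool2d(x, kernel_size=stride)               # spatial merging
    return x.flatten(2).transpose(1, 2)

tokens = torch.randn(1, 576, 1024)  # a 24x24 grid, e.g. from a 336px ViT-L/14
print(pool_visual_tokens(tokens, grid=24).shape)  # torch.Size([1, 144, 1024])
```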
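Finally, a sketch of the mixture-of-experts idea behind the efficient structural designs mentioned above: a router activates only one small expert network per token, so per-token compute stays roughly constant while total model capacity grows. The expert count, sizes, and top-1 routing rule here are arbitrary illustrative choices.

```python
# A minimal sketch of top-1 mixture-of-experts routing.
import torch
import torch.nn as nn

class TopOneMoE(nn.Module):
    def __init__(self, dim=512, hidden=1024, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, dim)
        gate = self.router(x).softmax(dim=-1)   # routing probabilities
        top = gate.argmax(dim=-1)               # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():                      # only run experts that were chosen
                out[mask] = expert(x[mask]) * gate[mask, i].unsqueeze(-1)
        return out

moe = TopOneMoE()
print(moe(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```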