Core Insights
- The article discusses the emergence of Multimodal Retrieval-Augmented Generation (MM-RAG) as a new field, highlighting its potential applications and the current state of research, which is still in its infancy [2][5][17]
- A comprehensive survey published by researchers from Huazhong University of Science and Technology, Fudan University, China Telecom, and the University of Illinois at Chicago covers nearly all possible combinations of modalities for input and output in MM-RAG [4][17]

Summary by Sections

Overview of MM-RAG
- MM-RAG is an evolution of traditional Retrieval-Augmented Generation (RAG) that incorporates multiple modalities such as text, images, audio, video, code, tables, knowledge graphs, and 3D objects [2][4]
- Current research primarily focuses on limited combinations of modalities, leaving many potential applications unexplored [2][5]

Potential Combinations
- The authors identify a vast space of potential input-output modality combinations, revealing that out of 54 proposed combinations, only 18 have existing research [5][6]
- Notably, combinations like "text + video as input, generating video as output" remain largely untapped [5]

Classification Framework
- A new classification framework for MM-RAG is established, systematically organizing existing research and clearly presenting the core technical components of different MM-RAG systems [6][15]
- This framework serves as a reference for future research and development in the field [6][15]

MM-RAG Workflow
- The MM-RAG workflow is divided into four key stages:
  1. Pre-retrieval: Organizing data and preparing queries [11]
  2. Retrieval: Efficiently finding relevant information from a multimodal knowledge base [12]
  3. Augmentation: Integrating retrieved multimodal information into the large model [13]
  4. Generation: Producing high-quality multimodal outputs based on input and augmented information [14][15]

Practical Guidance
- The survey provides a one-stop guide for building MM-RAG systems, covering training, evaluation, and application strategies [17][18]
- It discusses training methods to maximize retrieval and generation capabilities, summarizes existing evaluation metrics, and explores potential applications across various fields [18]
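
To make the four-stage workflow described above (pre-retrieval, retrieval, augmentation, generation) more concrete, here is a minimal Python sketch. It is illustrative only and not taken from the survey: the `Item` schema, all function names, the token-overlap scoring (a stand-in for real multimodal embeddings), and the placeholder generator are assumptions, and non-text items are represented by captions.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One entry in a toy multimodal knowledge base (hypothetical schema)."""
    modality: str                     # e.g. "text", "image", "table"
    content: str                      # raw text, or a caption standing in for non-text data
    keywords: frozenset = frozenset()  # filled in during pre-retrieval


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for a real multimodal encoder."""
    return frozenset(text.lower().split())


def pre_retrieval(raw_items, query):
    """Stage 1: organize the data (build a simple index) and prepare the query."""
    indexed = [Item(modality, content, tokenize(content)) for modality, content in raw_items]
    return indexed, tokenize(query)


def retrieve(indexed, query_tokens, k=2):
    """Stage 2: rank items by token overlap with the query; keep up to k matches."""
    ranked = sorted(indexed, key=lambda it: len(it.keywords & query_tokens), reverse=True)
    return [it for it in ranked[:k] if it.keywords & query_tokens]


def augment(query, retrieved):
    """Stage 3: fold the retrieved items into the model input, tagged by modality."""
    context = "\n".join(f"[{it.modality}] {it.content}" for it in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"


def generate(prompt):
    """Stage 4: placeholder; a real system would call a multimodal model here."""
    return f"(model output for a prompt of {len(prompt)} characters)"


def mm_rag(raw_items, query, k=2):
    """Run the four stages in order and return the intermediate results."""
    indexed, query_tokens = pre_retrieval(raw_items, query)
    retrieved = retrieve(indexed, query_tokens, k)
    prompt = augment(query, retrieved)
    return retrieved, prompt, generate(prompt)


if __name__ == "__main__":
    # Hypothetical toy knowledge base: (modality, caption-or-text) pairs.
    kb = [
        ("text", "a red panda eats bamboo"),
        ("image", "photo of a red panda in a tree"),
        ("table", "quarterly sales figures"),
    ]
    retrieved, prompt, answer = mm_rag(kb, "red panda bamboo")
    print(prompt)
    print(answer)
```

Keeping each stage as a separate function mirrors the staged decomposition in the summary; in a real system the tokenizer and the generator would be the parts swapped out for modality-specific encoders and a multimodal model.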
Welcoming the "Everything Can Be RAG" Era: Latest Survey Reveals a Vast Unexplored Space of 50+ Multimodal Combinations
机器之心 (Jiqizhixin) · 2025-12-02 09:18