From MLLM to Agent: A 10,000-Word Survey of the Evolution of Large Model Security
自动驾驶之心·2025-09-03 23:33

Core Insights
- The article traces the evolution of large models from LLMs to MLLMs and then to Agents, highlighting how growing capabilities bring new security risks, with jailbreak attacks singled out as a significant threat [2][3][4].

Group 1: Evolution of Large Models
- The transition from LLMs to MLLMs and then to Agents represents a significant paradigm shift in AI, with each stage introducing new capabilities and new security challenges [7][16].
- LLMs, built on breakthroughs in neural networks, are limited in handling multi-modal data, which led to the development of MLLMs that integrate text, image, and audio [8][12].
- MLLMs expand capabilities but also enlarge the attack surface, enabling more sophisticated jailbreak attacks that exploit visual and audio vulnerabilities [13][15].

Group 2: Jailbreak Attack Classification
- The article proposes a two-dimensional classification framework for jailbreak attacks based on "attack impact" and "attacker permissions," providing a comprehensive analysis of attack methods across different model types [25][32].
- Attacks are divided into training-phase and inference-phase attacks, with specific techniques such as backdoor attacks and prompt attacks identified [29][30].
- The classification also distinguishes between white-box and black-box attacks, reflecting how much access attackers have to model internals [32][36].

Group 3: Datasets and Evaluation Metrics
- The article reviews existing datasets and evaluation metrics for jailbreak research, noting limitations in diversity and coverage, particularly in multi-modal and multi-turn scenarios [37][43].
- It categorizes datasets by source and format, highlighting the need for dynamic datasets that can keep pace with evolving attack strategies [39][41].
- Five main categories of evaluation metrics are discussed, including human evaluation, automated assessments, and custom metrics tailored to specific research needs [44][58].
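
The two-dimensional classification described in Group 2 can be pictured as a small grid. The following is a minimal sketch, not the article's own formalization: it assumes the "attack impact" axis corresponds to the phase an attack targets (training or inference), and the names `AttackPhase`, `AttackerAccess`, `JailbreakAttack`, and `group_by_cell` are hypothetical. The access level assigned to each example attack is an illustrative assumption, not a claim from the article.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class AttackPhase(Enum):
    """Assumed 'attack impact' axis: which phase the attack targets."""
    TRAINING = "training phase"
    INFERENCE = "inference phase"


class AttackerAccess(Enum):
    """'Attacker permissions' axis: how much of the model is visible."""
    WHITE_BOX = "white-box"
    BLACK_BOX = "black-box"


@dataclass(frozen=True)
class JailbreakAttack:
    name: str
    phase: AttackPhase
    access: AttackerAccess


def group_by_cell(attacks):
    """Place each attack in its (phase, access) cell of the grid."""
    grid = defaultdict(list)
    for attack in attacks:
        grid[(attack.phase, attack.access)].append(attack.name)
    return dict(grid)


# Illustrative placements only; the access level chosen for each
# attack is an assumption made for this sketch.
examples = [
    JailbreakAttack("backdoor attack", AttackPhase.TRAINING, AttackerAccess.WHITE_BOX),
    JailbreakAttack("prompt attack", AttackPhase.INFERENCE, AttackerAccess.BLACK_BOX),
]

for (phase, access), names in group_by_cell(examples).items():
    print(f"{phase.value} / {access.value}: {', '.join(names)}")
```

Keeping the two axes as separate enums means an attack method is described by one value per dimension, so any phase can be combined with any access level without defining a category per combination.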