ACL 2025 Main Conference Paper | TRIDENT: Enhancing LLM Safety with Three-Dimensional Diversified Red-Teaming Data Synthesis
机器之心 (Jiqizhixin) · 2025-07-31 08:58
Core Insights
- The article presents TRIDENT, a framework that addresses the safety risks of large language models (LLMs) by synthesizing red-teaming data diversified along three dimensions [2][24].

Background
- Safety risks remain a significant barrier to the widespread adoption of LLMs. Existing datasets focus primarily on vocabulary diversity and give far less coverage to the diversity of malicious intent and jailbreak strategies [1][11].

Methodology
- TRIDENT combines a persona-based, zero-shot automatic generation paradigm with six jailbreak techniques to produce high-quality red-teaming data at low cost [2][5].
- The framework includes a three-dimensional risk coverage assessment that quantifies diversity and balance across vocabulary, malicious intent, and jailbreak strategies [9].

Experimental Results
- Two datasets were generated: TRIDENT-CORE (26,311 entries), which covers the vocabulary and intent dimensions, and TRIDENT-EDGE (18,773 entries), which additionally introduces jailbreak strategies [9].
- In comparative benchmarks, models trained on TRIDENT-EDGE achieved the lowest average Harm Score and Attack Success Rate while maintaining or improving Helpful Rate relative to models trained on other datasets [20][22].

Breakthrough Significance
- TRIDENT offers a sustainable, low-cost approach to LLM safety alignment that integrates into existing training pipelines such as RLHF and DPO [24].
- The framework is designed to evolve with model updates and emerging threats, so that it stays relevant as the landscape changes [25].
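
The three-dimensional coverage assessment described under Methodology can be illustrated with a small sketch. The summary does not say which metrics TRIDENT actually uses, so the choices here are assumptions made purely for illustration: distinct-n as a proxy for vocabulary diversity, and normalized Shannon entropy as a proxy for how evenly intent and jailbreak-strategy labels are distributed. The record fields (`prompt`, `intent`, `strategy`), the label values, and the function names are hypothetical placeholders, not part of the framework.

```python
import math
from collections import Counter


def distinct_n(texts, n=2):
    """Share of unique n-grams among all n-grams (a simple lexical-diversity proxy)."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def normalized_entropy(labels):
    """Shannon entropy of the label distribution, scaled to [0, 1].

    1.0 means the observed labels are perfectly balanced; 0.0 means only one
    label (or none) is present. Normalization uses the number of labels that
    actually occur, not the size of any external taxonomy.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def coverage_report(records):
    """Summarize a dataset along the three dimensions named in the article."""
    return {
        "vocabulary_distinct_2": distinct_n([r["prompt"] for r in records], n=2),
        "intent_balance": normalized_entropy([r["intent"] for r in records]),
        "strategy_balance": normalized_entropy([r["strategy"] for r in records]),
    }


# Hypothetical toy records; real entries would carry generated red-teaming prompts.
records = [
    {"prompt": "placeholder text one", "intent": "intent_a", "strategy": "strategy_1"},
    {"prompt": "placeholder text two", "intent": "intent_b", "strategy": "strategy_1"},
    {"prompt": "another sample here", "intent": "intent_a", "strategy": "strategy_2"},
    {"prompt": "yet another sample", "intent": "intent_b", "strategy": "strategy_2"},
]

report = coverage_report(records)
for dimension, score in report.items():
    print(f"{dimension}: {score:.3f}")
```

A report like this makes it possible to compare candidate datasets on all three dimensions at once. A dataset that scores well on vocabulary diversity but poorly on intent or strategy balance would show the gap the Background section attributes to existing datasets.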