Jailbreak Attack for Large Language Models: A Survey

李南; 丁益东; 江浩宇; 牛佳飞; 易平

doi:10.7544/issn1000-1239.202330962

Abstract: In recent years, large language models (LLMs) have been widely applied to a range of downstream tasks and have demonstrated remarkable text understanding, generation, and reasoning capabilities across many fields. However, jailbreak attacks are emerging as a new threat to LLMs. Jailbreak attacks can bypass the safety mechanisms of LLMs, weaken the effect of value alignment, and induce aligned LLMs to produce harmful outputs. The abuse, hijacking, and leakage problems caused by jailbreak attacks already pose serious threats to LLM-based dialogue systems and applications. We present a systematic review of recent jailbreak attack research and, based on the underlying attack mechanism, categorize these attacks into three types: manually designed attacks, LLM-generated attacks, and adversarial-optimization-based attacks. We summarize the core principles, implementation methods, and findings of the relevant studies and trace the development of jailbreak attacks on LLMs, providing a reference for future research. We also give a brief overview of existing safety measures, introducing techniques from the perspectives of internal defense and external defense that can mitigate jailbreak attacks and improve the safety of LLM-generated content, and we list and compare the strengths and weaknesses of the different methods. Finally, we discuss open problems and frontier directions in the field of jailbreak attacks on LLMs and outline research prospects in connection with multimodality, model editing, and multi-agent systems.
