Abstract:
In recent years, large language models (LLMs) have been widely applied to a range of downstream tasks and have demonstrated remarkable text understanding, generation, and reasoning capabilities across various fields. However, jailbreak attacks are emerging as a new threat to LLMs. Jailbreak attacks can bypass the security mechanisms of LLMs, weaken the effect of safety alignment, and induce harmful outputs from aligned models. Issues such as abuse, hijacking, and leakage caused by jailbreak attacks pose serious threats to both dialogue systems and applications built on LLMs. We present a systematic review of jailbreak attacks in recent years and categorize them into three distinct types based on their underlying mechanism: manually designed attacks, LLM-generated attacks, and optimization-based attacks. We summarize the core principles, implementation methods, and findings of the relevant studies, and trace the evolutionary trajectory of jailbreak attacks on LLMs, offering a valuable reference for future research. Moreover, we provide a concise overview of existing security measures, introducing pertinent techniques from the perspectives of internal defense and external defense that aim to mitigate jailbreak attacks and enhance the content safety of LLM generation. Finally, we discuss the open challenges and frontier directions in the field of jailbreak attacks on LLMs, and examine the potential of multimodal approaches, model editing, and multi-agent methodologies for tackling jailbreak attacks, providing insights and research prospects to further advance the field of LLM security.