Survey of Adversarial Attacks and Defenses for Large Language Models

Tai Jianwei; Yang Shuangning; Wang Jiajia; Li Yakai; Liu Qixu; Jia Xiaoqi

doi:10.7544/issn1000-1239.202440630

Journal of Computer Research and Development > 2025 > 62(3): 563-588. > DOI: 10.7544/issn1000-1239.202440630 CSTR: 32373.14.issn1000-1239.202440630

Tai Jianwei, Yang Shuangning, Wang Jiajia, Li Yakai, Liu Qixu, Jia Xiaoqi. Survey of Adversarial Attacks and Defenses for Large Language Models[J]. Journal of Computer Research and Development, 2025, 62(3): 563-588. DOI: 10.7544/issn1000-1239.202440630

Citation:

PDF (1329 KB)

Survey of Adversarial Attacks and Defenses for Large Language Models

1.
School of Internet, Anhui University, Hefei 230039
2.
Institute of Information Engineering, Chinese Academy of Sciences , Beijing 100093

Funds: This work was supported by the General Program of the National Natural Science Foundation of China (71971002) and the Anhui Provincial Natural Science Foundation(2108085QA35).

More Information

Author Bio:
Tai Jianwei: born in 1993. PhD, lecturer. Senior member of CCF. His main research interests include artificial intelligence applications, intelligent decision making and security in cyberspace. （24012@ahu.edu.cn）

Yang Shuangning: born in 2003. Master candidate. His main research interest includes large language model security

Wang Jiajia: born in 2004. Undergraduate. Her main research interest includes cyberspace security

Li Yakai: born in 1997. PhD candidate. His main research interests include artificial intelligence security and deep learning interpretability

Liu Qixu: born in 1984. PhD, professor. His main research interests include Web security and vulnerability mining

Jia Xiaoqi: born in 1982. PhD, professor. His main research interests include network attack and defence, operating system security, and cloud computing security
Received Date: July 20, 2024
Revised Date: January 19, 2025
Available Online: January 19, 2025

Graphical Abstract

Abstract

Abstract

With the rapid development of natural language processing and deep learning technologies, large language models (LLMs) have been increasingly applied in various fields such as text processing, language understanding, image generation, and code auditing. These models have become a research hotspot of common interest in both academia and industry. However, adversarial attack methods allow attackers to manipulate large language models into generating erroneous, unethical, or false content, posing increasingly severe security threats to these models and their wide-ranging applications. This paper systematically reviews recent advancements in adversarial attack methods and defense strategies for large language models. It provides a detailed summary of fundamental principles, implementation techniques, and major findings from relevant studies. Building on this foundation, the paper delves into technical discussions of four mainstream attack modes: prompt injection attacks, indirect prompt injection attacks, jailbreak attacks, and backdoor attacks. Each is analyzed in terms of its mechanisms, impacts, and potential risks. Furthermore, the paper discusses the current research status and future directions of large language models security, and outlooks the application prospects of large language models combined with multimodal data analysis and integration technologies. This review aims to enhance understanding of the field and foster more secure, reliable applications of large language models.
- large language models (LLMs),
- adversarial attack,
- defense strategy,
- cyberspace security,
- generative artificial intelligence

FullText(HTML)

References (117)

References

[1]	Matthew H. Hackers easily fool artificial intelligences-adversarial attacks highlight lack of security in machine learning algorithms[J]. Science, 2018, 361(6399): 215−215 doi: 10.1126/science.361.6399.215
[2]	Romera-Paredes B, Barekatain M, Novikov A, et al. Mathematical discoveries from program search with large language models[J]. Nature, 2024, 625(7995): 468−475 doi: 10.1038/s41586-023-06924-6
[3]	Radford A, Wu J, Child R, et al. OpenAI blog: Language models are unsupervised multitask learners [EB/OL]. 2024[2024-08-10]. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[4]	Koroteev M V. BERT: A review of applications in natural language processing and understanding[J]. arXiv preprint, arXiv: 2103.11943, 2021
[5]	Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint, arXiv: 2302.13971, 2023
[6]	虎嵩林,李涓子,秦兵,等. 亦正亦邪大模型——大模型与安全专题导读[J]. 计算机研究与发展,2024,61(5):1085−1093 Hu Songlin, Li Juanzi, Qin Bing, et al. The double-edged swords: An introduction to the special issue on large models and safety[J]. Journal of Computer Research and Development, 2024, 61(5): 1085−1093 (in Chinese)
[7]	Lin S, Hilton J, Evans O. TruthfulQA: Measuring how models mimic human falsehoods[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2022: 3214−3252
[8]	Bommasani R, Hudson D A, Adeli E, et al. On the opportunities and risks of foundation models[J]. arXiv preprint, arXiv: 2108.07258, 2021
[9]	Weidinger L, Mellor J, Rauh M, et al. Ethical and social risks of harm from language models[J]. arXiv preprint, arXiv: 2112.04359, 2021
[10]	Hou Xinyi, Zhao Yanjie, Wang Haoyu. On the (in)security of LLM app stores[J]. arXiv preprint, arXiv: 2407.08422, 2024
[11]	Achintalwar S, Garcia A A, Anaby-tavor A, et al. Detectors for safe and reliable LLMs: Implementations, uses, and limitations[J]. arXiv preprint, arXiv: 2403.06009, 2024
[12]	王笑尘,张坤,张鹏. 多视角看大模型安全及实践[J]. 计算机研究与发展,2024,61(5):1104−1112 Wang Xiaochen, Zhang Kun, Zhang Peng. Large model safety and practice from multiple perspectives[J]. Journal of Computer Research and Development, 2024, 61(5): 1104−1112 (in Chinese)
[13]	Jacob D, Chang Mingwei, Kenton L, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]//Proc of the 2019 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: ACL, 2019: 4171–4186
[14]	Wei J, Bosma M, Zhao V, et al. Finetuned language models are zero-shot learners[C/OL]//Proc of the 10th Int Conf on Learning Representations. Washington: ICLR, 2022[2024-08-11]. https://iclr.cc/virtual/2022/oral/6255
[15]	Ziegler D M, Stiennon N, Wu J, et al. Fine-tuning language models from human preferences[J]. arXiv preprint, arXiv: 1909.08593, 2019
[16]	Mao Xiaofeng, Chen Yuefeng, Jia Xiaojun, et al. Context-aware robust fine-tuning[J]. International Journal of Computer Vision, 2024, 132(5): 1685−1700 doi: 10.1007/s11263-023-01951-2
[17]	Kai Feng, Huang Lan, Wang Kanping, et al. Prompt-based learning framework for zero-shot cross-lingual text classification[J]. Engineering Applications of Artificial Intelligence, 2024, 133(E): 108481
[18]	Buckner C. Understanding adversarial examples requires a theory of artefacts for deep learning[J]. Nature Machine Intelligence, 2020, 2(12): 731−736 doi: 10.1038/s42256-020-00266-y
[19]	Liang Hongshuo, He Erlu, Zhao Yangyang, et al. Adversarial attack and defense: A survey[J]. Electronics, 2022, 11(8): 1283−1301 doi: 10.3390/electronics11081283
[20]	李南,丁益东,江浩宇,等. 面向大语言模型的越狱攻击综述[J]. 计算机研究与发展,2024,61(5):1156−1181 Li Nan, Ding Yidong, Jiang Haoyu, et al. Jailbreak attack for large language models: A survey[J]. Journal of Computer Research and Development, 2024, 61(5): 1156−1181 (in Chinese)
[21]	Sarabadani A, Halfaker A, Taraborelli D. Building automated vandalism detection tools for Wikidata [C]//Proc of the 26th Int Conf on World Wide Web Companion. New York: ACM, 2017: 1647−1654
[22]	Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 30th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2017: 5998−6008
[23]	Lan Jiahe, Wang Jie, Yan Baochen, et al. FlowMur: A stealthy and practical audio backdoor attack with limited knowledge[C]//Proc of the 2024 IEEE Symp on Security and Privacy (SP). Piscataway, NJ: IEEE, 2024: 1646−1664
[24]	Jayaraman B, Ghosh E, Chase M, et al. Combing for credentials: Active pattern extraction from smart reply[C]//Proc of the 2024 IEEE Symp on Security and Privacy (SP). Piscataway, NJ: IEEE, 2024: 1443−1461
[25]	Mor G, Daniel K, Elad S, et al. Did aristotle use a laptop? A question answering benchmark with implicit reasoning strategies[J]. Transations of the Association for Computational Linguistics, 2021, 9: 346−361 doi: 10.1162/tacl_a_00370
[26]	Laurencon H, Saylnier L, Thomas W, et al. The BigScience ROOTS Corpus: A 1.6TB composite multilingual dataset[C]//Proc of the 35th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2022: 31809−31826
[27]	Yuan Sha, Zhao Hanyu, Du Zhengxiao, et al. WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models[J]. AI Open, 2021, 2: 65−68 doi: 10.1016/j.aiopen.2021.06.001
[28]	Wan Jie, Fu Jianhao, Wang Lijin, et al. BounceAttack: A query-efficient decision-based adversarial attack by bouncing into the wild[C]//Proc of the 2024 IEEE Symp on Security and Privacy (SP). San Piscataway, NJ: IEEE, 2024: 1270−1286
[29]	Forest A, Matthew H, Peter S, et al. Learning activation functions to improve deep neural networks [J]. arXiv preprint, arXiv: 1412.6830, 2014
[30]	Henighan T, Kaplan J, Katz M, et al. Scaling laws for autoregressive generative modeling[J]. arXiv preprint, arXiv: 2010.14701, 2020
[31]	Agrawal K, Bhatnagar C. M-SAN: A patch-based transferable adversarial attack using the multi-stack adversarial network[J]. Journal of Electronic Imaging, 2023, 32(2): 023033
[32]	Tao G, Wang Zhenting, Feng Shiwei, et al. Distribution preserving backdoor attack in self-supervised learning[C]//Proc of the 2024 IEEE Symp on Security and Privacy (SP). Piscataway, NJ: IEEE, 2024: 2029−2047
[33]	Priyan V, Zhang Tianyi, Elena L. Expectation vs experience: Evaluating the usability of code generation tools powered by large language models[C]//Proc of the 2022 Chi Conf on Human Factors in Computing Systems Extended Abstracts. New York: ACM, 2022: 332: 1−332: 7
[34]	Zhang Junjie, Xie Ruobing, Hou Yupeng, et al. Recommendation as instruction following: A large language model empowered recommendation approach[J]. arXiv preprint, arXiv: 2305.07001, 2023
[35]	Goodfellow I J, Shlens J, Szegedy C. Explaining and harnessing adversarial examples[J]. arXiv preprint, arXiv: 1412.6572, 2014
[36]	Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks[J]. arXiv preprint, arXiv: 1312.6199, 2013
[37]	Branch H J, Cefalu J R, Mchugh J, et al. Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples[J]. arXiv preprint, arXiv: 2209.02128, 2022
[38]	Perez F, Ribeiro I. Ignore previous prompt: Attack techniques for language models[J]. arXiv preprint, arXiv: 2211.09527, 2022
[39]	Kang D, Li Xuechen, Stoica I, et al. Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks[C] //Proc of the 2024 IEEE Security and Privacy Workshops (SPW). Piscataway, NJ: IEEE, 2024: 132−143
[40]	Toyer S, Watkins O, Mendes E A, et al. Tensor trust: Interpretable prompt injection attacks from an online game[C/OL] //Proc of the 12th Int Conf on Learning Representations. Washington: ICLR, 2024[2024-08-12]. https://openreview.net/forum?id=fsW7wJGLBd
[41]	Nccgroup. Exploring prompt injection attacks: NCC group research blog[EB/OL]. 2024[2024-08-10]. https://research.nccgroup.com/2022/12/05/exploring-prompt-injectionattacks/
[42]	Liu Yi, Deng Gelei, Li Yuekang, et al. Prompt injection attack against LLM-integrated applications[J]. arXiv preprint, arXiv: 2306.05499, 2023
[43]	Pedro R, Castro D, Carreira P, et al. From prompt injections to SQL injection attacks: How protected is your LLM-integrated web application?[J]. arXiv preprint, arXiv: 2308.01990, 2023
[44]	Liu Xiaogeng, Yu Zhiyuan, Zhang Yizhe, et al. Automatic and universal prompt injection attacks against large language models[J]. arXiv preprint, arXiv: 2403.04957, 2024
[45]	Zou A, Wang Zifan, Carlini N, et al. Universal and transferable adversarial attacks on aligned language models[J]. arXiv preprint, arXiv: 2307.15043, 2307
[46]	Shi Jiawen, Yuan Zenghui, Liu Yinuo, et al. Optimization-based prompt injection attack to LLM-as-a-judge[J]. arXiv preprint, arXiv: 2307.15043, 2307
[47]	Liu Yupei, Jia Yuqi, Geng Runpeng, et al. Formalizing and benchmarking prompt injection attacks and defenses[C] //Proc of the 33rd USENIX Security Symp (USENIX Security 24). Berkeley, CA: USENIX Association, 2024: 1831−1847
[48]	Sippo R, Alisia M M, Raghava R M, et al. An early categorization of prompt injection attacks on large language models[J]. arXiv preprint, arXiv: 2402.00898, 2024
[49]	Greshake K, Abdelnabi S, Mishra S, et al. More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models[J]. arXiv preprint, arXiv: 2302.12137, 2023
[50]	Bagdasaryan E, Hsieh T Y, Nassi B, et al. Abusing images and sounds for indirect instruction injection in multi-modal LLMs[J]. arXiv preprint, arXiv: 2302.10490, 2023
[51]	Zhan Qiusi, Liang Zhixiang, Ying Zifan, et al. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents[C]//Proc of the 2024 Findings of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2024: 10471−10506
[52]	Shayegani E, Dong Yue, Abu-Ghazaleh N. Plug and pray: Exploiting off-the-shelf components of multi-modal models[J]. arXiv preprint, arXiv: 2307.14539, 2023
[53]	Liu Yi, Deng Gelei, Xu Zhengzi, et al. Jailbreaking ChatGPT via prompt engineering: An empirical study[J]. arXiv preprint, arXiv: 2305.13860, 2023
[54]	White J, Fu Quchen, Hays S, et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT[J]. arXiv preprint, arXiv: 2302.11382, 2023
[55]	Wei A, Haghtalab N, Steinhardt J. Jailbroken: How does LLM safety training fail?[C]//Proc of the 36th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2024: 14−46
[56]	Yuan Youliang, Jiao Wenxiang, Wang Wenxuan, et al. GPT−4 is too smart to be safe: Stealthy chat with LLMs via cipher[C/OL]//Proc of the 12th Int Conf on Learning Representations. Washington: ICLR, 2024[2024-08-12]. https://openreview.net/forum?id=MbfAK4s61A
[57]	Zheng Yongxin, Menghini C, Bach S. Low-resource languages jailbreak GPT−4[J]. arXiv preprint, arXiv: 2310.02446, 2023
[58]	Jiang Fengqing, Xu Zhangchen, Niu Luyao, et al. ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs[C]//Proc of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2024: 15157−15173
[59]	Li Haoran, Guo Dadi, Fan Wei, et al. Multi-step jailbreaking privacy attacks on ChatGPT[C]//Proc of the 2023 Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg, PA: ACL, 2023: 4138−4153
[60]	Wang Jiongxiao, Liu Zichen, Park K, et al. Adversarial demonstration attacks on large language models[J]. arXiv preprint, arXiv: 2305.14950, 2023
[61]	Wei Zeming, Wang Yifei, Li Ang, et al. Jailbreak and guard aligned language models with only few in-context demonstrations[J]. arXiv preprint, arXiv: 2310.06387, 2023
[62]	Qiang Yao, Zhou Xiangyu, Zhu Dongxiao. Hijacking large language models via adversarial in-context learning[J]. arXiv preprint, arXiv: 2311.09948, 2023
[63]	Shen Xinyue, Chen Zeyuan, Backes M, et al. “Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models[J]. arXiv preprint, arXiv: 2308.03825, 2023
[64]	Li Xuan, Zhou Zhanke, Zhu Jianing, et al. DeepInception: Hypnotize large language model to be jailbreaker[J]. arXiv preprint, arXiv: 2308.03191, 2023
[65]	Alayrac J B, Donahue J, Luc P, et al. Flamingo: A visual language model for few-shot learning[C]//Proc of the 35th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2022: 23716−23736
[66]	Google. Bard[EB/OL]. 2023[2024-08-12]. https://bard.google.com/
[67]	Carlini N, Nasr M, Choquette-Choo C A, et al. Are aligned neural networks adversarially aligned?[C]//Proc of the 36th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2024: 48−79
[68]	Qi Xiangyu, Huang Kaixuan, Panda A, et al. Visual adversarial examples jailbreak large language models[J]. arXiv preprint, arXiv: 2306.13213, 2023
[69]	Schlarmann C, Hein M. On the Adversarial robustness of multi-modal foundation models[C]//Proc of the 2023 IEEE/CVF Int Conf on Computer Vision Workshops (ICCVW). Piscataway, NJ: IEEE, 2023: 3677−3685
[70]	Zhao Yunqing, Pang Tianyu, Du Chao, et al. On evaluating adversarial robustness of large vision-language models[C]//Proc of the 36th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2024: 950−986
[71]	Dong Yinpeng, Chen Huanran, Chen Jiawei, et al. How robust is Google’s bard to adversarial image attacks?[J]. arXiv preprint, arXiv: 2309.11751, 2023
[72]	Deng Gelei, Liu Yi, Li Yuekang, et al. MasterKey: Automated jailbreaking of large language model chatbots[C/OL]//Proc of the 2024 Network and Distributed System Security Symp. Rosten, VA: Internet Society, 2024[2024-08-13]. https://www.ndss-symposium.org/ndss-paper/masterkey-automated-jailbreaking-of-large-language-model-chatbots/
[73]	Yao Dongyu, Zhang Jianshu, Harris I, et al. A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models[C]//Proc of the 2024 IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2024: 19−35
[74]	Yu Jiahao, Lin Xingwei, Yu Zheng, et al. GPTFUZZER: Red teaming large language models with auto-generated jailbreak prompts[J]. arXiv preprint, arXiv: 2309.10253, 2023
[75]	Wang Zimu, Wang Wei, Chen Qi, et al. Generating valid and natural adversarial examples with large language models[C]//Proc of the 27th Int Conf on Computer Supported Cooperative Work in Design (CSCWD). Piscataway, NJ: IEEE, 2024: 37−69
[76]	Chao P, Robey A, Dobriban E, et al. Jailbreaking black box large language models in twenty queries[J]. arXiv preprint, arXiv: 2310.08419, 2023
[77]	Mehrotra A, Zampetakis M, Kassianik P, et al. Tree of Attacks: Jailbreaking black-box LLMs automatically[J]. arXiv preprint, arXiv: 2310.02119, 2023
[78]	Liu Xiaogeng, Xu Nan, Chen Muhao, et al. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models[C/OL]//Proc of the 12th Int Conf on Learning Representations. Washington: ICLR, 2024[2024-08-12]. https://openreview.net/forum?id=7Jwpw4qKkb
[79]	Guo Ping, Liu Fei, Lin Xi, et al. L-AutoDA: Leveraging large language models for automated decision-based adversarial attacks[J]. arXiv preprint, arXiv: 2401.15335, 2024
[80]	Li Shaofeng, Liu Hui, Dong Tian, et al. Hidden backdoors in human-centric language models[C]//Proc of the 2021 ACM SIGSAC Conf on Computer and Communications Security. New York: ACM, 2021: 3123−3140
[81]	Wan A, Wallace E, Shen S, et al. Poisoning language models during instruction tuning[C]//Proc of the 40th Int Conf on Machine Learning. NEW York: PMLR, 2023: 35413−35425
[82]	Xu Jiashu, Ma Mingyu, Wang Fei, et al. Instructions as Backdoors: Backdoor vulnerabilities of instruction tuning for large language models[C]//Proc of the 2024 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2024: 3111−3126
[83]	Yang Wenkai, Bi Xiaohan, Lin Yankai, et al. Watch out for your agents! Investigating backdoor threats to LLM-based agents[J]. arXiv preprint, arXiv: 2402.11208, 2024
[84]	Hubinger E, Denison C, Mu J, et al. Sleeper agents: Training deceptive LLMs that persist through safety training[J]. arXiv preprint, arXiv: 2401.05566, 2024
[85]	Shi Jiawen, Liu Yixin, Zhou Pan, et al. Badgpt: Exploring security vulnerabilities of ChatGPT via backdoor attacks to instructgpt[J]. arXiv preprint, arXiv: 2304.12298, 2023
[86]	Dong Tian, Xue Minhui, Chen Guoxing, et al. Unleashing cheapfakes through trojan plugins of large language models[J]. arXiv preprint, arXiv: 2312.00374, 2023
[87]	Xiang Zhen, Jiang Fengqing, Xiong Zidi, et al. Badchain: Backdoor chain-of-thought prompting for large language models[C/OL]//Proc of the 12th Int Conf on Learning Representations. Washington: ICLR, 2024[2024-08-13]. https://openreview.net/forum?id=c93SBwz1Ma
[88]	朱素霞,王金印,孙广路. 基于感知相似性的多目标优化隐蔽图像后门攻击[J]. 计算机研究与发展,2024,61(5):1182−1192 Zhu Suxia, Wang Jinyin, Sun Guanglu. Perceptual similarity-based multi-objective optimization for stealthy image backdoor attack[J]. Journal of Computer Research and Development, 2024, 61(5): 1182−1192 (in Chinese)
[89]	Gabriel A, Michael K. Detecting language model attacks with perplexity[J]. arXiv preprint, arXiv: 2308.14132, 2023
[90]	Hu Zhengmian, Wu Gang, Saayan M, et al. Token-level adversarial prompt detection based on perplexity measures and contextual information[J]. arXiv preprint, arXiv: 2311.11509, 2023
[91]	Alon G, Kamfonas M. Detecting language model attacks with perplexity[J]. arXiv preprint, arXiv: 2308.14132, 2023
[92]	Lapid R, Langberg R, Sipper M. Open Sesame! Universal black box jailbreaking of large language models[J]. arXiv preprint, arXiv: 2309.01446, 2023
[93]	Zhu Sicheng, Zhang Ruiyi, An Bang, et al. Autodan: Automatic and interpretable adversarial attacks on large language models[J]. arXiv preprint, arXiv: 2310.15140, 2023
[94]	Robey A, Wong E, Hassani H, et al. SmoothLLM: Defending large language models against jailbreaking attacks[J]. arXiv preprint, arXiv: 2310.03684, 2023
[95]	Phute M, Helbling A, Hull M, et al. LLM Self Defense: By self examination, LLMs know they are being tricked[C/OL]//Proc of the 2th Tiny Papers Track at ICLR 2024. Washington: ICLR, 2024[2024-08-14]. https://openreview.net/forum?id=YoqgcIA19o
[96]	Glukhov D, Shumailov I, Gal Y, et al. LLM Censorship: A machine learning challenge or a computer security problem?[J]. arXiv preprint, arXiv: 2307.10719, 2023
[97]	Kumar A, Agarwal C, Srinivas S, et al. Certifying LLM safety against adversarial prompting[J]. arXiv preprint, arXiv: 2309.02705, 2023
[98]	Jain N, Schwarzschild A, Wen Y, et al. Baseline defenses for adversarial attacks against aligned language models[J]. arXiv preprint, arXiv: 2309.00614, 2023
[99]	Cao Bochuan, Cao Yuanpu, Lin Lu, et al. Defending against alignment-breaking attacks via robustly aligned LLM[C]//Proc of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2024: 10542−10560
[100]	Wang Hao, Li Hao, Huang Minlie, et al. From noise to clarity: Unraveling the adversarial suffix of large language model attacks via translation of text embeddings[J]. arXiv preprint, arXiv: 2402.16006, 2024
[101]	Ji Jiabao, Hou Bairu, Alexander R, et al. Defending large language models against jailbreak attacks via semantic smoothing[J]. arXiv preprint, arXiv: 2402.16192, 2024
[102]	Liu Xiaodong, Cheng Hao, He Pengcheng, et al. Adversarial training for large neural language models[J]. arXiv preprint, arXiv: 2004.08994, 2020
[103]	Ganguli D, Lovitt L, Kernion J, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned[J]. arXiv preprint, arXiv: 2209.07858, 2022
[104]	Perez E, Huang S, Song F, et al. Red teaming language models with language models[C]//Proc of the 2022 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2022: 3419−3448
[105]	Mitchell E, Lin C, Bosselut A. Memory-based model editing at scale [C]//Proc of the 39th Int Conf on Machine Learning. Stroudsburg, PA: ACL 2022: 15817−15831
[106]	Huang Zeyu, Shen Yikang, Zhang Xiaofeng, et al. Transformer-patcher: One mistake worth one neuron[C]//Proc of the 11th Int Conf on Learning Representations. Washington: ICLR, 2024[2024-08-14]. https://openreview.net/forum?id=4oYUGeGBPm
[107]	Meng K, Bau D, Andonian A, et al. Locating and editing factual associations in GPT[C]//Proc of the 35th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2022: 17359−17372
[108]	Meng K, Sharma A S, Andonian A, et al. Mass-editing memory in a transformer[C]//Proc of the 11th Int Conf on Learning Representations. Washington: ICLR, 2023[2024-08-13]. https://openreview.net/forum?id=MkbcAHIYgyS
[109]	王梦如,姚云志,习泽坤,等. 基于知识编辑的大模型内容生成安全分析[J]. 计算机研究与发展,2024,61(5):1143−1155 Wang Mengru, Yao Yunzhi, Xi Zekun, et al. Safety analysis of large model content generation based on knowledge editing[J]. Journal of Computer Research and Development, 2024, 61(5): 1143−1155 (in Chinese)
[110]	Li Yuhui, Wei Fangyun, Zhao Jinjing, et al. RAIN: Your language models can align themselves without finetuning[C/OL]//Proc of the 12th Int Conf on Learning Representations. Washington: ICLR, 2024[2024-08-13]. https://openreview.net/forum?id=pETSfWMUzy
[111]	Kim H, Yuk S, Cho H. Break the breakout: Reinventing LM defense against jailbreak attacks with self-refinement[J]. arXiv preprint, arXiv: 2402.15180, 2024
[112]	Zhou Yujun, Han Yufei, Zhuang Haomin, et al. Defending jailbreak prompts via in-context adversarial game[J]. arXiv preprint, arXiv: 2402.13148, 2024
[113]	Xu Zhangchen, Jiang Fengqing, Niu Luyao, et al. SafeDecoding: Defending against jailbreak attacks via safety-aware decoding[C]//Proc of the 62nd Annual Meeting of the ACL (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2024: 5587–5605
[114]	Zhang Zhexin, Yang Junxiao, Ke Pei, et al. Defending large language models against jailbreaking attacks through goal prioritization[C]//Proc of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2024: 8865–8887
[115]	Jin Haibo, Hu Leyang, Li Xinuo , et al. JailbreakZoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models[J]. arXiv preprint, arXiv: 2407.01599, 2024
[116]	Ying Zonghao, Liu Aishan, Liu Xianglong, et al. Unveiling the safety of GPT−4o: An empirical study using jailbreak attacks[J]. arXiv preprint, arXiv: 2406.06302, 2024
[117]	Wang Junyang, Xu Haiyang, Jia Haitao, et al. Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception[J]. arXiv preprint, arXiv: 2401.16158, 2024