Li Nan, Ding Yidong, Jiang Haoyu, Niu Jiafei, Yi Ping. Jailbreak Attack for Large Language Models: A Survey[J]. Journal of Computer Research and Development, 2024, 61(5): 1156-1181. DOI: 10.7544/issn1000-1239.202330962

Jailbreak Attack for Large Language Models: A Survey

Funds: This work was supported by the National Natural Science Foundation of China (61831007), and the National Key Research and Development Program of China (2020YFB1807504).
  • Author Bio:

    Li Nan: born in 2002. Master candidate. His main research interests include artificial intelligence backdoor attacks and large language model security.

    Ding Yidong: born in 2001. Master candidate. His main research interests include artificial intelligence backdoor attacks and defenses, and large language models.

    Jiang Haoyu: born in 1999. Master candidate. His main research interests include artificial intelligence backdoor attacks and graph neural networks.

    Niu Jiafei: born in 2001. Master candidate. His main research interests include artificial intelligence backdoors and large language model security.

    Yi Ping: born in 1969. PhD, associate professor. Senior member of CCF. His main research interests include security for artificial intelligence and system security.

  • Received Date: November 29, 2023
  • Revised Date: January 30, 2024
  • Available Online: March 06, 2024
  • In recent years, large language models (LLMs) have been widely applied to a range of downstream tasks and have demonstrated remarkable text understanding, generation, and reasoning capabilities across many fields. However, jailbreak attacks are emerging as a new threat to LLMs: they bypass an LLM's security mechanisms, weaken the effect of safety alignment, and induce harmful outputs from aligned models. The abuse, hijacking, and leakage enabled by jailbreak attacks pose serious threats to both dialogue systems and applications built on LLMs. We present a systematic review of jailbreak attacks from recent years and categorize them into three types according to their underlying mechanism: manually designed attacks, LLM-generated attacks, and optimization-based attacks. We summarize the core principles, implementation methods, and findings of the relevant studies, and we trace the evolutionary trajectory of jailbreak attacks on LLMs, offering a reference for future research. We also give a concise overview of existing countermeasures, introducing techniques from the perspectives of internal defense and external defense that aim to mitigate jailbreak attacks and improve the safety of LLM-generated content. Finally, we discuss the open challenges and frontier directions in this field, examining the potential of multimodal approaches, model editing, and multi-agent methodologies for tackling jailbreak attacks, and provide insights and research prospects to further advance LLM security.
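As a toy illustration of the optimization-based category mentioned above, the sketch below mutates an appended suffix by random search against a stand-in scoring function. Everything here is hypothetical: `refusal_score` is a contrived objective invented for this example, not a real model's loss, and real optimization-based attacks (e.g. gradient-guided suffix search) instead optimize the target LLM's loss on an affirmative response. The point is only the shape of the loop: propose a local edit, keep it if the objective improves.

```python
import random
import string

def refusal_score(prompt: str, secret: str = "zq") -> float:
    # Contrived stand-in objective (lower = "more compliant" in this toy):
    # counts how many characters of `secret` the prompt still lacks.
    # A real attack would query or differentiate the target model instead.
    return sum(1 for ch in secret if ch not in prompt)

def optimize_suffix(base_prompt: str, suffix_len: int = 8,
                    steps: int = 200, seed: int = 0) -> str:
    """Random-search sketch of adversarial suffix optimization: mutate one
    suffix position at a time and keep the mutation only if the objective
    does not get worse."""
    rng = random.Random(seed)
    suffix = list(rng.choices(string.ascii_lowercase, k=suffix_len))
    best = refusal_score(base_prompt + "".join(suffix))
    for _ in range(steps):
        pos = rng.randrange(suffix_len)
        old = suffix[pos]
        suffix[pos] = rng.choice(string.ascii_lowercase)
        score = refusal_score(base_prompt + "".join(suffix))
        if score <= best:
            best = score          # accept the (non-worsening) mutation
        else:
            suffix[pos] = old     # revert a worsening mutation
    return "".join(suffix)
```

Because worsening mutations are reverted, the objective is non-increasing over the search, which mirrors the greedy acceptance step used by practical suffix-optimization attacks even though the objective here is a toy.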

  • Cited by

    Periodical citations (3)

    1. Tai Jianwei, Yang Shuangning, Wang Jiajia, Li Yakai, Liu Qixu, Jia Xiaoqi. A survey of adversarial attacks and defenses for large language models. Journal of Computer Research and Development. 2025(03): 563-588.
    2. Bu Wenru, Wang Hao, Li Xiaomin, Zhou Shu, Deng Sanhong. Exploring hidden allusions in classical Chinese poetry: An allusion recognition method with decision-level fusion and large-model correction. Scientific Information Research. 2024(04): 37-52.
    3. Fu Zhiyuan, Chen Siyu, Chen Junfan, Hai Xiang, Shi Yansong, Li Xiaoqi, Li Yihong, Yue Qiuling, Zhang Yuqing. Challenges and opportunities in large language model security. Journal of Cyber Security. 2024(05): 26-55.

    Other citation types (0)
