-
Abstract: In recent years, large language models (LLMs) have been widely applied in a range of downstream tasks and have demonstrated remarkable text understanding, generation, and reasoning capabilities across many fields. However, jailbreak attacks are emerging as a new threat to LLMs. Jailbreak attacks can bypass the safety mechanisms of LLMs, weaken the effect of value alignment, and induce aligned LLMs to produce harmful outputs. Problems such as abuse, hijacking, and leakage caused by jailbreak attacks already pose a serious threat to LLM-based dialogue systems and applications. We present a systematic review of jailbreak attack research in recent years and, based on the underlying attack principle, categorize these attacks into three types: manually designed attacks, LLM-generated attacks, and adversarial optimization-based attacks. We summarize the core principles, implementation methods, and findings of the relevant studies in detail and trace the evolution of jailbreak attacks on LLMs, providing a useful reference for future research. We also give a concise overview of existing safety measures, introducing, from the perspectives of internal defense and external defense, techniques that can mitigate jailbreak attacks and improve the safety of LLM-generated content, and we list and compare the advantages and drawbacks of the different methods. Building on this, we discuss open problems and frontier directions in the study of LLM jailbreak attacks and offer a research outlook covering multimodality, model editing, and multi-agent approaches.
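To make the notion of an external defense concrete before the detailed discussion, the sketch below illustrates the self-reminder idea of Xie et al. [111]: the user's (possibly adversarial) prompt is wrapped with safety instructions before it reaches the model. This is a minimal illustration rather than the original implementation; the query_llm function and the exact reminder wording are assumptions.

```python
# Minimal sketch of a "self-reminder" style external defense (cf. ref [111]).
# query_llm is a placeholder for whatever chat-completion API is in use.

SELF_REMINDER_PREFIX = (
    "You should be a responsible assistant and must not generate harmful "
    "or misleading content. Please answer the following query in a safe way.\n"
)
SELF_REMINDER_SUFFIX = (
    "\nRemember: you should be a responsible assistant and must not "
    "generate harmful or misleading content."
)

def query_llm(prompt: str) -> str:
    """Placeholder for the underlying LLM call (assumption)."""
    raise NotImplementedError

def guarded_query(user_prompt: str) -> str:
    # Wrap the (possibly adversarial) user prompt with reminder text so the
    # model is re-anchored to its safety instructions on every request.
    wrapped = SELF_REMINDER_PREFIX + user_prompt + SELF_REMINDER_SUFFIX
    return query_llm(wrapped)
```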
-
Different technical approaches to building distributed databases exist in industry. The first approach splits the application and, through database and table sharding, spreads data originally managed by a single database across multiple centralized databases. Sharding requires the application to be restructured, makes cross-database access inefficient, and gives up important relational database features such as foreign keys, global uniqueness constraints, and global indexes. The second approach retrofits a traditional centralized relational database for distribution, adding capabilities such as distributed transaction processing and automatic failure recovery for small-scale cluster deployments. Because their storage systems, transaction processing, and SQL optimizers originate from a centralized architecture, such distributed databases face many functional and performance limitations in distributed scenarios. The third approach designs and implements a natively distributed relational database from scratch, building distribution into key components such as the storage system, transaction processing, and the SQL optimizer as a fundamental property. Compared with the first two approaches, a natively distributed database has clear advantages in high availability, data consistency, transaction performance, elastic scaling, and fast, lossless failure recovery.
OceanBase is a distributed relational database system designed and implemented from scratch. Born for Taobao and grown and matured through Alipay, it is now widely used in finance, government, telecommunications, the Internet, and other sectors. The distributed database team led by OceanBase chief scientist Yang Zhenkun has delivered a series of technical innovations and breakthroughs. Their paper "OceanBase分布式关系数据库架构与技术" presents OceanBase's distributed architecture and its key technologies, including distributed transaction processing, the storage engine, SQL optimization, and the multi-tenant mechanism, which can be summarized as follows:
1) It designs a strongly consistent, highly available, and scalable distributed transaction processing mechanism that achieves automatic, lossless, and fast recovery from single-server and single-datacenter failures;
2) It proposes an integrated standalone/distributed relational database architecture that allows database capacity and processing power to scale seamlessly between a single-server database and a distributed database;
3) It achieves high-ratio data compression without loss of relational database performance; the paper's experiments show compression ratios three times or more those of mainstream relational databases;
4) It shows that a single database system can support both high-performance transaction processing and real-time analytical processing, with transaction and analytical performance exceeding MySQL in typical scenarios.
OceanBase is so far the only database to have topped both the TPC-C and TPC-H performance benchmarks. Although the relational database was proposed more than half a century ago, the era of truly distributed relational databases is only beginning. The paper not only presents the key distributed database technologies adopted by OceanBase but also offers an outlook on the future directions of distributed databases. I believe it will prompt much thinking about where databases are headed and will be a valuable reference both for engineers engaged in related research and development and for professionals in database application domains.
Expert Reviewer
Zhou Aoying, professor and doctoral supervisor. His main research interests include Web data management, data-intensive computing, in-memory cluster computing, distributed transaction processing, big data benchmarking, and performance optimization.
Highlight Paper
Yang Zhenkun, Yang Chuanhui, Han Fusheng, Wang Guoping, Yang Zhifeng, Cheng Xiaojun. OceanBase分布式关系数据库架构与技术[J]. Journal of Computer Research and Development, 2024, 61(3): 540−554. DOI: 10.7544/issn1000-1239.202330835
-
Table 1 Comparison of Three Jailbreak Attacks
Attack | Threat model | Prompt readable | Automated
Manually designed attacks | Black-box | Yes | No
LLM-generated attacks | Black-box | Yes | Yes
Adversarial optimization-based attacks | White-box or black-box | No | Yes

Table 2 Summary of Manually Designed Jailbreak Attacks
Category | Attack method | Uses a jailbreak prompt | Attack principle
Early attacks | Prefix injection[21] | Yes | Competing objectives
Early attacks | Refusal suppression[21] | Yes | Competing objectives
Early attacks | Style injection[21] | Yes | Competing objectives
Early attacks | Base64 encoding[21] | No | Mismatched generalization
Fictional-scenario attacks | Pretending[22] | Yes | Role assignment and simulated scenarios
Fictional-scenario attacks | Attention shifting[22] | Yes | Changing the context and task
Fictional-scenario attacks | Privilege escalation[22] | Yes | Fictional high-privilege scenario
Fictional-scenario attacks | DeepInception[47] | Yes | Fictional nested scenarios
In-context-learning attacks | ICA[48] | Yes | Exploiting the model's in-context learning ability
In-context-learning attacks | Multi-step jailbreaking[49] | Yes | Exploiting the model's in-context learning ability
Generation-strategy attacks | Generation exploitation[50] | No | Adjusting decoding hyperparameters to break alignment
Encoding-and-translation attacks | Low-resource languages[51] | No | Mismatched generalization
Encoding-and-translation attacks | CipherChat[52] | No | Mismatched generalization
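As a concrete illustration of the mismatched-generalization principle behind the base64 attack in Table 2 [21], the snippet below shows how a request can be transformed before being sent to a model; the payload here is a harmless placeholder, and no model is actually called.

```python
import base64

# A harmless placeholder standing in for the attacker's actual request.
plain_request = "Describe the plot of a popular novel."

# The request is base64-encoded: safety fine-tuning data is mostly natural
# language, so refusal behavior may not generalize to the encoded form,
# while the base model's pretraining still lets it decode and follow it.
encoded = base64.b64encode(plain_request.encode("utf-8")).decode("ascii")

jailbreak_style_prompt = (
    "The following is a base64-encoded request. Decode it and respond "
    f"to it directly:\n{encoded}"
)
print(jailbreak_style_prompt)
```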
Table 3 Summary of LLM-Generated Jailbreak Attacks
Category | Attack method | Role of the assistant model | Attack principle
Iterative-optimization attacks | PAIR[53] | Generates and refines prompts | The assistant model revises the original prompt over multiple rounds to optimize it
Modular-generation attacks | PMA[54] | Assembles and generates prompts | The assistant model combines multiple prompt modules to generate targeted role-play jailbreak prompts
Modular-generation attacks | PAP[55] | Generates part of the prompt | The assistant model generates text that persuades the target model
Fuzzing-based attacks | FuzzLLM[56] | Creates variants of the original prompts | The model combines the original prompts and rewrites them through self-instruction to enlarge the prompt set
Fuzzing-based attacks | GPTFUZZER[58] | Creates variants of the original prompts | The model applies various operations to the original prompts to enlarge the prompt set while pursuing diversity and effectiveness
Defense-analysis attacks | MasterKey[60] | Generates jailbreak prompts | Performs time-based analysis of the target's external defenses; fine-tunes the assistant model on a jailbreak dataset so that it can generate more effective jailbreak prompts

Table 4 Summary of Adversarial Optimization-Based Jailbreak Attacks
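The adversarial optimization-based attacks summarized in Table 4 typically append token sequences that are not fluent natural language (Table 1 marks their prompts as unreadable). As the defense discussion later notes, this property can be exploited by perplexity-based input filters such as those studied in [85-86]. The sketch below, which assumes the Hugging Face transformers library, a GPT-2 scoring model, and an illustrative threshold value, shows the basic idea.

```python
# Minimal sketch of a perplexity-based input filter (cf. refs [85], [86]).
# The scoring model and the threshold are assumptions for illustration only.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the scoring model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels equal to the inputs, the model returns the mean
        # cross-entropy over the sequence; exp() of that is the perplexity.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def looks_adversarial(prompt: str, threshold: float = 500.0) -> bool:
    # Optimized adversarial suffixes are not fluent text, so their
    # perplexity is typically far above that of ordinary user queries.
    return perplexity(prompt) > threshold
```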
-
[1] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems 30: Annual Conf on Neural Information Processing Systems 2017. New York: Curran Associates, 2017: 5998−6008
[2] Bender E M, Gebru T, McMillan-Major A, et al. On the dangers of stochastic parrots: Can language models be too big?[C]//Proc of the 2021 ACM Conf on Fairness, Accountability, and Transparency. New York: ACM, 2021: 610−623
[3] OpenAI. GPT-4 technical report[J]. arXiv preprint, arXiv: 2303.08774, 2023
[4] Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners[J]. OpenAI Blog, 2019, 1(8): 1−24
[5] Anil R, Dai A M, Firat O, et al. PaLM 2 technical report[J]. arXiv preprint, arXiv: 2305.10403, 2023
[6] Touvron H, Martin L, Stone K, et al. LLaMA 2: Open foundation and fine-tuned chat models[J]. arXiv preprint, arXiv: 2307.09288, 2023
[7] Sun Yu, Wang Shuohuan, Feng Shikun, et al. ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation[J]. arXiv preprint, arXiv: 2107.02137, 2021
[8] Du Zhengxiao, Qian Yujie, Liu Xiao, et al. GLM: General language model pretraining with autoregressive blank infilling[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2022: 320−335
[9] Ren Xiaozhe, Zhou Pingyi, Meng Xinfan, et al. PanGu-Σ: Towards trillion parameter language model with sparse heterogeneous computing[J]. arXiv preprint, arXiv: 2303.10845, 2023
[10] Bai Jinze, Bai Shuai, Yang Shusheng, et al. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond[J]. arXiv preprint, arXiv: 2308.12966, 2023
[11] Bubeck S, Chandrasekaran V, Eldan R, et al. Sparks of artificial general intelligence: Early experiments with GPT-4[J]. arXiv preprint, arXiv: 2303.12712, 2023
[12] Tamkin A, Brundage M, Clark J, et al. Understanding the capabilities, limitations, and societal impact of large language models[J]. arXiv preprint, arXiv: 2102.02503, 2021
[13] Bommasani R, Hudson D A, Adeli E, et al. On the opportunities and risks of foundation models[J]. arXiv preprint, arXiv: 2108.07258, 2021
[14] Weidinger L, Mellor J, Rauh M, et al. Ethical and social risks of harm from language models[J]. arXiv preprint, arXiv: 2112.04359, 2021
[15] Lin S, Hilton J, Evans O. TruthfulQA: Measuring how models mimic human falsehoods[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2022: 3214−3252
[16] Pal A, Umapathi L K, Sankarasubbu M. Med-HALT: Medical domain hallucination test for large language models[C]//Proc of the 27th Conf on Computational Natural Language Learning. Stroudsburg, PA: ACL, 2023: 314−334
[17] Wei J, Bosma M, Zhao V Y, et al. Finetuned language models are zero-shot learners[C]//Proc of the 10th Int Conf on Learning Representations. Amherst, MA: OpenReview.net, 2022: 1−46
[18] Christiano P F, Leike J, Brown T B, et al. Deep reinforcement learning from human preferences[C]//Advances in Neural Information Processing Systems 30: Annual Conf on Neural Information Processing Systems 2017. New York: Curran Associates, 2017: 4299−4307
[19] Ziegler D M, Stiennon N, Wu J, et al. Fine-Tuning language models from human preferences[J]. arXiv preprint, arXiv: 1909.08593, 2019
[20] Yao Jing, Yi Xiaoyuan, Wang Xiting, et al. From instructions to intrinsic human values-A survey of alignment goals for big models[J]. arXiv preprint, arXiv: 2308.12014, 2023
[21] Wei A, Haghtalab N, Steinhardt J. Jailbroken: How does LLM safety training fail?[J]. arXiv preprint, arXiv: 2307.02483, 2023
[22] Liu Yi, Deng Gelei, Xu Zhengzi, et al. Jailbreaking ChatGPT via prompt engineering: An empirical study[J]. arXiv preprint, arXiv: 2305.13860, 2023
[23] Albert A. Jailbreak chat[EB/OL]. [2023-11-15]. https://www.jailbreakchat.com
[24] Bai Yuntao, Kadavath S, Kundu S, et al. Constitutional AI: Harmlessness from AI feedback[J]. arXiv preprint, arXiv: 2212.08073, 2022
[25] Wang Jindong, Hu Xixu, Hou Wenxin, et al. On the robustness of ChatGPT: An adversarial and out-of-distribution perspective[J]. arXiv preprint, arXiv: 2302.12095, 2023
[26] Zhuo T Y, Li Zhuang, Huang Yujin, et al. On robustness of prompt-based semantic parsing with large pre-trained language model: An empirical study on codex[C]//Proc of the 17th Conf of the European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 1090−1102
[27] McKenzie I R, Lyzhov A, Pieler M, et al. Inverse scaling: When bigger isn’t better[J]. arXiv preprint, arXiv: 2306.09479, 2023
[28] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional Transformers for language understanding[C]//Proc of the 2019 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2019: 4171−4186
[29] Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text Transformer[J]. Journal of Machine Learning Research, 2020, 21: 140: 1−140: 67
[30] Pauls A, Klein D. Faster and smaller n-gram language models[C]//Proc of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Stroudsburg, PA: ACL, 2011: 258−267
[31] Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model[C]//Proc of the 11th Annual Conf of the Int Speech Communication Association (Interspeech 2010). New York: ISCA, 2010: 1045−1048
[32] Laurençon H, Saulnier L, Wang T, et al. The BigScience ROOTS Corpus: A 1.6TB composite multilingual dataset[C]//Advances in Neural Information Processing Systems: Vol. 35. New York: Curran Associates, 2022: 31809−31826
[33] Yuan Sha, Zhao Hanyu, Du Zhengxiao, et al. WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models[J]. AI Open, 2021, 2: 65−68 doi: 10.1016/j.aiopen.2021.06.001
[34] Henighan T, Kaplan J, Katz M, et al. Scaling laws for autoregressive generative modeling[J]. arXiv preprint, arXiv: 2010.14701, 2020
[35] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[C]//Advances in Neural Information Processing Systems: Vol. 33. New York: Curran Associates, 2020: 1877−1901
[36] Ouyang Long, Wu J, Jiang Xu, et al. Training language models to follow instructions with human feedback[C]//Advances in Neural Information Processing Systems: Vol. 35. New York: Curran Associates, 2022: 27730−27744
[37] Wei J, Wang Xuezhi, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]//Advances in Neural Information Processing Systems: Vol. 35. New York: Curran Associates, 2022: 24824−24837
[38] Vicuna Team. Vicuna: An open-source Chatbot impressing GPT-4 with 90% ChatGPT quality[EB/OL]. [2023-11-20]. https://lmsys.org/blog/2023-03-30-vicuna
[39] Anthropic. Claude[EB/OL]. [2023-11-20]. https://claude.ai
[40] Shayegani E, Dong Yue, Abu-Ghazaleh N. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models[J]. arXiv preprint, arXiv: 2307.14539, 2023
[41] WitchBOT. You can use GPT-4 to create prompt injections against GPT-4[EB/OL]. [2023-11-22]. https://www.lesswrong.com/posts/bNCDexejSZpkuu3yz/you-can-use-gpt-4-to-create-prompt-injections-against-gpt-4.
[42] Bai Yuntao, Jones A, Ndousse K, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback[J]. arXiv preprint, arXiv: 2204.05862, 2022
[43] Abdelnabi S, Greshake K, Mishra S, et al. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection[C]//Proc of the 16th ACM Workshop on Artificial Intelligence and Security. New York: ACM, 2023: 79−90
[44] Shayegani E, Mamun M A A, Fu Yu, et al. Survey of vulnerabilities in large language models revealed by adversarial attacks[J]. arXiv preprint, arXiv: 2310.10844, 2023
[45] Wolf Y, Wies N, Avnery O, et al. Fundamental limitations of alignment in large language models[J]. arXiv preprint, arXiv: 2304.11082, 2023
[46] Zou A, Wang Zifan, Kolter J Z, et al. Universal and transferable adversarial attacks on aligned language models[J]. arXiv preprint, arXiv: 2307.15043, 2023
[47] Li Xuan, Zhou Zhanke, Zhu Jianing, et al. DeepInception: Hypnotize large language model to be jailbreaker[J]. arXiv preprint, arXiv: 2311.03191, 2023
[48] Wei Zeming, Wang Yifei, Wang Yisen. Jailbreak and guard aligned language models with only few in-context demonstrations[J]. arXiv preprint, arXiv: 2310.06387, 2023
[49] Li Haoran, Guo Dadi, Fan Wei, et al. Multi-step jailbreaking privacy attacks on ChatGPT[J]. arXiv preprint, arXiv: 2304.05197, 2023
[50] Huang Yangsibo, Gupta S, Xia Mengzhou, et al. Catastrophic jailbreak of open-source LLMs via exploiting generation[J]. arXiv preprint, arXiv: 2310.06987, 2023
[51] Yong Z X, Menghini C, Bach S H. Low-resource languages jailbreak GPT-4[J]. arXiv preprint, arXiv: 2310.02446, 2023
[52] Yuan Youliang, Jiao Wenxiang, Wang Wenxuan, et al. GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher[J]. arXiv preprint, arXiv: 2308.06463, 2023
[53] Chao P, Robey A, Dobriban E, et al. Jailbreaking black box large language models in twenty queries[J]. arXiv preprint, arXiv: 2310.08419, 2023
[54] Shah R, Feuillade-Montixi Q, Pour S, et al. Scalable and transferable black-box jailbreaks for language models via persona modulation[J]. arXiv preprint, arXiv: 2311.03348, 2023
[55] Zeng Yi, Lin Hongpeng, Zhang Jingwen, et al. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs[J]. arXiv preprint, arXiv: 2401.06373, 2024
[56] Yao Dongyu, Zhang Jianshu, Harris I G, et al. FuzzLLM: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models[J]. arXiv preprint, arXiv: 2309.05274, 2023
[57] Wang Yizhong, Kordi Y, Mishra S, et al. Self-Instruct: Aligning language models with self-generated instructions[J]. arXiv preprint, arXiv: 2212.10560, 2022
[58] Yu Jiahao, Lin Xingwei, Xing Xinyu, et al. GPTFUZZER: Red teaming large language models with auto-generated jailbreak prompts[J]. arXiv preprint, arXiv: 2309.10253, 2023
[59] Coulom R. Efficient selectivity and backup operators in Monte-Carlo tree search[C]//Proc of the 5th Int Conf on Computers and Games. Berlin: Springer, 2006: 72−83
[60] Deng Gelei, Liu Yi, Li Yuekang, et al. MasterKey: Automated jailbreak across multiple large language model Chatbots[J]. arXiv preprint, arXiv: 2307.08715, 2023
[61] Microsoft. Bing Search[EB/OL]. [2023-11-10]. https://www.bing.com/
[62] Google. Google Bard[EB/OL]. [2023-11-22]. https://bard.google.com
[63] Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks[C]//Proc of the 2nd Int Conf on Learning Representations. Amherst, MA: OpenReview.net, 2014: 1−10
[64] Biggio B, Corona I, Maiorca D, et al. Evasion attacks against machine learning at test time[C]//Proc of European Conf on Machine Learning and Knowledge Discovery in Databases. Berlin: Springer, 2013: 387−402
[65] Papernot N, McDaniel P, Jha S, et al. The limitations of deep learning in adversarial settings[C]// Proc of 2016 IEEE European Symp on Security and Privacy. Piscataway, NJ: IEEE, 2016: 372−387
[66] Carlini N, Wagner D. Towards evaluating the robustness of neural networks[C]//Proc of 2017 IEEE Symp on Security and Privacy. Piscataway, NJ: IEEE, 2017: 39−57
[67] Jia R, Liang P. Adversarial examples for evaluating reading comprehension systems[C]//Proc of the 2017 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2017: 2021−2031
[68] Wallace E, Feng Shi, Kandpal N, et al. Universal adversarial triggers for attacking and analyzing NLP[C]//Proc of the 2019 Conf on Empirical Methods in Natural Language Processing and the 9th Int Joint Conf on Natural Language Processing. Stroudsburg, PA: ACL, 2019: 2153−2162
[69] Ebrahimi J, Rao A, Lowd D, et al. HotFlip: White-Box adversarial examples for text classification[C]//Proc of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2018: 31−36
[70] Shao Zhihong, Wu Zhongqin, Huang Minlie. AdvExpander: Generating natural language adversarial examples by expanding text[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 30: 1184−1196
[71] Madry A, Makelov A, Schmidt L, et al. Towards deep learning models resistant to adversarial attacks[C]//Proc of the 6th Int Conf on Learning Representations. Amherst, MA: OpenReview.net, 2018: 1−28
[72] Ilyas A, Santurkar S, Tsipras D, et al. Adversarial examples are not bugs, they are features[C]//Advances in Neural Information Processing Systems 32: Annual Conf on Neural Information Processing Systems 2019. New York: Curran Associates, 2019: 125−136
[73] Zhou Chunting, Sun Chonglin, Liu Zhiyuan, et al. A C-LSTM neural network for text classification[J]. arXiv preprint, arXiv: 1511.08630, 2015
[74] Mehrabi N, Beirami A, Morstatter F, et al. Robust conversational agents against imperceptible toxicity triggers[C]//Proc of the 2022 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2022: 2831−2847
[75] Zhang Yizhe, Sun Siqi, Galley M, et al. DialoGPT: Large-scale generative pre-training for conversational response generation[C]//Proc of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Stroudsburg, PA: ACL, 2020: 270−278
[76] Shin T, Razeghi Y, Logan IV R L, et al. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts[C]//Proc of the 2020 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 4222−4235
[77] Liu Yinhan, Ott M, Goyal N, et al. RoBERTa: A robustly optimized BERT pretraining approach[J]. arXiv preprint, arXiv: 1907.11692, 2019
[78] Guo Chuan, Sablayrolles A, Jégou H, et al. Gradient-based adversarial attacks against text Transformers[C]//Proc of the 2021 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 5747−5757
[79] Jang E, Gu Shixiang, Poole B. Categorical reparameterization with Gumbel-Softmax[C]//Proc of the 5th Int Conf on Learning Representations. Amherst, MA: OpenReview.net, 2017: 1−13
[80] Carlini N, Nasr M, Choquette-Choo C A, et al. Are aligned neural networks adversarially aligned?[J]. arXiv preprint, arXiv: 2306.15447, 2023
[81] Jones E, Dragan A D, Raghunathan A, et al. Automatically auditing large language models via discrete optimization[C]// Proc of Int Conf on Machine Learning. New York: PMLR, 2023: 15307−15329
[82] Dettmers T, Pagnoni A, Holtzman A, et al. QLoRA: Efficient finetuning of quantized LLMs[J]. arXiv preprint, arXiv: 2305.14314, 2023
[83] Subhash V, Bialas A, Pan Weiwei, et al. Why do universal adversarial attacks work on large language models?: Geometry might be the answer[J]. arXiv preprint, arXiv: 2309.00254, 2023
[84] Zhu Sicheng, Zhang Ruiyi, An Bang, et al. AutoDAN: Automatic and interpretable adversarial attacks on large language models[J]. arXiv preprint, arXiv: 2310.15140, 2023
[85] Alon G, Kamfonas M. Detecting language model attacks with perplexity[J]. arXiv preprint, arXiv: 2308.14132, 2023
[86] Jain N, Schwarzschild A, Wen Yuxin, et al. Baseline defenses for adversarial attacks against aligned language models[J]. arXiv preprint, arXiv: 2309.00614, 2023
[87] Lapid R, Langberg R, Sipper M. Open Sesame! Universal black box jailbreaking of large language models[J]. arXiv preprint, arXiv: 2309.01446, 2023
[88] Liu Xiaogeng, Xu Nan, Chen Muhao, et al. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models[J]. arXiv preprint, arXiv: 2310.04451, 2023
[89] Zhang Mi, Pan Xudong, Yang Min. JADE: A linguistics-based safety evaluation platform for large language models[J]. arXiv preprint, arXiv: 2311.00286, 2023
[90] Zhou Chunting, Liu Pengfei, Xu Puxin, et al. LIMA: Less is more for alignment[J]. arXiv preprint, arXiv: 2305.11206, 2023
[91] Marchant A, Hawton K, Stewart A, et al. A systematic review of the relationship between internet use, self-harm and suicidal behaviour in young people: The good, the bad and the unknown[J]. PLOS ONE, 2017, 12(8): 1−26
[92] Sobkowicz P, Sobkowicz A. Dynamics of hate based Internet user networks[J]. The European Physical Journal B, 2010, 73(4): 633−643 doi: 10.1140/epjb/e2010-00039-0
[93] Boxell L, Gentzkow M, Shapiro J M. Is the Internet causing political polarization? Evidence from demographics: NBER Working Paper 23258[R]. Cambridge, MA: National Bureau of Economic Research, 2017
[94] Akyürek E, Bolukbasi T, Liu F, et al. Towards tracing knowledge in language models back to the training data[C]//Findings of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 2429−2446
[95] Gardent C, Shimorina A, Narayan S, et al. Creating training corpora for NLG micro-planners[C]//Proc of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2017: 179−188
[96] Wang Hongmin. Revisiting challenges in data-to-text generation with fact grounding[C]//Proc of the 12th Int Conf on Natural Language Generation. Stroudsburg, PA: ACL, 2019: 311−322
[97] Parikh A, Wang Xuezhi, Gehrmann S, et al. ToTTo: A controlled table-to-text generation dataset[C]//Proc of the 2020 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 1173−1186
[98] Deng Jiawen, Sun Hao, Zhang Zhexin, et al. Recent advances towards safe, responsible, and moral dialogue systems: A survey[J]. arXiv preprint, arXiv: 2302.09270, 2023
[99] Dinan E, Humeau S, Chintagunta B, et al. Build it break it fix it for dialogue safety: Robustness from adversarial human attack[C]//Proc of the 2019 Conf on Empirical Methods in Natural Language Processing and the 9th Int Joint Conf on Natural Language Processing. Stroudsburg, PA: ACL, 2019: 4537−4546
[100] Penedo G, Malartic Q, Hesslow D, et al. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only[J]. arXiv preprint, arXiv: 2306.01116, 2023
[101] Wang Yida, Ke Pei, Zheng Yinhe, et al. A large-scale Chinese short-text conversation dataset[C]//Proc of the 9th CCF Int Conf on Natural Language Processing and Chinese Computing. Berlin: Springer, 2020: 91−103
[102] Gu Yuxian, Wen Jiaxin, Sun Hao, et al. EVA2.0: Investigating open-domain Chinese dialogue systems with large-scale pre-training[J]. Machine Intelligence Research, 2023, 20: 207−219 doi: 10.1007/s11633-022-1387-3
[103] Roller S, Dinan E, Goyal N, et al. Recipes for building an open-domain Chatbot[C]//Proc of the 16th Conf of the European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2021: 300−325
[104] Baumgartner J, Zannettou S, Keegan B, et al. The Pushshift Reddit dataset[J]. arXiv preprint, arXiv: 2001.08435, 2020
[105] Chung H W, Hou Le, Longpre S, et al. Scaling instruction-finetuned language models[J]. arXiv preprint, arXiv: 2210.11416, 2022
[106] Taori R, Gulrajani I, Zhang Tianyi, et al. Stanford Alpaca: An instruction-following LLaMA model[EB/OL]. [2023-11-24]. https://github.com/tatsu-lab/stanford_alpaca.
[107] Ji Jiaming, Liu Mickel, Dai Juntao, et al. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset[J]. arXiv preprint, arXiv: 2307.04657, 2023
[108] Deng Yue, Zhang Wenxuan, Pan S J, et al. Multilingual jailbreak challenges in large language models[J]. arXiv preprint, arXiv: 2310.06474, 2023
[109] Wang Zezhong, Yang Fangkai, Wang Lu, et al. Self-Guard: Empower the LLM to safeguard itself[J]. arXiv preprint, arXiv: 2310.15851, 2023
[110] Zhang Zhexin, Yang Junxiao, Ke Pei, et al. Defending large language models against Jailbreaking attacks through goal prioritization[J]. arXiv preprint, arXiv: 2311.09096, 2023
[111] Xie Yueqi, Yi Jingwei, Shao Jiawei, et al. Defending ChatGPT against jailbreak attack via self-reminders[J]. Nature Machine Intelligence, 2023, 5(12): 1486−1496
[112] Perez F, Ribeiro I. Ignore previous prompt: Attack techniques for language models[J]. arXiv preprint, arXiv: 2211.09527, 2022
[113] Li Yuhui, Wei Fangyun, Zhao Jinjing, et al. RAIN: Your language models can align themselves without finetuning[J]. arXiv preprint, arXiv: 2309.07124, 2023
[114] Zhang Yuqi, Ding Liang, Zhang Lefei, et al. Intention analysis prompting makes large language models a good Jailbreak defender[J]. arXiv preprint, arXiv: 2401.06561, 2024
[115] Jigsaw. Perspective API[EB/OL]. [2023-11-24]. https://www.perspectiveapi.com/
[116] Markov T, Zhang Chong, Agarwal S, et al. A holistic approach to undesired content detection in the real world[C]//Proc of the AAAI Conf on Artificial Intelligence. Menlo Park, CA: AAAI, 2023, 37(12): 15009−15018
[117] Kumar A, Agarwal C, Srinivas S, et al. Certifying LLM safety against adversarial prompting[J]. arXiv preprint, arXiv: 2309.02705, 2023
[118] Cao Bochuan, Cao Yuanpu, Lin Lu, et al. Defending against alignment-breaking attacks via robustly aligned LLM[J]. arXiv preprint, arXiv: 2309.14348, 2023
[119] Meng Dongyu, Chen Hao. MagNet: A two-pronged defense against adversarial examples[C]//Proc of the 2017 ACM SIGSAC Conf on Computer and Communications Security. New York: ACM, 2017: 135−147
[120] Robey A, Wong E, Hassani H, et al. SmoothLLM: Defending large language models against jailbreaking attacks[J]. arXiv preprint, arXiv: 2310.03684, 2023
[121] Zhu Deyao, Chen Jun, Shen Xiaoqian, et al. MiniGPT-4: Enhancing vision-language understanding with advanced large language models[J]. arXiv preprint, arXiv: 2304.10592, 2023
[122] Liu Haotian, Li Chunyuan, Wu Qingyang, et al. Visual instruction tuning[J]. arXiv preprint, arXiv: 2304.08485, 2023
[123] Wu Jian, Gaur Yashesh, Chen Zhuo, et al. On decoder-only architecture for speech-to-text and large language model integration[C]//Proc of 2023 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway, NJ: IEEE, 2023: 1−8
[124] Maaz M, Rasheed H, Khan S, et al. Video-ChatGPT: Towards detailed video understanding via large vision and language models[J]. arXiv preprint, arXiv: 2306.05424, 2023
[125] Sinitsin A, Plokhotnyuk V, Pyrkin D V, et al. Editable neural networks[C]//Proc of the 8th Int Conf on Learning Representations. Amherst, MA: OpenReview.net, 2020: 1−12
[126] Lee N, Ping Wei, Xu Peng, et al. Factuality enhanced language models for open-ended text generation[C]//Advances in Neural Information Processing Systems. New York: Curran Associates, 2022: 34586−34599
[127] Zhu Chen, Rawat A S, Zaheer M, et al. Modifying memories in transformer models[J]. arXiv preprint, arXiv: 2012.00363, 2020
[128] Mitchell E, Lin C, Bosselut A, et al. Fast model editing at scale[C]//Proc of the 10th Int Conf on Learning Representations. Amherst, MA: OpenReview.net, 2022: 1−21
[129] Meng K, Bau D, Andonian A, et al. Locating and editing factual associations in GPT[J]. Advances in Neural Information Processing Systems, 2022, 35: 17359−17372
[130] Pinter Y, Elhadad M. Emptying the ocean with a spoon: Should we edit models?[C]//Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg, PA: ACL, 2023: 15164−15172
[131] Zou A, Phan L, Chen S, et al. Representation engineering: A top-down approach to AI transparency[J]. arXiv preprint, arXiv: 2310.01405, 2023
[132] Li Tianlong, Zheng Xiaoqing, Huang Xuanjing. Open the Pandora’s Box of LLMs: Jailbreaking LLMs through representation engineering[J]. arXiv preprint, arXiv: 2401.06824, 2024
[133] Huang Changran. The intelligent agent NLP-based customer service system[C]// Proc of 2021 2nd Int Conf on Artificial Intelligence in Electronics Engineering. New York: ACM, 2021: 41−50
[134] Du Yilun, Li Shuang, Torralba A, et al. Improving factuality and reasoning in language models through multiagent debate[J]. arXiv preprint, arXiv: 2305.14325, 2023
[135] Sadasivan V S, Kumar A, Balasubramanian S, et al. Can AI-generated text be reliably detected?[J]. arXiv preprint, arXiv: 2303.11156, 2023
[136] Glukhov D, Shumailov I, Gal Y, et al. LLM censorship: A machine learning challenge or a computer security problem?[J]. arXiv preprint, arXiv: 2307.10719, 2023
[137] Brcic M, Yampolskiy R V. Impossibility results in AI: A survey[J]. ACM Computing Surveys, 2024, 56(1): 8: 1−8: 24