Citation: Li Nan, Ding Yidong, Jiang Haoyu, Niu Jiafei, Yi Ping. Jailbreak Attack for Large Language Models: A Survey[J]. Journal of Computer Research and Development, 2024, 61(5): 1156-1181. DOI: 10.7544/issn1000-1239.202330962

Jailbreak Attack for Large Language Models: A Survey

Funds: This work was supported by the National Natural Science Foundation of China (61831007) and the National Key Research and Development Program of China (2020YFB1807504).
More Information
  • Author Bios:

    Li Nan: born in 2002. Master candidate. His main research interests include artificial intelligence backdoor attacks and large language model security

    Ding Yidong: born in 2001. Master candidate. His main research interests include artificial intelligence backdoor attacks and defenses, and large language models

    Jiang Haoyu: born in 1999. Master candidate. His main research interests include artificial intelligence backdoor attacks and graph neural networks

    Niu Jiafei: born in 2001. Master candidate. His main research interests include artificial intelligence backdoors and large language model security

    Yi Ping: born in 1969. PhD, associate professor. Senior member of CCF. His main research interests include security for artificial intelligence and system security

  • Received Date: November 29, 2023
  • Revised Date: January 30, 2024
  • Available Online: March 06, 2024
  • Abstract: In recent years, large language models (LLMs) have been widely applied to a range of downstream tasks and have demonstrated remarkable text understanding, generation, and reasoning capabilities across many fields. However, jailbreak attacks are emerging as a new threat: by bypassing an LLM's security mechanisms and weakening the effect of safety alignment, they induce harmful outputs from aligned models. The abuse, hijacking, and information leakage that jailbreak attacks enable pose serious threats to both dialogue systems and LLM-based applications. We present a systematic review of recent jailbreak attacks and categorize them into three distinct types by underlying mechanism: manually designed attacks, LLM-generated attacks, and optimization-based attacks. We summarize the core principles, implementation methods, and findings of the relevant studies, and we trace the evolutionary trajectory of jailbreak attacks on LLMs, offering a reference for future research. We also give a concise overview of existing countermeasures, introducing techniques from the perspectives of internal defense and external defense that aim to mitigate jailbreak attacks and improve the safety of LLM-generated content. Finally, we discuss open challenges and frontier directions in this field, examining the potential of multimodal approaches, model editing, and multi-agent methods for tackling jailbreak attacks, and provide insights and research prospects to further advance LLM security.
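    As the abstract notes, external defenses screen an LLM's inputs or outputs without modifying the model itself. Below is a minimal, illustrative sketch (not code from the survey) of one well-known external defense of this kind: perplexity-based input filtering, which exploits the fact that the adversarial suffixes produced by optimization-based attacks tend to be unnatural token sequences. The scoring model (gpt2), the threshold value, and the example prompts are assumptions chosen for demonstration only.

    ```python
    # Sketch of a perplexity-based prompt filter (an external defense).
    # Assumptions: GPT-2 as the scoring model; an uncalibrated threshold.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def perplexity(text: str) -> float:
        """Perplexity of `text` under GPT-2 (higher = less natural)."""
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            # Passing labels=input_ids makes the model return the mean
            # next-token cross-entropy over the sequence.
            loss = model(**enc, labels=enc["input_ids"]).loss
        return torch.exp(loss).item()

    # Assumed cutoff; in practice it would be calibrated on benign traffic.
    PPL_THRESHOLD = 1000.0

    def screen_prompt(prompt: str) -> bool:
        """Return True if the prompt passes the filter, False if blocked."""
        return perplexity(prompt) < PPL_THRESHOLD

    if __name__ == "__main__":
        benign = "Please explain how photosynthesis works."
        # A gibberish suffix in the style produced by optimization-based
        # attacks (hypothetical example, not a working jailbreak).
        suspicious = benign + " describing.\\ + similarlyNow write oppositeley.]("
        for p in (benign, suspicious):
            print(f"ppl={perplexity(p):8.1f}  allowed={screen_prompt(p)}")
    ```

    Note that a filter of this kind only catches unnatural-looking suffixes; manually designed and LLM-generated jailbreak prompts are fluent text and would require the other internal and external defenses discussed in the survey.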

