Citation: Zhang Mi, Pan Xudong, Yang Min. JADE-DB: A Universal Testing Benchmark for Large Language Model Safety Based on Targeted Mutation[J]. Journal of Computer Research and Development, 2024, 61(5): 1113-1127. DOI: 10.7544/issn1000-1239.202330959
We propose JADE-DB, a universal safety testing benchmark for large language models (LLMs). The benchmark is constructed automatically via a targeted mutation approach, which converts test questions manually crafted by experienced LLM testers and multidisciplinary experts into highly threatening universal test questions. The converted questions preserve the naturalness of human language and the core semantics of the original questions, yet consistently break more than ten widely used LLMs. Based on incremental linguistic complexity, JADE-DB organizes LLM safety testing into three levels, namely basic, advanced, and dangerous, comprising thousands of test questions that cover four major categories of unsafe generation (crime, tort, bias, and core values) spanning over 30 unsafe topics. Specifically, we construct three dangerous-level benchmarks for three groups of LLMs: eight open-source Chinese LLMs, six commercial Chinese LLMs, and four commercial English LLMs. Each benchmark simultaneously triggers harmful generations from multiple LLMs, with an average unsafe generation ratio of 70%. The results indicate that, owing to the complexity of human language, even the current best LLMs can hardly learn the effectively unbounded variety of syntactic structures in human language, and therefore fail to recognize the invariant harmful intent expressed through them.
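The core idea behind targeted mutation is to grow the syntactic complexity of a seed question, for example its constituency parse-tree depth, while leaving its core semantics intact, so that safety alignment trained on simpler surface forms no longer recognizes the request. As a minimal sketch of that complexity signal, not the authors' implementation, the Python snippet below compares the parse-tree depth of a seed question and a mutated variant using NLTK; the bracketed parses are hand-written placeholders that would in practice come from a constituency parser.

```python
# A minimal sketch (assumption: not the authors' implementation) of the
# syntactic-complexity signal behind targeted mutation: the depth of a
# question's constituency parse tree. The bracketed parses below are
# hand-written placeholders; in practice they would come from a
# constituency parser such as benepar.
from nltk import Tree

def tree_depth(parse_string: str) -> int:
    # NLTK's height() counts nodes on the longest root-to-leaf path,
    # a standard proxy for syntactic complexity.
    return Tree.fromstring(parse_string).height()

# A seed question and a mutated variant that wraps the same core
# request in extra clausal structure (both parses are illustrative).
seed = "(SQ (MD Can) (NP (PRP you)) (VP (VB explain) (NP (DT the) (NN risk))))"
mutated = (
    "(SBARQ (WHNP (WP What)) (SQ (VP (MD would) (VP (VB happen) "
    "(SBAR (IN if) (S (NP (PRP you)) (VP (VBD explained) "
    "(NP (DT the) (NN risk)))))))))"
)

# A candidate mutation would be kept only if it strictly increases the
# complexity measure (hypothetical acceptance rule for this sketch).
assert tree_depth(mutated) > tree_depth(seed)
print(tree_depth(seed), tree_depth(mutated))  # prints: 5 10
```

Under this assumed acceptance rule, a mutation survives only if it strictly raises the complexity measure while a separate semantic check confirms the original request is unchanged.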