Citation: Wang Mengru, Yao Yunzhi, Xi Zekun, Zhang Jintian, Wang Peng, Xu Ziwen, Zhang Ningyu. Safety Analysis of Large Model Content Generation Based on Knowledge Editing[J]. Journal of Computer Research and Development, 2024, 61(5): 1143-1155. DOI: 10.7544/issn1000-1239.202330965

Safety Analysis of Large Model Content Generation Based on Knowledge Editing

More Information
  • Author Bio:

    Wang Mengru: born in 1996. PhD candidate. Member of CCF. Her main research interests include natural language processing, knowledge graph, and the safety of large language models

    Yao Yunzhi: born in 2000. PhD candidate. Member of CCF. His main research interests include natural language processing, knowledge graph, and machine learning

    Xi Zekun: born in 2001. Master. His main research interests include natural language processing and model editing

    Zhang Jintian: born in 2001. Master candidate. His main research interest is natural language processing

    Wang Peng: born in 2001. Master. His main research interests include natural language processing, enhancement of large language models, and knowledge editing

    Xu Ziwen: born in 2002. Undergraduate. His main research interest is natural language processing

    Zhang Ningyu: born in 1989. PhD, associate professor. Senior member of CCF. His main research interests include knowledge graph and natural language processing

  • Received Date: November 30, 2023
  • Revised Date: February 21, 2024
  • Available Online: March 13, 2024
  • Abstract: Although large language models (LLMs) have achieved remarkable success, they still face security problems in practical applications and can easily be induced to generate toxic and harmful content under malicious prompting. Existing methods for mitigating the unsafe behavior of LLMs often demand significant computational resources and incur high costs for collecting safety data. Knowledge editing offers a novel way to precisely constrain a model's behavior on specific inputs without retraining, saving substantial resources and providing a feasible new avenue for steering large models toward safe content generation. Nevertheless, existing datasets for mitigating the unsafe behavior of LLMs do not cover all unsafe scenarios, and their toxic inputs can rarely break through the safety defenses of aligned LLMs, which hinders further safety optimization of aligned models. In light of these challenges, we introduce a new dataset called SafeGen and propose a novel evaluation framework to analyze the potential of knowledge editing for making LLM content generation safer. Extensive experiments reveal that knowledge editing is broadly applicable for rectifying unsafe behaviors exhibited by LLMs, and that editing model parameters can strengthen the internal safety beliefs of LLMs. However, the fluency of text generated after knowledge editing falls short of expectations, indicating the inherent difficulty of the task. We hope that our work provides insights for the large model security community.

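To make the core idea of the abstract concrete for readers outside this subfield, the sketch below shows a toy, locate-then-edit-style parameter update in the spirit of knowledge-editing methods such as ROME and MEMIT: only one MLP projection of a small open model is tuned so that one specific unsafe prompt is answered with a refusal, while the rest of the network stays frozen. The model name, layer index, prompt, and target response are all placeholder assumptions chosen for illustration; this is not the SafeGen dataset or the exact editing procedure evaluated in the paper.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder choices (assumptions, not from the paper): a small public model and a
    # hand-picked layer; real locate-then-edit methods select the layer via causal tracing.
    model_name = "gpt2"
    edit_layer = 6

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    unsafe_prompt = "Question: How can I insult my coworker? Answer:"  # hypothetical unsafe query
    safe_response = " I cannot help with that request."                # desired safe continuation

    # Freeze everything except one MLP output projection, so the edit stays local to a
    # single module instead of retraining the whole model.
    for p in model.parameters():
        p.requires_grad_(False)
    target_module = model.transformer.h[edit_layer].mlp.c_proj
    for p in target_module.parameters():
        p.requires_grad_(True)

    optimizer = torch.optim.Adam(target_module.parameters(), lr=5e-4)

    # Supervise only the tokens of the safe response; the prompt tokens are masked out.
    prompt_ids = tok(unsafe_prompt, return_tensors="pt").input_ids
    full_ids = tok(unsafe_prompt + safe_response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    model.train()
    for _ in range(20):                  # a handful of steps suffices for a single edit
        loss = model(input_ids=full_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    model.eval()
    with torch.no_grad():
        out = model.generate(prompt_ids, max_new_tokens=20, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0][prompt_ids.shape[1]:], skip_special_tokens=True))

In practice, methods such as ROME compute the update analytically (a rank-one change derived from key-value statistics) rather than by gradient steps, and a full evaluation must also check whether the edit generalizes to related unsafe prompts and preserves fluency on unrelated text, which is where the abstract notes the main remaining limitation.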