Safety Analysis of Large Model Content Generation Based on Knowledge Editing

王梦如; 姚云志; 习泽坤; 张锦添; 王鹏; 徐子文; 张宁豫

doi:10.7544/issn1000-1239.202330965

Abstract

Although large language models (LLMs) have achieved remarkable success, they still face safety problems in practical applications and can easily be induced by malicious prompts to generate toxic and harmful content. Existing methods for mitigating the unsafe behavior of LLMs typically require costly data collection and substantial computational resources. Knowledge editing can precisely change a model's output for specific inputs without retraining, constraining the model's behavior while saving substantial resources, and thus offers a new, feasible approach to optimizing large models to generate safe content. However, the research community currently lacks a systematic and comprehensive dataset for analyzing safe content generation with knowledge editing. Specifically, existing datasets for mitigating the unsafe behavior of LLMs do not cover all unsafe scenarios, and their toxic questions can rarely bypass the safety defenses of aligned LLMs, so they cannot be used to mitigate the unsafe behavior that remains after alignment. To address these problems, we design a new dataset, SafeGen, and propose a new evaluation framework to analyze the potential of knowledge editing for optimizing LLMs to generate safe content. Extensive experiments show that knowledge editing can strengthen the internal safety beliefs of LLMs and holds broad promise for correcting their unsafe behavior. However, the fluency of text generated by edited LLMs falls short of expectations, which indicates the inherent difficulty of this task. We hope this work provides insights for the large model safety community.
