Abstract:
Although large language models (LLMs) have achieved remarkable success, they still face safety problems in practical applications and can easily be induced by malicious prompts to generate toxic and harmful content. Existing methods for mitigating the unsafe behavior of LLMs often demand significant computational resources and incur high costs for collecting safety data. Knowledge editing offers a novel way to precisely constrain a model's behavior on specific inputs without retraining, saving substantial resources and providing a feasible new avenue for steering LLMs toward safe generation. Nevertheless, existing datasets for mitigating unsafe LLM behavior do not cover all unsafe scenarios. Moreover, the toxic inputs in these datasets rarely succeed against the safety defenses of post-alignment LLMs, which hinders further optimization of safety in aligned models. In light of these challenges, we introduce a new dataset, SafeGen, and propose a novel evaluation framework to analyze the potential of knowledge editing for making LLM generation safe. Extensive experiments reveal that knowledge editing is broadly applicable to rectifying unsafe behaviors exhibited by LLMs, and that editing parameters can strengthen the internal safety beliefs of LLMs. However, the fluency of text generated after knowledge editing falls short of expectations, indicating the inherent difficulty of this task. We hope our work provides insights for the LLM safety community.