Abstract:
The model hijacking attack is a novel attack in which specific words are implanted to covertly repurpose a model, making it perform tasks different from its original purpose, increasing training costs, and exposing the model owner to legal risks. While this attack has recently been studied for German-English models, it remains unexplored in the Chinese natural language processing (NLP) field. Compared with other languages, the unique characteristics of Chinese pose distinct security challenges, so existing attack methods designed for German-English models are not directly applicable to Chinese models. Nevertheless, the risks posed by this attack can still be exploited by adversaries, leaving Chinese models under threat. It is therefore crucial to develop an attack evaluation method for Chinese models. Based on these considerations, this paper proposes Cheater, a model hijacking attack method tailored to Chinese-English NLP tasks, to evaluate the security of Chinese models. To hijack the target model, Cheater first uses a public translation model to camouflage the hijacking data, generating a transitional dataset. It then embeds Chinese logical words into the transitional dataset to produce malicious data, which is used to hijack the target model. Experiments on the BART-large model show that Cheater achieves an attack success rate of 90.2% at a data contamination rate of only 0.5%.