Abstract:
The model hijacking attack is a novel attack in which specific words are implanted to covertly repurpose a model, making it perform tasks different from its original purpose, increasing training costs, and exposing the model owner to legal risks. While this attack has recently been studied for German-English models, it remains unexplored in the Chinese natural language processing (NLP) field. Compared with other languages, the unique characteristics of Chinese pose distinct security challenges, so existing attack methods designed for German-English models are not directly applicable to Chinese models. Nevertheless, the risks posed by this attack can still be exploited by adversaries, leaving Chinese models under threat. It is therefore crucial to develop an attack evaluation method for Chinese models. Based on these considerations, this paper proposes Cheater, a model hijacking attack method tailored to Chinese-English NLP tasks, to evaluate the security of Chinese models. To hijack the target model, Cheater first uses a public translation model to camouflage the hijacking data, generating a transitional dataset. It then embeds Chinese logical words into the transitional dataset to produce malicious data, which is used to hijack the target model. Experiments on the BART-large model show that Cheater achieves an attack success rate of 90.2% at a data contamination rate of only 0.5%.