Abstract:
In the deep learning community, considerable effort has been devoted to enhancing the robustness and reliability of deep neural networks (DNNs). Previous research mainly analyzed the fragility of DNNs from the perspective of adversarial attacks, and researchers have designed numerous adversarial attack and defense methods. However, with the wide application of pre-trained models (PTMs), a new security threat against DNNs, and PTMs in particular, is emerging: the backdoor attack. A backdoor attack injects hidden backdoors into a DNN such that the backdoored model behaves normally on benign inputs but produces attacker-specified malicious outputs on poisoned inputs embedded with special triggers. Backdoor attacks pose a severe threat to DNN-based systems such as spam filters and hate speech detectors. Compared with textual adversarial attack and defense, which have been widely studied, textual backdoor attack and defense have not been thoroughly investigated and require a systematic review. In this paper, we present a comprehensive survey of backdoor attack and defense methods in the text domain. Specifically, we first summarize and categorize textual backdoor attack and defense methods from different perspectives, then introduce representative work and analyze their pros and cons. We also enumerate the benchmark datasets and evaluation metrics widely adopted in the current literature. Moreover, we compare the backdoor attack with each of two related threats (i.e., adversarial attack and data poisoning). Finally, we discuss open challenges of backdoor attack and defense in the text domain and present several promising directions for this emerging and rapidly growing research area.