Citation: Jiang Yi, Yang Yong, Yin Jiali, Liu Xiaolei, Li Jiliang, Wang Wei, Tian Youliang, Wu Yingcai, Ji Shouling. A Survey on Security and Privacy Risks in Large Language Models[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440265
In recent years, Large Language Models (LLMs) have emerged as a critical branch of deep learning, achieving a series of breakthrough results in Natural Language Processing (NLP) and gaining widespread adoption. However, throughout their lifecycle, including pre-training, fine-tuning, and deployment, a variety of security threats and privacy risks have been uncovered, drawing increasing attention from both academia and industry. Following the evolution of the paradigms for applying large language models to natural language processing tasks, namely the pre-training and fine-tuning paradigm, the pre-training and prompt learning paradigm, and the pre-training and instruction tuning paradigm, this survey first outlines conventional security threats against large language models, covering representative studies on the three classes of traditional adversarial attacks (adversarial example attacks, backdoor attacks, and poisoning attacks). It then summarizes novel security threats revealed by recent research, and concludes with a discussion of the privacy risks of large language models and the progress of research on them. This survey aims to help researchers and deployers of large language models identify, prevent, and mitigate these threats and risks during model design, training, and application, while striking a balance among model performance, security, and privacy protection.