Jiang Yi, Yang Yong, Yin Jiali, Liu Xiaolei, Li Jiliang, Wang Wei, Tian Youliang, Wu Yingcai, Ji Shouling. A Survey on Security and Privacy Risks in Large Language Models[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440265

A Survey on Security and Privacy Risks in Large Language Models

Funds: This work was supported by the National Key Research and Development Program of China (2022YFB3102100) and the National Natural Science Foundation of China (U244120033, U24A20336).
  • Author Bio:

    Jiang Yi: born in 1982. PhD candidate. Member of CCF. His main research interests include data-driven security and privacy, and AI security

    Yang Yong: born in 1996. PhD candidate. Member of CCF. His main research interests include AI security and privacy

    Yin Jiali: born in 1993. PhD, professor, PhD supervisor. Member of CCF. Her main research interests include computational photography, interpretability of neural networks, and adversarial training

    Liu Xiaolei: born in 1992. PhD, associate professor. Member of CCF. His current research interests include equipment information security and AI security

    Li Jiliang: born in 1989. PhD, professor, PhD supervisor. Distinguished member of CCF. His current research interests include AI security and large language model security

    Wang Wei: born in 1976. PhD, professor, PhD supervisor. His current research interests include cyberspace and system security, blockchain, and privacy preservation

    Tian Youliang: born in 1982. PhD, professor, PhD supervisor. Senior member of CCF. His current research interests include cryptography and security protocols, big data security and privacy protection

    Wu Yingcai: born in 1983. PhD, professor, PhD supervisor. Senior member of CCF. His current research interests include visual analytics

    Ji Shouling: born in 1986. PhD, professor, PhD supervisor. Senior member of CCF. His current research interests include data-driven security and privacy, AI security and big data mining and analytics

  • Received Date: April 17, 2024
  • Revised Date: February 16, 2025
  • Accepted Date: March 02, 2025
  • Available Online: March 02, 2025
  • In recent years, Large Language Models (LLMs) have emerged as a critical branch of deep learning technology, achieving a series of breakthroughs in Natural Language Processing (NLP) and gaining widespread adoption. However, throughout their lifecycle, from pre-training and fine-tuning to actual deployment, a variety of security threats and privacy risks have been uncovered, drawing increasing attention from both academia and industry. Following the evolution of the paradigms for applying large language models to natural language processing tasks, namely the pre-training and fine-tuning paradigm, the pre-training and prompt-learning paradigm, and the pre-training and instruction-tuning paradigm, this article first outlines conventional security threats against large language models, focusing on representative studies of the three classes of traditional adversarial attacks (adversarial example attacks, backdoor attacks, and poisoning attacks). It then summarizes the novel security threats revealed by recent research, followed by a discussion of the privacy risks of large language models and the progress of research on them. This content helps researchers and deployers of large language models identify, prevent, and mitigate these threats and risks during model design, training, and application, while balancing model performance, security, and privacy protection.
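
    To make these threats concrete, here is a minimal, self-contained Python sketch of the intuition behind the character-level textual adversarial example attacks covered in the survey. It is an illustration only, not code from any surveyed work: the keyword-based toy_sentiment classifier and the char_swap helper are hypothetical stand-ins for a real victim model and a real attack, which would instead select the tokens to perturb using gradients (white-box) or query feedback (black-box).

    # Minimal sketch of a character-level textual adversarial example (illustrative only).
    # A toy keyword classifier stands in for a real NLP model; swapping two characters
    # in the sentiment-bearing words pushes them out of the model's vocabulary and
    # flips its prediction, while a human still reads the same meaning.

    NEGATIVE_WORDS = {"terrible", "awful", "boring", "bad"}
    POSITIVE_WORDS = {"great", "excellent", "fun", "good"}

    def toy_sentiment(text: str) -> str:
        """Hypothetical victim model: counts sentiment keywords; defaults to 'positive'."""
        tokens = [t.strip(".,!?") for t in text.lower().split()]
        neg = sum(t in NEGATIVE_WORDS for t in tokens)
        pos = sum(t in POSITIVE_WORDS for t in tokens)
        return "negative" if neg > pos else "positive"

    def char_swap(word: str) -> str:
        """Swap two adjacent middle characters, a typical character-level 'bug'."""
        if len(word) < 4:
            return word
        mid = len(word) // 2
        chars = list(word)
        chars[mid], chars[mid + 1] = chars[mid + 1], chars[mid]
        return "".join(chars)

    original = "The plot was terrible and the acting was boring."
    adversarial = original
    # A real attack would rank tokens by influence; here we simply perturb
    # the two known sentiment keywords.
    for word in ("terrible", "boring"):
        adversarial = adversarial.replace(word, char_swap(word))

    print(original, "->", toy_sentiment(original))        # negative
    print(adversarial, "->", toy_sentiment(adversarial))  # positive

    The word-level, prompt-based, and instruction-level attacks discussed in the survey follow the same pattern: a small, semantics-preserving change to the model input that steers the model toward an attacker-chosen output.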
