Survey of Adversarial Attacks and Defenses for Large Language Models

台建玮; 杨双宁; 王佳佳; 李亚凯; 刘奇旭; 贾晓启

doi:10.7544/issn1000-1239.202440630

Abstract

With the rapid development of natural language processing and deep learning, large language models (LLMs) are increasingly applied to text processing, language understanding, image generation, and code auditing, and have become a research focus shared by academia and industry. However, attackers can use adversarial attacks to steer LLMs into producing erroneous, unethical, or false content, so the security threats facing these models and their wide-ranging applications are growing more severe. This paper reviews recent adversarial attack methods and defense strategies for LLMs, summarizing the fundamental principles, implementation techniques, and main findings of the relevant studies. Building on this, it gives an in-depth technical discussion of four mainstream attack modes: prompt injection attacks, indirect prompt injection attacks, jailbreak attacks, and backdoor attacks. Each is analyzed in terms of its mechanisms, impacts, and potential risks. Finally, the paper discusses the current state and future directions of LLM security research, and considers the prospects for applying LLMs together with multimodal data analysis and integration techniques. This review aims to enhance understanding of the field and foster more secure, reliable applications of LLMs.
