Abstract:
With the rapid development of natural language processing and deep learning technologies, large language models have been increasingly applied in fields such as text processing, language understanding, image generation, and code auditing, and have become a research focus shared by academia and industry. However, adversarial attack methods allow attackers to manipulate large language models into generating erroneous, unethical, or false content, posing increasingly severe security threats to these models and their wide-ranging applications. This paper systematically reviews recent advances in adversarial attack methods and defense strategies for large language models, providing a detailed summary of the fundamental principles, implementation techniques, and major findings of the relevant studies. Building on this foundation, the paper presents a technical analysis of four mainstream attack modes: prompt injection attacks, indirect prompt injection attacks, jailbreak attacks, and backdoor attacks, each examined in terms of its mechanisms, impacts, and potential risks. The paper further discusses the current state and future directions of research on large language model security, and offers an outlook on the application prospects of combining large language models with multimodal data analysis and integration technologies. This review aims to deepen understanding of the field and to foster more secure and reliable applications of large language models.