Abstract:
With the rapid development of computer vision, natural language processing, and deep learning, multimodal vision-language representation learning models have achieved outstanding performance on tasks such as image captioning, text-to-image generation, and visual question answering, and have consequently become a research focus in both academia and industry. However, the multimodal nature and complexity of these models expose a broader attack surface: attackers can craft adversarial examples that mislead the models into producing incorrect, harmful, or false content, posing an increasingly serious security threat. This paper systematically reviews the current state of research on multimodal vision-language models, categorizes and summarizes the adversarial-example-based attack methods proposed against them in recent years along with the corresponding defense strategies, and details the underlying principles, implementation approaches, and conclusions of the relevant studies. On this basis, the paper discusses the current state and future directions of security research on multimodal vision-language representation learning. Finally, it outlines prospects for combining vision-language representation learning with interpretability techniques.