Survey of Multimodal Vision-Language Representation Learning Models and Their Adversarial Examples Attack and Defense Techniques

Zeng Cheng; Ge Yunjie; Zhao Lingchen; Wang Qian

doi:10.7544/issn1000-1239.202550410

Zeng Cheng, Ge Yunjie, Zhao Lingchen, Wang Qian. Survey of Multimodal Vision-Language Representation Learning Models and Their Adversarial Examples Attack and Defense Techniques[J]. Journal of Computer Research and Development, 2025, 62(9): 2208-2232. DOI: 10.7544/issn1000-1239.202550410

Abstract

With the rapid development of computer vision, natural language processing, and deep learning, multimodal vision-language representation learning models have achieved strong performance on tasks such as image captioning, text-to-image generation, and visual question answering, and have become a research focus in both academia and industry. However, the multimodal nature and complexity of these models give attackers a wider range of attack vectors: by crafting adversarial examples, an attacker can mislead a model into producing incorrect, harmful, or false content, which poses an increasingly serious security threat. This paper systematically reviews the state of research on multimodal vision-language models. It categorizes and summarizes the adversarial-example attack methods proposed against these models in recent years, together with the corresponding defense strategies, and outlines the fundamental principles, implementation approaches, and conclusions of the relevant studies. On this basis, the paper discusses the current state and future directions of security research in multimodal vision-language representation learning. Finally, it considers the prospects of combining vision-language representation learning with interpretability techniques.
