    Zeng Cheng, Ge Yunjie, Zhao Lingchen, Wang Qian. Survey of Multimodal Vision-Language Representation Learning Models and Their Adversarial Examples Attack and Defense Techniques[J]. Journal of Computer Research and Development, 2025, 62(9): 2208-2232. DOI: 10.7544/issn1000-1239.202550410

    Survey of Multimodal Vision-Language Representation Learning Models and Their Adversarial Examples Attack and Defense Techniques

    • With the rapid development of computer vision, natural language processing, and deep learning, multimodal vision-language representation learning models have demonstrated outstanding performance in tasks such as image captioning, text-to-image generation, and visual question answering, and have consequently become a research focus in both academia and industry. However, the multimodal nature and complexity of these models give attackers a wider range of attack vectors: by crafting adversarial examples, attackers can mislead the models into producing incorrect, harmful, or false content, which poses an increasingly serious security threat. This paper systematically reviews the current state of research on multimodal vision-language models, then categorizes and summarizes the adversarial-example attack methods proposed against these models in recent years, together with the corresponding defense strategies, detailing the fundamental principles, implementation approaches, and conclusions of the relevant studies. On this basis, it discusses the current state and future directions of security research in multimodal vision-language representation learning. Finally, it looks ahead to applying vision-language representation learning in combination with interpretability techniques.