Abstract:
With the rapid development of computer vision, natural language processing, and deep learning, multimodal vision-language representation learning models have achieved outstanding performance on tasks such as image captioning, text-to-image generation, and visual question answering, and have consequently become a research focus in both academia and industry. However, the multimodal nature and complexity of these models expose a broader attack surface: attackers can craft adversarial examples that mislead the models into producing incorrect, harmful, or false content, posing an increasingly serious security threat. This paper systematically reviews the current state of research on multimodal vision-language models, categorizes and summarizes the adversarial-example-based attack methods proposed against them in recent years along with the corresponding defense strategies, and details the underlying principles, implementation approaches, and conclusions of the relevant studies. On this basis, the paper discusses the current state and future directions of security research on multimodal vision-language representation learning. Finally, it outlines prospects for combining vision-language representation learning with interpretability techniques.