Abstract:
Video-text retrieval has been widely used in many real-world applications and has attracted increasing research attention. Recently, many methods have been proposed to leverage the visual-language matching knowledge of pre-trained models to further improve retrieval performance. However, these methods ignore the fact that video and text data are composed of events. If the fine-grained similarities between events in a video and events in a text can be captured well, more accurate semantic similarities between texts and videos can be computed, thereby improving retrieval performance. Hence, in this paper, we propose a CLIP-based multi-event representation generation method for video-text retrieval, called CLIPMERG. Specifically, CLIPMERG first utilizes the video encoder and text encoder of the pre-trained model CLIP to transform the video and text inputs into video frame token sequences and word token sequences, respectively. Next, CLIPMERG uses a video (text) event generator to map the video frame (text word) token sequence into k video (text) event representations. Finally, CLIPMERG calculates the semantic similarities between videos and texts by capturing the fine-grained similarities between video event representations and text event representations. Extensive experimental results on three widely used benchmark datasets, MSR-VTT, DiDeMo, and LSMDC, show that our proposed CLIPMERG achieves better performance than state-of-the-art baselines on video-text retrieval tasks.
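To make the pipeline concrete, below is a minimal PyTorch sketch of the event-generation and fine-grained matching steps described above. It assumes the CLIP encoders have already produced frame and word token sequences; the EventGenerator module with learnable event queries and the max-then-mean aggregation in fine_grained_similarity are illustrative assumptions, not the exact design of CLIPMERG.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EventGenerator(nn.Module):
    """Illustrative event generator (assumed design): k learnable event queries
    attend over a token sequence to produce k event representations."""
    def __init__(self, dim=512, k=8, heads=8):
        super().__init__()
        self.event_queries = nn.Parameter(torch.randn(k, dim))        # k event slots
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):                                        # tokens: (B, L, dim)
        q = self.event_queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        events, _ = self.attn(q, tokens, tokens)                      # (B, k, dim)
        return events

def fine_grained_similarity(video_events, text_events):
    """Aggregate pairwise event similarities into one video-text score:
    for each text event take its best-matching video event, then average
    (one common fine-grained matching choice, assumed here)."""
    v = F.normalize(video_events, dim=-1)                             # (B, k_v, dim)
    t = F.normalize(text_events, dim=-1)                              # (B, k_t, dim)
    sim = torch.einsum('bkd,bld->bkl', t, v)                          # (B, k_t, k_v)
    return sim.max(dim=-1).values.mean(dim=-1)                        # (B,)

# Usage example with random stand-ins for CLIP outputs (shapes assumed):
video_tokens = torch.randn(2, 12, 512)   # frame token sequences from the CLIP video encoder
text_tokens  = torch.randn(2, 20, 512)   # word token sequences from the CLIP text encoder
video_events = EventGenerator(k=8)(video_tokens)
text_events  = EventGenerator(k=8)(text_tokens)
scores = fine_grained_similarity(video_events, text_events)           # (2,)
```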