Abstract:
As visual features closer to the text domain, the grid features extracted by the CLIP (Contrastive Language-Image Pre-training) image encoder can be readily mapped to corresponding natural-language semantics, which alleviates the semantic gap problem and makes them a promising source of visual features for image captioning. However, this approach does not account for the fact that partitioning an image into grids may split a complete object across several cells. Such segmentation inevitably prevents the extracted features from fully expressing object information and, in turn, prevents the generated sentence from accurately describing objects and the relationships between them. To address this issue with CLIP grid features, we propose the Dual Semantic Collaborative Network (DSC-Net) for image captioning. Specifically, a dual semantic collaborative self-attention (DSCS) module is first proposed to enhance the expression of object information in CLIP grid features. A dual semantic collaborative cross-attention (DSCC) module is then proposed to integrate grid-level and object-level semantic information into visual features that are used to predict sentences. Finally, a dual semantic fusion (DSF) module is proposed to supply region-oriented fused features to the two collaboration modules above and to resolve the correlation conflicts that may arise during semantic collaboration. In extensive experiments on the COCO dataset, the proposed model achieves a CIDEr score of 138.5% on the offline Karpathy test split and 137.6% on the official online test server, which compares favorably with current mainstream image captioning methods.
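To make the idea of collaboration between grid-level and object-level semantics concrete, the following is a minimal sketch of how CLIP grid features, region features, and caption-token queries could be combined with self-attention and cross-attention. It is illustrative only: the class name `DualSemanticCollaborationSketch`, the gated fusion, the feature dimensions, and the tensor shapes are all assumptions, not the paper's actual DSCS/DSCC/DSF implementation.

```python
# Minimal illustrative sketch (not the paper's implementation) of combining
# CLIP grid features with object/region features via attention. All names,
# dimensions, and fusion choices are assumptions for illustration.
import torch
import torch.nn as nn


class DualSemanticCollaborationSketch(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Self-attention over region-conditioned grid features
        # (loosely analogous to DSCS enhancing object information).
        self.grid_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention from caption-token queries to enhanced visual features
        # (loosely analogous to DSCC integrating grid and object semantics).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Simple gated fusion of grid and pooled region features
        # (a stand-in for region-oriented DSF features).
        self.fuse_gate = nn.Linear(2 * dim, dim)

    def forward(self, grid_feats, region_feats, token_queries):
        # grid_feats:    (B, N_grid, dim)  e.g. 7x7 CLIP grid features, flattened
        # region_feats:  (B, N_reg, dim)   object/region-level features
        # token_queries: (B, T, dim)       decoder-side caption token states
        pooled_regions = region_feats.mean(dim=1, keepdim=True)            # (B, 1, dim)
        gate = torch.sigmoid(self.fuse_gate(
            torch.cat([grid_feats, pooled_regions.expand_as(grid_feats)], dim=-1)))
        fused = gate * grid_feats                                          # (B, N_grid, dim)
        enhanced, _ = self.grid_self_attn(fused, fused, fused)             # object-aware grids
        visual, _ = self.cross_attn(token_queries, enhanced, enhanced)     # (B, T, dim)
        return visual


# Shape check with random tensors.
model = DualSemanticCollaborationSketch()
grid = torch.randn(2, 49, 512)
regions = torch.randn(2, 10, 512)
tokens = torch.randn(2, 20, 512)
print(model(grid, regions, tokens).shape)  # torch.Size([2, 20, 512])
```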