Abstract:
As visual features closer to the text domain, the grid features extracted by the CLIP (Contrastive Language-Image Pre-training) image encoder can be readily mapped to corresponding natural-language semantics, which alleviates the semantic gap problem and makes them a promising source of visual features for image captioning. However, this approach does not account for the fact that partitioning an image into grids may split a complete object across several cells. Such segmentation inevitably prevents the extracted features from fully expressing object information and, in turn, prevents the generated sentence from accurately describing objects and the relationships between them. To address this issue with CLIP grid features, we propose the Dual Semantic Collaborative Network (DSC-Net) for image captioning. Specifically, a dual semantic collaborative self-attention (DSCS) module is first proposed to enhance the expression of object information in CLIP grid features. A dual semantic collaborative cross-attention (DSCC) module is then proposed to integrate grid-level and object-level semantic information into visual features that are used to predict sentences. Finally, a dual semantic fusion (DSF) module is proposed to supply region-oriented fused features to the two collaboration modules above and to resolve the correlation conflicts that may arise during semantic collaboration. In extensive experiments on the COCO dataset, the proposed model achieves a CIDEr score of 138.5% on the offline Karpathy test split and 137.6% on the official online test server, which compares favorably with current mainstream image captioning methods.
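To make the idea of collaboration between grid-level and object-level semantics concrete, the following is a minimal sketch of how CLIP grid features, region features, and caption-token queries could be combined with self-attention and cross-attention. It is illustrative only: the class name `DualSemanticCollaborationSketch`, the gated fusion, the feature dimensions, and the tensor shapes are all assumptions, not the paper's actual DSCS/DSCC/DSF implementation.

```python
# Minimal illustrative sketch (not the paper's implementation) of combining
# CLIP grid features with object/region features via attention. All names,
# dimensions, and fusion choices are assumptions for illustration.
import torch
import torch.nn as nn


class DualSemanticCollaborationSketch(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Self-attention over region-conditioned grid features
        # (loosely analogous to DSCS enhancing object information).
        self.grid_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention from caption-token queries to enhanced visual features
        # (loosely analogous to DSCC integrating grid and object semantics).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Simple gated fusion of grid and pooled region features
        # (a stand-in for region-oriented DSF features).
        self.fuse_gate = nn.Linear(2 * dim, dim)

    def forward(self, grid_feats, region_feats, token_queries):
        # grid_feats:    (B, N_grid, dim)  e.g. 7x7 CLIP grid features, flattened
        # region_feats:  (B, N_reg, dim)   object/region-level features
        # token_queries: (B, T, dim)       decoder-side caption token states
        pooled_regions = region_feats.mean(dim=1, keepdim=True)            # (B, 1, dim)
        gate = torch.sigmoid(self.fuse_gate(
            torch.cat([grid_feats, pooled_regions.expand_as(grid_feats)], dim=-1)))
        fused = gate * grid_feats                                          # (B, N_grid, dim)
        enhanced, _ = self.grid_self_attn(fused, fused, fused)             # object-aware grids
        visual, _ = self.cross_attn(token_queries, enhanced, enhanced)     # (B, T, dim)
        return visual


# Shape check with random tensors.
model = DualSemanticCollaborationSketch()
grid = torch.randn(2, 49, 512)
regions = torch.randn(2, 10, 512)
tokens = torch.randn(2, 20, 512)
print(model(grid, regions, tokens).shape)  # torch.Size([2, 20, 512])
```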