一种基于双重语义协作网络的图像描述方法

江泽涛; 朱文才; 金鑫; 廖培期; 黄景帆

doi:10.7544/issn1000-1239.202330523

一种基于双重语义协作网络的图像描述方法

广西图像图形智能处理重点实验室（桂林电子科技大学）　广西桂林　541004

基金项目: 国家自然科学基金项目（62172118）；广西自然科学基金重点项目（2021GXNSFDA196002）；广西图像图形智能处理重点实验项目（GIIP2302，GIIP2303，GIIP2304）；广西研究生教育创新计划项目（YCSW2022269）；桂林电子科技大学研究生教育创新计划项目（2023YCXS046）

详细信息

作者简介:
江泽涛: 1961年生. 博士，教授. 主要研究方向为图像处理、计算机视觉、人工智能

朱文才: 1999年生. 硕士. 主要研究方向为图像处理、计算机视觉

金鑫: 1998年生. 硕士. 主要研究方向为图像处理、计算机视觉

廖培期: 1995年生. 硕士. 主要研究方向为图像处理、计算机视觉

黄景帆: 1999年生. 硕士. 主要研究方向为图像处理、计算机视觉

通讯作者:
朱文才（zhuwencai00@qq.com）

中图分类号: TP391.41
计量
- 文章访问数: 126
- HTML全文浏览量: 48
- PDF下载量: 66
出版历程
- 收稿日期: 2023-06-18
- 修回日期: 2023-12-18
- 网络出版日期: 2024-05-28
- 刊出日期: 2024-10-31

An Image Captioning Method Based on DSC-Net

Guangxi Key Laboratory of Image and Graphic Intelligent Processing (Guilin University of Electronic Technology), Guilin, Guangxi 541004

Funds: This work was supported by the National Natural Science Foundation of China (62172118), the Guangxi Natural Science Key Foundation (2021GXNSFDA196002), the Project of Guangxi Key Laboratory of Image and Graphic Intelligent Processing (GIIP2302, GIIP2303, GIIP2304), the Innovation Project of Guangxi Graduate Education (YCSW2022269), and the Innovation Project of GUET Graduate Education (2023YCXS046)

More Information

Author Bio:
Jiang Zetao: born in 1961. PhD, professor. His main research interests include image processing, computer vision, and artificial intelligence

Zhu Wencai: born in 1999. Master. His main research interests include image processing and computer vision

Jin Xin: born in 1998. Master. His main research interests include image processing and computer vision

Liao Peiqi: born in 1995. Master. His main research interests include image processing and computer vision

Huang Jingfan: born in 1999. Master. His main research interests include image processing and computer vision

摘要

摘要:
CLIP（contrastive language-image pre-training）视觉编码器提取的网格特征作为一种更加靠近文本域的视觉特征，具有易转化为对应语义自然语言的特点，可以缓解语义鸿沟问题，因而未来可能成为图像描述任务中视觉特征的重要来源. 但该方法中未考虑图像内容的划分，可能使一个完整的目标被划分到若干个网格中，目标被切割势必会导致特征提取结果中缺少对目标信息的完整表达，进而导致生成的描述语句中缺少对目标及目标间关系的准确表述. 针对CLIP视觉编码器提取网格特征这一现象，提出一种基于双重语义协作网络（dual semantic collaborative network，DSC-Net）的图像描述方法. 具体来说：首先提出双重语义协作自注意力（dual semantic collaborative self-attention，DSCS）模块增强CLIP网格特征对目标信息的表达能力；接着提出双重语义协作交叉注意力（dual semantic collaborative cross-attention，DSCC）模块，综合网格和目标2个层面的语义构造与文本相关的视觉特征，进行描述语句预测；最后提出双重语义融合（dual semantic fusion，DSF）模块，为上述的2个语义协作模块提供以区域为主导的融合特征，解决在语义协作过程中可能出现的相关性冲突问题. 经过在COCO数据集上的大量实验，提出的模型在Karpathy等人划分的离线测试集上取得了138.5%的CIDEr分数，在官方在线测试中取得了137.6%的CIDEr分数，与目前主流的图像描述方法相比具有显著优势.
- 图像描述 /
- 网格特征 /
- 注意力机制 /
- 双重语义协作注意力 /
- 双重语义协作特征融合
Abstract:
As visual features closer to the text domain, the grid features extracted by the CLIP (contrastive language-image pre-training) image encoder are easy to convert into the corresponding semantic natural language, which can alleviate the semantic gap problem, so it may become an important source of visual features in the image captioning in the future. However, this method does not consider that the division of image content may cause a complete object to be divided into several grids. The segmentation of the objects will inevitably lead to the lack of a complete expression of the object information in the feature extraction results, and further lead to the lack of an accurate expression of the object and the relationship between the objects in the generated sentence. Aiming at the phenomenon of grid features extracted by CLIP image encoder, we propose dual semantic collaborative network (DSC-Net) for image captioning. Specifically, dual semantic collaborative self-attention (DSCS) module is first proposed to enhance the expression of object information by CLIP grid features. Then dual semantic collaborative cross-attention (DSCC) module is proposed to integrate semantic information between grid and object to generate visual features, and to be used to predict sentences. Finally, dual semantic fusion (DSF) module is proposed to provide region-oriented fusion features for the above two semantic cooperation modules, and to solve the problem of correlation conflicts that may arise in the process of semantic cooperation. After a large number of experiments on the COCO dataset, the proposed model achieves a CIDEr score of 138.5% on the offline test set divided by Karpathy et al., and a CIDEr score of 137.6% in the official online test. Compared with the current mainstream image captioning methods, this result has obvious advantages.
- image captioning /
- grid features /
- attention mechanism /
- dual semantic collaborative attention /
- dual semantic collaborative feature fusion

HTML全文

图 1 DSC-Net结构图

Figure 1. DSC-Net structure diagram

下载: 全尺寸图片幻灯片

图 2 DSCS结构图

Figure 2. DSCS structure diagram

下载: 全尺寸图片幻灯片

图 3 DSF结构图

Figure 3. DSF structure diagram

下载: 全尺寸图片幻灯片

图 4 DSCC结构图

Figure 4. DSCC structure diagram

下载: 全尺寸图片幻灯片

图 5 超参数对模型性能的影响

Figure 5. Impact of hyperparameter on performance of model

下载: 全尺寸图片幻灯片

图 6 DSC-Net与基线的预测结果对比

Figure 6. Comparison of prediction results between DSC-Net and baseline

下载: 全尺寸图片幻灯片

表 1 本文方法在离线测试中与各种典型方法的性能比较

Table 1 Performance Comparison of Our Method and Various Typical Methods in Offline Test %

方法	BLEU- 1	BLEU- 4	METEOR	ROUGE	CIDEr	SPICE
up-down^[5](CVPR’18)	79.8	36.3	27.7	56.9	120.1	21.4
Transformer^[17](ACL’18)	80.5	39.2	29.1	58.7	130.0	23.0
SGAE^[12](CVPR’19)	80.8	38.4	27.4	58.6	127.8	22.1
M²Transformer^[13] (CVPR’20)	80.8	39.1	29.2	58.6	131.2	22.6
X-Transformer^[8] (CVPR’20)	80.9	39.7	29.5	59.1	132.8	23.4
GET^[14](AAAI’21)	81.5	39.5	29.3	58.9	131.6	22.8
APN^[15] (ICCV’21)		39.6	29.2	59.1	131.8	23.0
RSTNet^[18](CVPR’21)	81.1	39.3	29.4	58.8	133.3	23.0
ReFormer^[19](MM’22)	82.3	39.8	29.7	59.8	131.9	23.0
GAT^[20](Expert Syst. Appl.’22)	80.8	39.7	29.1	59.0	130.5	22.9
CLIP-ViL^[9](ICLR’22)		40.2	29.7		134.2	23.8
up-down^*	81.3	39.4	29.2	59.3	131.9	22.8
Transformer^*	81.6	40.6	29.9	59.8	136.2	23.9
RSTNet^*	82.0	40.4	30.0	59.7	137.0	23.7
DSC-Net(本文)	82.9	41.1	30.1	59.9	138.5	23.7
注：粗体数值表示最优值.带“*”的方法为本文复现的结果，在复现中采用了与本文方法相同的环境，因此对比更加公平.

下载: 导出CSV

表 2 本文方法在在线测试中与各种典型方法的性能比较

Table 2 Performance Comparison of Our Method and Various Typical Methods in Online Test %

方法	BLEU-1		BLEU-2		BLEU-3		BLEU-4		METEOR		ROUGE		CIDEr
方法	c5	c40	c5	c40	c5	c40	c5	c40	c5	c40	c5	c40	c5	c40
up-down^[5](CVPR’18)	80.2	95.2	64.1	88.8	49.1	79.4	36.9	68.5	27.6	36.7	57.1	72.4	117.9	120.5
SGAE^[12](CVPR’19)	81.0	95.3	65.6	89.5	50.7	80.4	38.5	69.7	28.2	37.2	58.6	73.6	123.8	126.5
AoANet^[21](ICCV’19)	81.0	95.0	65.8	89.6	51.4	81.3	39.4	71.2	29.1	38.5	58.9	74.5	126.9	129.6
X-Transformer^[8](CVPR’20)	81.9	95.7	66.9	90.5	52.4	82.5	40.3	72.4	29.6	39.2	59.5	75.0	131.1	133.5
M²Transformer^[13](CVPR’20)	81.6	96.0	66.4	90.8	51.8	82.7	39.7	72.8	29.4	39.0	59.2	74.8	129.3	132.1
GET^[14](AAAI’21)	81.6	96.1	66.5	90.9	51.9	82.8	39.7	72.9	29.4	38.8	59.1	74.4	130.3	132.5
RSTNet^[18](CVPR’21)	82.1	96.4	67.0	91.3	52.2	83.0	40.0	73.1	29.6	39.1	59.5	74.6	131.9	134.0
ReFormer^[19](MM’22)	82.0	96.7					40.1	73.2	29.8	39.5	59.9	75.2	129.9	132.8
GAT^[20](Expert Syst.Appl’22)	81.1	95.1	66.1	89.7	51.8	81.5	39.9	71.4	29.1	38.4	59.1	74.4	127.8	129.8
RSTNet^[18](单一模型)	81.8	96.2	66.6	91.0	51.9	82.7	39.7	72.5	29.7	39.0	59.4	74.5	131.4	134.4
DSC-Net(本文单一模型)	81.9	96.3	66.6	91.1	51.9	82.8	39.7	72.8	29.7	39.2	59.3	74.4	132.7	135.3
DSC-Net(本文集成模型)	82.8	96.9	67.8	92.3	53.2	84.5	40.9	74.7	30.1	39.6	60.1	75.3	135.8	137.6
注：粗体数值表示最优值.

下载: 导出CSV

表 3 关于模块的消融结果

Table 3 Ablation Results on Modules %

网络结构				CIDEr
Baseline	DSCS	DSCC	DSF	CIDEr
✔				136.3
✔	✔			137.4
✔		✔		136.6
✔	✔	✔		137.1
✔	✔		✔	137.7
✔		✔	✔	137.5
✔	✔	✔	✔	138.5
注：粗体数值表示最优值；“✔”表示应用此模块.

下载: 导出CSV

表 4 关于双重语义协作机制的消融实验结果

Table 4 Ablation Experiment Results of DSCC and DSCS Modules %

方法	BLEU-1	BLEU-4	METEOR	ROUGE	CIDEr	SPICE
w/o Region	82.0	40.3	29.8	59.8	136.3	23.5
w/o DSC	82.4	40.6	29.9	59.7	136.8	23.4
w/o DSCC	82.2	40.5	29.9	59.6	137.7	23.5
w/o DSCS	82.5	40.7	30.0	59.9	137.5	23.6
DSC-Net	82.9	41.1	30.1	59.9	138.5	23.7
注：加粗数值表示最优值.

下载: 导出CSV

表 5 特征融合的消融实验结果

Table 5 Ablation Experiment Results of Feature Fusion %

方法	BLEU-1	BLEU-4	METEOR	ROUGE	CIDEr	SPICE
w/o DSF	82.2	40.6	30.0	59.7	137.1	23.6
w/o RFD	82.1	40.5	29.9	59.7	137.3	23.6
DSC-Net	82.9	41.1	30.1	59.9	138.5	23.7
注：加粗数值表示最优值.

下载: 导出CSV

参考文献(21)

[1]	李志欣,魏海洋,张灿龙,等. 图像描述生成研究进展[J]. 计算机研究与发展,2021,58(9):1951−1974 doi: 10.7544/issn1000-1239.2021.20200281 Li Zhixin, Wei Haiyang, Zhang Canlong, et al. Research progress on image captioning[J]. Journal of Computer Research and Development, 2021, 58(9): 1951−1974 (in Chinese) doi: 10.7544/issn1000-1239.2021.20200281
[2]	Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 31st Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2017 : 5998−6008
[3]	Vinyals O, Toshev A, Bengio S, et al. Show and tell: A neural image caption generator[C]//Proc of the 28th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3156−3164
[4]	Xu K, Ba J, Kiros R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//Proc of the 32nd Int Conf on Machine Learning. New York: ACM, 2015: 2048−2057
[5]	Anderson P, He Xiaodong, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 6077−6086
[6]	Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//Proc of the 38th Int Conf on Machine Learning. New York: ACM, 2021: 8748−8763
[7]	Guo Longteng, Liu Jing, Zhu Xinxin, et al. Normalized and geometry-aware self-attention network for image captioning[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10327−10336
[8]	Pan Yingwei, Yao Ting, Li Yehao, et al. X-linear attention networks for image captioning[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10971−10980
[9]	Shen Sheng, Li L H, Tan Hao, et al. How much can CLIP benefit vision-and-language tasks[C/OL]//Proc of the 10th Int Conf on Learning Representations. 2022[2022-01-29].https://openreview.net/forum?id=zf_Ll3HZWgy
[10]	Li Yehao, Pan Yingwei, Yao Ting, et al. Comprehending and ordering semantics for image captioning[C]//Proc of the 35th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 17990−17999
[11]	Karpathy A, Li Feifei. Deep visual-semantic alignments for generating image descriptions[C]//Proc of the 28th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3128−3137
[12]	Yang Xu, Tang Kaihua, Zhang Hanwang, et al. Auto-encoding scene graphs for image captioning[C]//Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 10685−10694
[13]	Cornia M, Stefanini M, Baraldi L, et al. Meshed-memory transformer for image captioning[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10578−10587
[14]	Ji Jiayi, Luo Yunpeng, Sun Xiaoshuai, et al. Improving image captioning by leveraging intra-and inter-layer global representation in transformer network[C]//Proc of the 35th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2021: 1655−1663
[15]	Yang Xu, Gao Chongyang, Zhang Hanwang, et al. Auto-parsing network for image captioning and visual question answering[C]//Proc of the 18th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 2197−2207
[16]	Rennie S J, Marcheret E, Mroueh Y, et al. Self-critical sequence training for image captioning[C]//Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 7008−7024
[17]	Sharma P, Ding Nan, Goodman S, et al. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning[C]//Proc of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2018: 2556−2565
[18]	Zhang Xuying, Sun Xiaoshuai, Luo Yunpeng, et al. RSTNet: Captioning with adaptive attention on visual and non-visual words[C]//Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 15465−15474
[19]	Yang Xuewen, Liu Yingru, Wang Xin. ReFormer: The relational transformer for image captioning[C]//Proc of the 30th ACM Int Conf on Multimedia. New York: ACM, 2022: 5398−5406
[20]	Wang Chi, Shen Yulin, Ji Luping. Geometry attention Transformer with position-aware LSTMs for image captioning[J]. Expert Systems with Applications, 2022, 201: 117174 doi: 10.1016/j.eswa.2022.117174
[21]	Huang Lun, Wang Wenmin, Chen Jie, et al. Attention on attention for image captioning[C]//Proc of the 17th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 4634−4643