Citation: Jiang Zetao, Zhu Wencai, Jin Xin, Liao Peiqi, Huang Jingfan. An Image Captioning Method Based on DSC-Net[J]. Journal of Computer Research and Development, 2024, 61(11): 2897-2908. DOI: 10.7544/issn1000-1239.202330523
Grid features extracted by the CLIP (contrastive language-image pre-training) image encoder lie closer to the text domain than conventional visual features and are therefore easy to convert into corresponding natural-language semantics, which alleviates the semantic gap problem and makes them a promising source of visual features for future image captioning. However, this approach overlooks the fact that partitioning the image into grids may split a complete object across several cells. Such segmentation inevitably prevents the extracted features from expressing object information completely, and in turn prevents the generated sentence from accurately expressing the objects and the relationships between them. To address this weakness of CLIP grid features, we propose the dual semantic collaborative network (DSC-Net) for image captioning. Specifically, a dual semantic collaborative self-attention (DSCS) module is first proposed to enhance the expression of object information in CLIP grid features. A dual semantic collaborative cross-attention (DSCC) module is then proposed to integrate the semantic information of grids and objects into the visual features used for sentence prediction. Finally, a dual semantic fusion (DSF) module is proposed to provide region-oriented fusion features to the two collaboration modules above and to resolve the correlation conflicts that may arise during semantic collaboration. In extensive experiments on the COCO dataset, the proposed model achieves a CIDEr score of 138.5% on the offline test split of Karpathy et al. and a CIDEr score of 137.6% on the official online test, a clear advantage over current mainstream image captioning methods.
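As a rough illustration of the grid-object collaboration the abstract describes, the following PyTorch sketch wires the three modules together in one plausible way. Everything beyond the abstract is an assumption: the class names (DSCNetSketch, DualSemanticFusion), the use of standard multi-head attention for DSCS and DSCC, the mean-pool-and-project fusion rule, and all tensor shapes are illustrative only; the paper defines the actual module designs.

```python
# A minimal sketch of the dual-semantic idea, assuming standard
# multi-head attention stands in for the paper's DSCS/DSCC designs.
import torch
import torch.nn as nn


class DualSemanticFusion(nn.Module):
    """DSF (assumed form): fuse grid and object features into a
    region-oriented feature shared by both collaboration modules."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, grid_feats, obj_feats):
        # Pool each stream and project the concatenation; one plausible
        # fusion rule, not necessarily the paper's.
        g = grid_feats.mean(dim=1, keepdim=True)       # (B, 1, D)
        o = obj_feats.mean(dim=1, keepdim=True)        # (B, 1, D)
        return self.proj(torch.cat([g, o], dim=-1))    # (B, 1, D)


class DSCNetSketch(nn.Module):
    """Assumed skeleton: DSCS refines grid features under the fused
    region signal; DSCC cross-attends grid queries to object features."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.dsf = DualSemanticFusion(d_model)
        self.dscs = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dscc = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, grid_feats, obj_feats):
        # grid_feats: (B, N_grid, D) patch tokens from the CLIP image encoder
        # obj_feats:  (B, N_obj, D) region features from an object detector
        fused = self.dsf(grid_feats, obj_feats)
        # DSCS: self-attention over grids, conditioned on the fused region
        # feature (here simply prepended as an extra token).
        grid_aug = torch.cat([fused, grid_feats], dim=1)
        grid_out, _ = self.dscs(grid_aug, grid_aug, grid_aug)
        grid_out = grid_out[:, 1:]                     # drop the fused token
        # DSCC: cross-attention from refined grids to object features.
        visual, _ = self.dscc(grid_out, obj_feats, obj_feats)
        return visual  # visual features fed to the caption decoder


if __name__ == "__main__":
    model = DSCNetSketch()
    grids = torch.randn(2, 49, 512)    # e.g. a 7x7 CLIP feature grid
    objects = torch.randn(2, 10, 512)  # e.g. 10 detected regions
    print(model(grids, objects).shape)  # torch.Size([2, 49, 512])
```

Prepending the fused token is only one way to inject the region-oriented signal into self-attention; gating or bias terms would serve the same role in this sketch.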
[1] Li Zhixin, Wei Haiyang, Zhang Canlong, et al. Research progress on image captioning[J]. Journal of Computer Research and Development, 2021, 58(9): 1951−1974 (in Chinese). DOI: 10.7544/issn1000-1239.2021.20200281
[2] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 31st Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 5998−6008
[3] Vinyals O, Toshev A, Bengio S, et al. Show and tell: A neural image caption generator[C]//Proc of the 28th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3156−3164
[4] Xu K, Ba J, Kiros R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//Proc of the 32nd Int Conf on Machine Learning. New York: ACM, 2015: 2048−2057
[5] Anderson P, He Xiaodong, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 6077−6086
[6] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//Proc of the 38th Int Conf on Machine Learning. New York: ACM, 2021: 8748−8763
[7] Guo Longteng, Liu Jing, Zhu Xinxin, et al. Normalized and geometry-aware self-attention network for image captioning[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10327−10336
[8] Pan Yingwei, Yao Ting, Li Yehao, et al. X-linear attention networks for image captioning[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10971−10980
[9] Shen Sheng, Li L H, Tan Hao, et al. How much can CLIP benefit vision-and-language tasks?[C/OL]//Proc of the 10th Int Conf on Learning Representations. 2022 [2022-01-29]. https://openreview.net/forum?id=zf_Ll3HZWgy
[10] Li Yehao, Pan Yingwei, Yao Ting, et al. Comprehending and ordering semantics for image captioning[C]//Proc of the 35th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 17990−17999
[11] Karpathy A, Li Feifei. Deep visual-semantic alignments for generating image descriptions[C]//Proc of the 28th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3128−3137
[12] Yang Xu, Tang Kaihua, Zhang Hanwang, et al. Auto-encoding scene graphs for image captioning[C]//Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 10685−10694
[13] Cornia M, Stefanini M, Baraldi L, et al. Meshed-memory transformer for image captioning[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10578−10587
[14] Ji Jiayi, Luo Yunpeng, Sun Xiaoshuai, et al. Improving image captioning by leveraging intra- and inter-layer global representation in transformer network[C]//Proc of the 35th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2021: 1655−1663
[15] Yang Xu, Gao Chongyang, Zhang Hanwang, et al. Auto-parsing network for image captioning and visual question answering[C]//Proc of the 18th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 2197−2207
[16] Rennie S J, Marcheret E, Mroueh Y, et al. Self-critical sequence training for image captioning[C]//Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 7008−7024
[17] Sharma P, Ding Nan, Goodman S, et al. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning[C]//Proc of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2018: 2556−2565
[18] Zhang Xuying, Sun Xiaoshuai, Luo Yunpeng, et al. RSTNet: Captioning with adaptive attention on visual and non-visual words[C]//Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 15465−15474
[19] Yang Xuewen, Liu Yingru, Wang Xin. ReFormer: The relational transformer for image captioning[C]//Proc of the 30th ACM Int Conf on Multimedia. New York: ACM, 2022: 5398−5406
[20] Wang Chi, Shen Yulin, Ji Luping. Geometry attention Transformer with position-aware LSTMs for image captioning[J]. Expert Systems with Applications, 2022, 201: 117174. DOI: 10.1016/j.eswa.2022.117174
[21] Huang Lun, Wang Wenmin, Chen Jie, et al. Attention on attention for image captioning[C]//Proc of the 17th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 4634−4643