An Image Captioning Method Based on DSC-Net

Jiang Zetao; Zhu Wencai; Jin Xin; Liao Peiqi; Huang Jingfan

DOI: 10.7544/issn1000-1239.202330523
CSTR: 32373.14.issn1000-1239.202330523

Citation: Jiang Zetao, Zhu Wencai, Jin Xin, Liao Peiqi, Huang Jingfan. An Image Captioning Method Based on DSC-Net[J]. Journal of Computer Research and Development, 2024, 61(11): 2897-2908. DOI: 10.7544/issn1000-1239.202330523

Guangxi Key Laboratory of Image and Graphic Intelligent Processing (Guilin University of Electronic Technology), Guilin, Guangxi 541004

Funds: This work was supported by the National Natural Science Foundation of China (62172118), the Guangxi Natural Science Key Foundation (2021GXNSFDA196002), the Project of Guangxi Key Laboratory of Image and Graphic Intelligent Processing (GIIP2302, GIIP2303, GIIP2304), the Innovation Project of Guangxi Graduate Education (YCSW2022269), and the Innovation Project of GUET Graduate Education (2023YCXS046)

Author Bio:
Jiang Zetao: born in 1961. PhD, professor. His main research interests include image processing, computer vision, and artificial intelligence

Zhu Wencai: born in 1999. Master. His main research interests include image processing and computer vision

Jin Xin: born in 1998. Master. His main research interests include image processing and computer vision

Liao Peiqi: born in 1995. Master. His main research interests include image processing and computer vision

Huang Jingfan: born in 1999. Master. His main research interests include image processing and computer vision

Received Date: June 18, 2023
Revised Date: December 18, 2023
Available Online: May 28, 2024

Abstract

Grid features extracted by the CLIP (contrastive language-image pre-training) image encoder are visual features that lie close to the text domain, so they are easy to convert into the corresponding natural-language semantics and can alleviate the semantic gap problem. They may therefore become an important source of visual features for image captioning. However, grid feature extraction does not take into account that dividing the image content may split a complete object across several grids. Such splitting inevitably leaves the extracted features without a complete representation of the object, which in turn leaves the generated sentence without an accurate description of the objects and the relationships between them. To address this problem with the grid features extracted by the CLIP image encoder, we propose a dual semantic collaborative network (DSC-Net) for image captioning. Specifically, a dual semantic collaborative self-attention (DSCS) module is first proposed to strengthen the representation of object information in the CLIP grid features. Then, a dual semantic collaborative cross-attention (DSCC) module is proposed to integrate grid-level and object-level semantic information into the visual features that are used to predict sentences. Finally, a dual semantic fusion (DSF) module is proposed to provide region-oriented fusion features for the two semantic collaboration modules above and to resolve the correlation conflicts that may arise during semantic collaboration. In extensive experiments on the COCO dataset, the proposed model achieves a CIDEr score of 138.5% on the offline test split of Karpathy et al. and a CIDEr score of 137.6% on the official online test. Compared with current mainstream image captioning methods, this result shows a clear advantage.
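
The object-splitting problem described in the abstract can be illustrated with a small sketch. It is not part of DSC-Net and does not reproduce any of its modules; it only counts how many cells of a uniform grid a single object box overlaps. The function name `cells_covered`, the 224x224 image size, the 7x7 grid, and the box coordinates are all assumptions chosen for illustration.

```python
from math import ceil, floor


def cells_covered(box, image_size, grid_size):
    """Return the (row, col) grid cells that an axis-aligned box overlaps.

    box        -- (x_min, y_min, x_max, y_max) in pixels; a hypothetical object box
    image_size -- (width, height) in pixels
    grid_size  -- (cols, rows); an assumed grid, not the paper's setting
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    cols, rows = grid_size
    cell_w = width / cols
    cell_h = height / rows

    # First and last cell index touched along each axis, clamped to the grid.
    # A box edge that lies exactly on a cell boundary does not count as
    # touching the next cell.
    col_start = max(0, floor(x_min / cell_w))
    col_end = min(cols - 1, ceil(x_max / cell_w) - 1)
    row_start = max(0, floor(y_min / cell_h))
    row_end = min(rows - 1, ceil(y_max / cell_h) - 1)

    return [(r, c)
            for r in range(row_start, row_end + 1)
            for c in range(col_start, col_end + 1)]


# A 224x224 image split into a 7x7 grid gives 32-pixel cells.
# The box below is made up for the example.
cells = cells_covered((40, 50, 130, 100), (224, 224), (7, 7))
print(len(cells), "of", 7 * 7, "cells overlap the object")
```

With these assumed numbers the single object is spread over 12 separate grid cells, so no individual grid feature sees the whole object. This is the kind of fragmentation that the abstract says motivates combining grid-level and object-level semantics.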

References

[1]	Li Zhixin, Wei Haiyang, Zhang Canlong, et al. Research progress on image captioning[J]. Journal of Computer Research and Development, 2021, 58(9): 1951−1974 (in Chinese) doi: 10.7544/issn1000-1239.2021.20200281
[2]	Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 31st Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 5998−6008
[3]	Vinyals O, Toshev A, Bengio S, et al. Show and tell: A neural image caption generator[C]//Proc of the 28th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3156−3164
[4]	Xu K, Ba J, Kiros R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//Proc of the 32nd Int Conf on Machine Learning. New York: ACM, 2015: 2048−2057
[5]	Anderson P, He Xiaodong, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 6077−6086
[6]	Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//Proc of the 38th Int Conf on Machine Learning. New York: ACM, 2021: 8748−8763
[7]	Guo Longteng, Liu Jing, Zhu Xinxin, et al. Normalized and geometry-aware self-attention network for image captioning[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10327−10336
[8]	Pan Yingwei, Yao Ting, Li Yehao, et al. X-linear attention networks for image captioning[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10971−10980
[9]	Shen Sheng, Li L H, Tan Hao, et al. How much can CLIP benefit vision-and-language tasks[C/OL]//Proc of the 10th Int Conf on Learning Representations. 2022[2022-01-29].https://openreview.net/forum?id=zf_Ll3HZWgy
[10]	Li Yehao, Pan Yingwei, Yao Ting, et al. Comprehending and ordering semantics for image captioning[C]//Proc of the 35th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 17990−17999
[11]	Karpathy A, Li Feifei. Deep visual-semantic alignments for generating image descriptions[C]//Proc of the 28th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3128−3137
[12]	Yang Xu, Tang Kaihua, Zhang Hanwang, et al. Auto-encoding scene graphs for image captioning[C]//Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 10685−10694
[13]	Cornia M, Stefanini M, Baraldi L, et al. Meshed-memory transformer for image captioning[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10578−10587
[14]	Ji Jiayi, Luo Yunpeng, Sun Xiaoshuai, et al. Improving image captioning by leveraging intra-and inter-layer global representation in transformer network[C]//Proc of the 35th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2021: 1655−1663
[15]	Yang Xu, Gao Chongyang, Zhang Hanwang, et al. Auto-parsing network for image captioning and visual question answering[C]//Proc of the 18th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 2197−2207
[16]	Rennie S J, Marcheret E, Mroueh Y, et al. Self-critical sequence training for image captioning[C]//Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 7008−7024
[17]	Sharma P, Ding Nan, Goodman S, et al. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning[C]//Proc of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2018: 2556−2565
[18]	Zhang Xuying, Sun Xiaoshuai, Luo Yunpeng, et al. RSTNet: Captioning with adaptive attention on visual and non-visual words[C]//Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 15465−15474
[19]	Yang Xuewen, Liu Yingru, Wang Xin. ReFormer: The relational transformer for image captioning[C]//Proc of the 30th ACM Int Conf on Multimedia. New York: ACM, 2022: 5398−5406
[20]	Wang Chi, Shen Yulin, Ji Luping. Geometry attention Transformer with position-aware LSTMs for image captioning[J]. Expert Systems with Applications, 2022, 201: 117174 doi: 10.1016/j.eswa.2022.117174
[21]	Huang Lun, Wang Wenmin, Chen Jie, et al. Attention on attention for image captioning[C]//Proc of the 17th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 4634−4643
