Citation: Liu Maofu, Bi Jianqi, Zhou Bingying, Hu Huijun. Interpretable Image Caption Generation Based on Dependency Syntax[J]. Journal of Computer Research and Development, 2023, 60(9): 2115-2126. DOI: 10.7544/issn1000-1239.202220432

Interpretable Image Caption Generation Based on Dependency Syntax

Funds: This work was supported by the Major Projects of the National Social Science Foundation of China (11&ZD189), the Planning Projects of Guizhou Provincial Science and Technology ([2020]3003), and the Graduate Innovation Fund of Wuhan University of Science and Technology (2022210).
  • Author Bios:

    Liu Maofu: born in 1977. PhD, professor. Senior member of CCF. His main research interests include natural language processing and intelligent media computing

    Bi Jianqi: born in 1998. MSc. His main research interest is intelligent media computing

    Zhou Bingying: born in 2000. Master candidate. Her main research interest is intelligent media computing

    Hu Huijun: born in 1976. PhD, associate professor. Her main research interests include natural language processing and intelligent media computing

  • Received Date: May 31, 2022
  • Revised Date: September 26, 2022
  • Available Online: April 13, 2023
  • Abstract: Although existing image captioning models can detect and represent target objects and visual relationships, they have not addressed the interpretability of the image captioning model from the perspective of syntactic relations. To this end, we present an interpretable image caption generation method based on dependency syntax triplet modeling (IDSTM), which leverages multi-task learning to jointly generate the dependency syntax triplet sequence and the image caption. IDSTM first obtains latent dependency syntactic features from the input image through the dependency syntax encoder, and then feeds these features, together with the dependency syntax triplets and textual word embedding vectors, into a single LSTM (long short-term memory) network to generate the dependency syntax triplet sequence as prior knowledge. Second, the dependency syntactic features are fed into the captioning encoder to extract visual object textual features. Finally, hard and soft constraints are adopted to incorporate the dependency syntactic and relation features into a double LSTM for interpretable image caption generation. By adding the dependency syntax triplet sequence generation task, IDSTM improves the interpretability of the image captioning model without a significant decrease in the accuracy of the generated captions. In addition, we propose the novel metrics B1-DS (BLEU-1-DS), B4-DS (BLEU-4-DS), and M-DS (METEOR-DS) to evaluate the quality of the generated dependency syntax triplets, and we report extensive experimental results on the MSCOCO dataset that verify the effectiveness and interpretability of IDSTM.
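To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the two-branch, multi-task decoding flow: a single LSTM generates the dependency syntax triplet sequence, and a double LSTM generates the caption conditioned on the triplet states. All module names, dimensions, and the mean-pooled soft-constraint conditioning are hypothetical illustrations under stated assumptions (linearized triplets, teacher forcing), not the authors' released implementation.

```python
# Hypothetical sketch of the IDSTM-style two-branch, multi-task decoding flow.
import torch
import torch.nn as nn

class TripletBranch(nn.Module):
    """Single LSTM that generates the dependency syntax triplet sequence."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, triplet_vocab):
        super().__init__()
        self.embed = nn.Embedding(triplet_vocab, embed_dim)
        self.lstm = nn.LSTM(feat_dim + embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, triplet_vocab)

    def forward(self, syntax_feats, triplet_tokens):
        # syntax_feats: (B, feat_dim) from the dependency syntax encoder
        # triplet_tokens: (B, T) linearized <head, relation, dependent> ids
        emb = self.embed(triplet_tokens)                           # (B, T, E)
        feats = syntax_feats.unsqueeze(1).expand(-1, emb.size(1), -1)
        states, _ = self.lstm(torch.cat([feats, emb], dim=-1))    # (B, T, H)
        return self.proj(states), states

class CaptionBranch(nn.Module):
    """Two-layer ("double") LSTM caption decoder conditioned on triplet states."""
    def __init__(self, embed_dim, hidden_dim, word_vocab):
        super().__init__()
        self.embed = nn.Embedding(word_vocab, embed_dim)
        self.lstm = nn.LSTM(embed_dim + hidden_dim, hidden_dim,
                            num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden_dim, word_vocab)

    def forward(self, caption_tokens, triplet_states):
        # Soft-constraint sketch: condition every step on the mean-pooled
        # triplet states. A hard constraint would instead mask the output
        # vocabulary to words licensed by the generated triplets.
        emb = self.embed(caption_tokens)                           # (B, T, E)
        prior = triplet_states.mean(dim=1, keepdim=True)           # (B, 1, H)
        prior = prior.expand(-1, emb.size(1), -1)
        states, _ = self.lstm(torch.cat([emb, prior], dim=-1))
        return self.proj(states)

def idstm_loss(trip_logits, trip_gold, cap_logits, cap_gold, alpha=0.5):
    """Joint multi-task loss: caption loss plus weighted triplet loss."""
    ce = nn.CrossEntropyLoss()
    l_trip = ce(trip_logits.flatten(0, 1), trip_gold.flatten())
    l_cap = ce(cap_logits.flatten(0, 1), cap_gold.flatten())
    return l_cap + alpha * l_trip

# Shape-only demo (real training would shift gold targets by one step).
B, Tt, Tc = 2, 6, 8
syntax_feats = torch.randn(B, 512)
trip_in = torch.randint(0, 1000, (B, Tt))
cap_in = torch.randint(0, 5000, (B, Tc))
trip_logits, trip_states = TripletBranch(512, 256, 512, 1000)(syntax_feats, trip_in)
cap_logits = CaptionBranch(256, 512, 5000)(cap_in, trip_states)
loss = idstm_loss(trip_logits, trip_in, cap_logits, cap_in)
```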
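The proposed B1-DS, B4-DS, and M-DS metrics apply the standard BLEU-1, BLEU-4, and METEOR machinery to the generated dependency syntax triplet sequence instead of the caption words. A hedged sketch of one plausible realization, assuming each triplet is flattened token by token (the paper's exact linearization may differ), using NLTK's BLEU implementation:

```python
# Hypothetical realization of B1-DS / B4-DS: BLEU over linearized dependency
# syntax triplets. M-DS would analogously feed the same strings to a METEOR
# scorer.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def linearize(triplets):
    """Flatten (head, relation, dependent) triplets into one token list."""
    return [token for triplet in triplets for token in triplet]

def bleu_ds(ref_triplets, hyp_triplets, n=1):
    weights = tuple(1.0 / n for _ in range(n))
    return sentence_bleu([linearize(ref_triplets)], linearize(hyp_triplets),
                         weights=weights,
                         smoothing_function=SmoothingFunction().method1)

# Example with a single reference/hypothesis pair.
ref = [("man", "nsubj", "riding"), ("riding", "obj", "horse")]
hyp = [("man", "nsubj", "riding"), ("riding", "obj", "bike")]
print(bleu_ds(ref, hyp, n=1))  # B1-DS
print(bleu_ds(ref, hyp, n=4))  # B4-DS
```

Smoothing is applied because triplet sequences are short, so missing higher-order n-grams would otherwise zero out the B4-DS score.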
