Citation: Liu Maofu, Bi Jianqi, Zhou Bingying, Hu Huijun. Interpretable Image Caption Generation Based on Dependency Syntax[J]. Journal of Computer Research and Development, 2023, 60(9): 2115-2126. DOI: 10.7544/issn1000-1239.202220432

    Interpretable Image Caption Generation Based on Dependency Syntax


Abstract: Although existing image captioning models can detect and represent target objects and their visual relationships, they have not addressed the interpretability of the captioning model from the perspective of textual syntactic relations. To this end, we present an interpretable image caption generation model based on dependency syntax triplets modeling (IDSTM), which leverages multi-task learning to jointly generate the dependency syntax triplet sequence and the image caption. IDSTM first obtains latent dependency syntactic features from the input image through a dependency syntax encoder, and feeds these features, together with the dependency syntax triplet and textual word embedding vectors, into a single-layer LSTM (long short-term memory) network to generate the dependency syntax triplet sequence as prior knowledge. Second, the dependency syntactic features are input into the captioning encoder to extract visual entity word features. Finally, two mechanisms, a hard constraint and a soft constraint, are adopted to fuse the dependency syntactic and relation features into a two-layer LSTM that generates the image caption. Through the dependency syntax triplet sequence generation task, IDSTM improves the interpretability of the image captioning model without a significant decrease in the accuracy of the generated captions. In addition, we propose the novel metrics B1-DS (BLEU-1-DS), B4-DS (BLEU-4-DS), and M-DS (METEOR-DS) to evaluate the quality of the generated dependency syntax triplet sequences, and extensive experiments on the MSCOCO dataset verify the effectiveness and interpretability of IDSTM.
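The abstract does not define the exact form of a dependency syntax triplet. As an illustration only, the following minimal Python sketch shows how (head word, dependency relation, dependent word) triplets of the kind IDSTM uses as prior knowledge can be read off a caption's dependency parse; spaCy and its en_core_web_sm model are assumptions for this sketch, not tools named in the paper.

```python
# Hedged sketch: extracting (head, relation, dependent) dependency
# syntax triplets from a caption. spaCy / en_core_web_sm are
# illustrative choices, not the paper's stated tooling.
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_triplets(caption: str):
    """Return one (head word, dependency relation, dependent word)
    triplet per non-root token of the caption's dependency parse."""
    doc = nlp(caption)
    return [(tok.head.text, tok.dep_, tok.text)
            for tok in doc if tok.dep_ != "ROOT"]

# Each token contributes one triplet, e.g. ("man", "det", "a")
# for the determiner attached to "man".
print(dependency_triplets("a man riding a horse on the beach"))
```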

       
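The abstract presents B1-DS, B4-DS, and M-DS as BLEU-1, BLEU-4, and METEOR applied to dependency syntax triplet sequences. One plausible reading, sketched below with NLTK's sentence-level BLEU, treats each whole triplet as a single token and scores the generated sequence against the reference; this linearization is our assumption, as the abstract does not specify it.

```python
# Hedged sketch of B1-DS / B4-DS: BLEU-1 / BLEU-4 computed over
# dependency syntax triplet sequences. Serializing each triplet
# into one token is an assumption, not the paper's stated scheme.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def triplets_to_tokens(triplets):
    """Serialize each (head, relation, dependent) triplet into one token."""
    return ["|".join(t) for t in triplets]

def b1_ds(reference, hypothesis):
    return sentence_bleu([triplets_to_tokens(reference)],
                         triplets_to_tokens(hypothesis),
                         weights=(1.0, 0, 0, 0),
                         smoothing_function=SmoothingFunction().method1)

def b4_ds(reference, hypothesis):
    return sentence_bleu([triplets_to_tokens(reference)],
                         triplets_to_tokens(hypothesis),
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

ref = [("man", "det", "a"), ("riding", "nsubj", "man"), ("riding", "dobj", "horse")]
hyp = [("man", "det", "a"), ("riding", "nsubj", "man"), ("riding", "dobj", "bike")]
print(b1_ds(ref, hyp), b4_ds(ref, hyp))
```

M-DS would analogously apply METEOR to the same serialized sequences; whether the paper scores triplets jointly or field by field is not stated in the abstract.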

