Abstract:
Although existing image captioning models can detect and represent target objects and visual relationships, they have not addressed the interpretability of image captioning from the perspective of syntactic relations. To this end, we present an interpretable image captioning model based on dependency syntax triplet modeling (IDSTM), which leverages multi-task learning to jointly generate a dependency syntax triplet sequence and an image caption. IDSTM first obtains latent dependency syntactic features from the input image through a dependency syntax encoder, and then feeds these features, together with dependency syntactic triplets and textual word embedding vectors, into a single LSTM (long short-term memory) network to generate the dependency syntactic triplet sequence as prior knowledge. Second, the dependency syntactic features are passed to a captioning encoder to extract visual object and textual features. Finally, hard and soft constraints are adopted to incorporate the dependency syntactic and relation features into a double LSTM for interpretable image caption generation. By jointly performing the dependency syntax triplet sequence generation task, IDSTM improves the interpretability of the image captioning model without a significant decrease in the accuracy of the generated captions. In addition, we propose the novel metrics B1-DS (BLEU-1-DS), B4-DS (BLEU-4-DS), and M-DS (METEOR-DS) to evaluate the quality of dependency syntax triplets, and we report extensive experimental results on the MSCOCO dataset demonstrating the effectiveness and interpretability of IDSTM.
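To make the notion of a dependency syntax triplet concrete, the following is a minimal sketch, not the paper's implementation, of how such triplets can be extracted from a caption with an off-the-shelf parser. The (head word, dependency relation, dependent word) format and the use of spaCy here are illustrative assumptions; the paper's exact triplet definition and parsing pipeline may differ.

```python
# A minimal sketch of extracting dependency syntax triplets from a caption.
# Assumptions (not from the paper): triplets take the form
# (head word, dependency relation, dependent word), and spaCy's parser
# stands in for whichever parser produced the training triplets.
import spacy

nlp = spacy.load("en_core_web_sm")

def caption_to_triplets(caption: str):
    """Parse a caption and return its dependency triplets in left-to-right order."""
    doc = nlp(caption)
    # Skip the ROOT token, which has no governing head.
    return [(tok.head.text, tok.dep_, tok.text) for tok in doc if tok.dep_ != "ROOT"]

print(caption_to_triplets("a man riding a horse on the beach"))
# Prints a list such as [('man', 'det', 'a'), ('man', 'acl', 'riding'), ...]
```

Under these assumptions, a triplet sequence of this kind could serve as the supervision signal for the triplet-generation branch of a multi-task model such as IDSTM, alongside the ordinary caption words.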