ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development


Research Progress on Image Captioning

Li Zhixin1, Wei Haiyang1, Zhang Canlong1, Ma Huifang2, Shi Zhongzhi3

  1. Guangxi Key Laboratory of Multi-Source Information Mining and Security (Guangxi Normal University), Guilin, Guangxi 541004

  2. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070

  3. Key Laboratory of Intelligent Information Processing (Institute of Computing Technology, Chinese Academy of Sciences), Chinese Academy of Sciences, Beijing 100190

  • Online: 2021-02-05
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61966004, 61663004, 61866004, 61762078) and the Guangxi Natural Science Foundation (2019GXNSFDA245018, 2018GXNSFDA281009, 2017GXNSFAA198365).

Abstract: Image captioning combines two research fields, computer vision and natural language processing. It requires not only a complete semantic understanding of the image but also the ability to express that understanding in complex natural language, which makes it a crucial task for further research on visual intelligence in line with human perception. This paper reviews research progress on image captioning. First, five key technologies involved in current deep learning based image captioning methods are summarized and analyzed: overall architecture, learning strategy, feature mapping, language model, and attention mechanism. Then, following the development of the field, existing image captioning methods are divided into four categories: template based methods, retrieval based methods, encoder-decoder architecture based methods, and compositional architecture based methods. We describe the basic concepts, representative methods, and research status of each category. We place particular emphasis on the methods based on the encoder-decoder architecture and their innovative ideas, such as multimodal space, visual space, semantic space, attention mechanism, and model optimization. Subsequently, from an experimental point of view, we present the common benchmark datasets and evaluation measures in the field of image captioning, and we compare the performance of some typical methods on two benchmark datasets. Finally, with the goal of improving the accuracy, integrity, novelty, and diversity of image captions, several future development trends of image captioning are presented.

Key words: image captioning, encoder-decoder architecture, compositional architecture, attention mechanism, convolutional neural network, recurrent neural network, long short-term memory