Abstract:
VideoQA (video question answering), which automatically answers natural language questions based on the content of videos, is a relatively new research direction in the field of vision and language and has attracted extensive attention in recent years. Solving the VideoQA task is of great significance for human-computer interaction, intelligent education, intelligent transportation, scenario analysis, video retrieval, and other fields. VideoQA is challenging because it requires a model to understand the semantic information of both the video and the question in order to generate the answer. In this work, we analyze the differences between VideoQA and ImageQA (image question answering), and summarize four challenges that VideoQA faces relative to ImageQA. We then carefully classify existing VideoQA models according to the research methods addressing these challenges. Following this classification, we introduce the background of each line of work and focus on the implementation of the models and the relationships between them. After that, we summarize the benchmark datasets commonly used in VideoQA, describe in detail the performance of current mainstream algorithms on several of these datasets, and provide a comparison and analysis of the results. Finally, we discuss future challenges and research trends in this field, which may provide ideas for further research.