Research Progress of Video Question Answering Technologies

包翠竹; 丁凯; 董建峰; 杨勋; 谢满德; 王勋

doi:10.7544/issn1000-1239.202220294

Abstract

Video question answering (VideoQA), which automatically answers natural language questions based on video content, is a relatively new research direction in the vision-language field and has attracted wide attention in recent years. Solving the VideoQA task is of great significance for human-computer interaction, intelligent education, intelligent transportation, scene analysis, video retrieval, and other fields. VideoQA is challenging because it requires a model to understand both the video and the text of the question in order to generate an answer. In this work, we first analyze the differences between VideoQA and image question answering (ImageQA) and summarize four challenges that VideoQA faces relative to ImageQA. Around these challenges, we then classify existing VideoQA models in detail, focusing on how the models are implemented and how different models relate to one another. Next, we describe the benchmark datasets commonly used in VideoQA, report the performance of current mainstream algorithms on some of these datasets, and compare and analyze the results. Finally, we discuss future challenges and research trends in the field, offering some ideas for further research.
