Citation: Liu Mingyang, Wang Ruomei, Zhou Fan, Lin Ge. Video Question Answering Scheme Based on Multimodal Knowledge Active Learning[J]. Journal of Computer Research and Development, 2024, 61(4): 889-902. DOI: 10.7544/issn1000-1239.202221008
Video question answering requires models to understand, fuse, and reason over the multimodal data in videos, helping people quickly retrieve, analyze, and summarize complex video scenes; it has therefore become an active research topic in artificial intelligence. However, existing methods fail to capture the motion details of visual objects during feature extraction, which can lead to spurious causal inferences. Moreover, in data fusion and reasoning they lack an effective active learning capability, so they struggle to acquire prior knowledge beyond what feature extraction provides, limiting deep understanding of multimodal content. To address these issues, we propose a video question answering scheme based on multimodal knowledge active learning. The scheme captures the semantic correlations of visual targets across image sequences, together with their dynamic relationships to the surrounding environment, to build a motion trajectory for each visual target. Static content is then supplemented with this dynamic content, yielding more accurate video feature representations for data fusion and reasoning. Finally, a knowledge auto-enhancement multimodal fusion and reasoning model enables self-improvement of multimodal understanding and focuses logical reasoning, closing the gap in deep understanding of multimodal content. Experimental results show that the proposed scheme outperforms state-of-the-art video question answering algorithms, and extensive ablation and visualization experiments verify its rationality.
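To make the pipeline concrete, the sketch below shows one minimal way to supplement static (appearance) features with dynamic (motion-trajectory) features and perform question-guided fusion via attention, in the spirit of the scheme the abstract describes. This is an illustrative assumption, not the authors' implementation: the module name, feature dimensions, additive static/dynamic combination, and the single attention layer are all placeholders chosen for clarity.

```python
# Illustrative sketch only (assumed design, not the paper's actual model):
# per-object appearance features are supplemented with motion-trajectory
# features, then question tokens attend over the combined video features.
import torch
import torch.nn as nn

class StaticDynamicFusion(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Question tokens (queries) attend over per-object video features.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, appearance, motion, question):
        # appearance, motion: (batch, num_objects, d_model)
        # question:           (batch, num_tokens, d_model)
        video = self.norm(appearance + motion)        # static + dynamic content
        fused, _ = self.attn(question, video, video)  # question-guided fusion
        return fused.mean(dim=1)                      # pooled multimodal vector

# Usage with random tensors standing in for extracted features:
batch, objects, tokens, d = 2, 10, 12, 512
model = StaticDynamicFusion(d)
out = model(torch.randn(batch, objects, d),
            torch.randn(batch, objects, d),
            torch.randn(batch, tokens, d))
print(out.shape)  # torch.Size([2, 512])
```

In a full system, the pooled vector would feed an answer decoder, and the knowledge auto-enhancement step would refine the fused representation iteratively; those stages are omitted here.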