    Liu Mingyang, Wang Ruomei, Zhou Fan, Lin Ge. Video Question Answering Scheme Based on Multimodal Knowledge Active Learning[J]. Journal of Computer Research and Development, 2024, 61(4): 889-902. DOI: 10.7544/issn1000-1239.202221008

    Video Question Answering Scheme Based on Multimodal Knowledge Active Learning

    • Abstract: Video question answering is a hot research topic in artificial intelligence. Existing methods fail to capture the motion details of visual objects during feature extraction, which can lead to the establishment of false causal relationships. In addition, during data fusion and reasoning, existing methods lack effective active-learning ability, making it difficult to acquire prior knowledge beyond feature extraction and limiting the model's deep understanding of multimodal content. To address these problems, we first design an explicit multimodal feature extraction module that builds the motion trajectory of each visual object by capturing the semantic correlations among visual objects in the image sequence and their dynamic relationships with the surrounding environment; supplementing static content with dynamic content then provides more accurate video feature representations for data fusion and reasoning. Second, we propose a knowledge self-enhancing multimodal data fusion and reasoning model that achieves self-improvement of multimodal information understanding and focused logical reasoning, deepens the understanding of multimodal features, and reduces the dependence on prior knowledge. Finally, we propose a video question answering scheme based on multimodal knowledge active learning. Experimental results show that the scheme outperforms state-of-the-art video question answering algorithms, and extensive ablation and visualization experiments also verify its soundness.


      Abstract: Video question answering requires models to understand, fuse, and reason about the multimodal data in videos to help people quickly retrieve, analyze, and summarize complex scenes, and it has become a hot research topic in artificial intelligence. However, existing methods lack the ability to capture the motion details of visual objects during feature extraction, which may lead to false causality. In addition, in data fusion and reasoning, existing methods lack effective active-learning ability, making it difficult to obtain prior knowledge beyond feature extraction, which limits the model's deep understanding of multimodal content. To address these issues, we propose a video question answering solution based on multimodal knowledge active learning. The solution captures the semantic correlations of visual objects in image sequences and their dynamic relationships with the surrounding environment to establish the motion trajectory of each visual object. Static content is then supplemented with dynamic content to provide more accurate video feature representations for data fusion and reasoning. Next, the solution achieves self-improvement of multimodal information understanding and focused logical reasoning through a knowledge auto-enhancement multimodal data fusion and reasoning model, filling the gap in deep understanding of multimodal content. Experimental results show that our scheme outperforms the most advanced video question answering algorithms, and extensive ablation and visualization experiments also verify its soundness.
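      The abstract's core idea of supplementing static (appearance) features with dynamic (motion-trajectory) features can be illustrated with a minimal sketch. All names, shapes, and the simple pooling-and-concatenation fusion below are illustrative assumptions, not the authors' actual implementation:

```python
# Hypothetical sketch: build a motion trajectory for one tracked visual object
# from its per-frame bounding boxes, then fuse the resulting dynamic cues with
# pooled static appearance features. Shapes and pooling choices are assumptions.
import numpy as np

def trajectory_features(boxes):
    """Frame-to-frame displacement of an object's bounding-box centers."""
    centers = np.array([[(x1 + x2) / 2, (y1 + y2) / 2]
                        for x1, y1, x2, y2 in boxes])
    return np.diff(centers, axis=0)  # (T-1, 2) motion vectors

def fuse_static_dynamic(appearance, motion):
    """Concatenate mean-pooled appearance features with pooled motion cues."""
    static = appearance.mean(axis=0)          # pooled appearance, shape (D,)
    dynamic = motion.mean(axis=0)             # average motion vector, shape (2,)
    return np.concatenate([static, dynamic])  # fused representation, (D+2,)

# Toy example: one object tracked over 3 frames, moving right at 1 unit/frame.
boxes = [(0, 0, 2, 2), (1, 0, 3, 2), (2, 0, 4, 2)]
appearance = np.ones((3, 4))  # dummy per-frame appearance features (T, D)
motion = trajectory_features(boxes)
fused = fuse_static_dynamic(appearance, motion)
```

      In this toy setup, the fused vector carries both what the object looks like (pooled appearance) and how it moves (average displacement), which is the kind of dynamic supplement to static content the abstract describes.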
