Abstract:
Video question answering requires models to understand, fuse, and reason over the multimodal data in videos, helping people quickly retrieve, analyze, and summarize complex video scenes; it has therefore become a hot research topic in artificial intelligence. However, existing methods lack the ability to capture the motion details of visual objects during feature extraction, which may lead to false causal reasoning. In addition, during data fusion and reasoning, existing methods lack effective active learning ability and thus struggle to acquire prior knowledge beyond feature extraction, which limits a model's deep understanding of multimodal content. To address these issues, we propose a multimodal knowledge-based active learning solution for video question answering. The solution captures the semantic correlations of visual targets across image sequences, together with their dynamic relationships to the surrounding environment, to establish a motion trajectory for each visual target. Static content is then supplemented with this dynamic content, providing more accurate video feature representations for data fusion and reasoning. Furthermore, through a knowledge auto-enhancement multimodal data fusion and reasoning model, the solution achieves self-improvement and focused logical reasoning over multimodal information, filling the gap in deep understanding of multimodal content. Experimental results show that our solution outperforms state-of-the-art video question answering algorithms, and extensive ablation and visualization experiments further verify its soundness.