
视频问答技术研究进展

包翠竹, 丁凯, 董建锋, 杨勋, 谢满德, 王勋

包翠竹, 丁凯, 董建锋, 杨勋, 谢满德, 王勋. 视频问答技术研究进展[J]. 计算机研究与发展, 2024, 61(3): 639-673. DOI: 10.7544/issn1000-1239.202220294
Bao Cuizhu, Ding Kai, Dong Jianfeng, Yang Xun, Xie Mande, Wang Xun. Research Progress of Video Question Answering Technologies[J]. Journal of Computer Research and Development, 2024, 61(3): 639-673. DOI: 10.7544/issn1000-1239.202220294
包翠竹, 丁凯, 董建锋, 杨勋, 谢满德, 王勋. 视频问答技术研究进展[J]. 计算机研究与发展, 2024, 61(3): 639-673. CSTR: 32373.14.issn1000-1239.202220294
Bao Cuizhu, Ding Kai, Dong Jianfeng, Yang Xun, Xie Mande, Wang Xun. Research Progress of Video Question Answering Technologies[J]. Journal of Computer Research and Development, 2024, 61(3): 639-673. CSTR: 32373.14.issn1000-1239.202220294

视频问答技术研究进展

基金项目: 国家自然科学基金项目(61972352,61902347,61976188,62272435,U22A2094);浙江省重点研发计划项目(2021C03150);浙江省省属高校基本科研业务费专项
    作者简介:

    包翠竹: 1990年生. 博士,讲师.CCF会员. 主要研究方向为计算机视觉、智能交通控制、智慧城市

    丁凯: 1995年生. 硕士研究生.CCF学生会员. 主要研究方向为计算机视觉、视觉问答、视频问答

    董建锋: 1991年生.博士,教授.CCF会员. 主要研究方向为多媒体理解、计算机视觉、机器学习、人工智能

    杨勋: 1989年生.博士,教授.CCF会员. 主要研究方向为跨媒体分析与推理、多媒体内容结构化理解、视觉媒体相关性关联建模

    谢满德: 1977年生.博士. 教授. CCF会员.主要研究方向为无线传感器网络、云计算、网络安全、群智感知

    王勋: 1967年生.博士, 教授.CCF杰出会员. 主要研究方向为可视媒体大数据技术、移动图形计算、计算机视觉、智能信息处理与可视分析

    通讯作者:

    谢满德(xiemd@zjgsu.edu.cn)

  • 中图分类号: TP391

Research Progress of Video Question Answering Technologies

Funds: This work was supported by the National Natural Science Foundation of China (61972352,61902347,61976188,62272435,U22A2094), the Key Research and Development Program of Zhejiang Province (2021C03150), and the Fundamental Research Funds for the Provincial Universities of Zhejiang.
    Author Bio:

    Bao Cuizhu: born in 1990. PhD, lecturer. Member of CCF. Her main research interests include computer vision, intelligent traffic control, and smart city

    Ding Kai: born in 1995. Master candidate. Student member of CCF. His main research interests include computer vision, visual question answering, and video question answering

    Dong Jianfeng: born in 1991. PhD, professor. Member of CCF. His main research interests include multimedia understanding, computer vision, machine learning, and artificial intelligence

    Yang Xun: born in 1989. PhD, professor. Member of CCF. His main research interests include cross-media analysis and reasoning, structured understanding of multimedia content, and visual media correlation modeling

    Xie Mande: born in 1977. PhD, professor. Member of CCF. His main research interests include wireless sensor networks, cloud computing, network security, and crowdsensing

    Wang Xun: born in 1967. PhD, professor. Distinguished member of CCF. His main research interests include visual media big data technology, mobile graphics computing, computer vision, and intelligent information processing and visual analysis

  • 摘要:

    视频问答 ( video question answering,VideoQA ) 根据视频内容自动回答自然语言问题,是视觉语言领域较为新兴的一个研究方向, 近年来引起了广泛关注. VideoQA问题的解决对于人机交互、智慧教育、智能交通、场景分析以及视频检索等各个领域都有着重大意义. VideoQA是一项具有挑战性的任务,因为它需要模型同时理解视频与文本内容来生成问题的答案. 首先,分析了VideoQA与图像问答 ( image question answering,ImageQA )的区别,总结了当下VideoQA相对于ImageQA所面临的4个挑战;然后,围绕着这些挑战对目前现有VideoQA模型进行了细致的分类,并重点介绍了模型的实现及不同模型之间的关联;接着详细介绍了在VideoQA中常用的基准数据集及目前主流算法在部分数据集上的性能,并进行了对比与分析;最后,讨论了该领域未来面临的挑战和研究趋势,为未来进一步研究提供一些思路.

    Abstract:

    VideoQA (video question answering), which automatically answers natural language questions according to the content of videos, is a relatively new research direction in the field of visual language and has attracted extensive attention in recent years. The solution of the VideoQA task is of great significance for human-computer interaction, intelligent education, intelligent transportation, scenario analysis, video retrieval, and other fields. VideoQA is a challenging task because it requires a model to understand the semantic information of both the video and the question to generate the answer. In this work, we analyze the differences between VideoQA and ImageQA (image question answering), and summarize four challenges faced by VideoQA relative to ImageQA. Then, the existing VideoQA models are carefully classified around these challenges according to their research methods. Following the classification, we introduce the background of each category and focus on the implementation of the models and the relationships between different models. After that, the benchmark datasets commonly used in VideoQA are summarized, the performance of current mainstream algorithms on some of these datasets is introduced in detail, and comparisons and analyses are carried out. Finally, the future challenges and research trends in this field are discussed, providing some ideas for further research.

  • 一直以来,高性能计算机(high performance computer, HPC)都是解决科学研究各领域实际问题的重要工具,依靠HPC的强大计算能力,能使很多求解空间极大的科研问题在现实可见的时间内完成解算.

    最近20年,科学研究已经从计算科学时代进入数据科学范式时代,科学家需要从海量的数据中去探索科学规律和突破科学发展瓶颈,这就意味着传统的用高密度计算去模拟复杂现象进行科学研究的方法需要创新发展,用高性能计算与人工智能相融合的新方法(HPC+AI)去解决实际问题,正逐渐成为一种行之有效的科研方法,例如2020年的戈登贝尔高性能计算应用奖就颁发给基于深度学习实现1亿原子分子动力学的应用[1].

    人工智能应用的开发和运行,往往依赖于人工智能编程框架,如TensorFlow[2],PyTorch[3]等,这些框架在本质上均是数据流计算系统,它们将神经网络模型组织成数据流图,并利用图节点融合、图剪枝、常量传播等技术进行图优化,然后再通过运行系统将图节点调度到实际的计算资源上执行. 换言之,数据流计算系统是支撑人工智能应用的重要基础软件,但是,要在国产高性能计算机上支持高效的数据流系统,则面临着严峻的挑战.

    从底层硬件的角度来说,国产异构众核处理器具有独特的复杂结构,新一代国产异构处理器sw26010pro[4-5]具有多级计算资源、多层次存储和多级互联网络结构,在体系架构上与传统的多核CPU、众核GPU以及专用的人工智能处理器相比有着本质的区别.

    要在sw26010pro上高效执行数据流系统,需重点解决2个问题:

    1)如何充分利用sw26010pro的众核计算资源. 计算核心阵列是国产异构众核处理器的性能来源,具有众多的精简核心和强大算力,但也存在着访存效率低、片上缓存小和管理复杂等实际问题. 为实现数据流图的高效执行,就需要实现自适应的众核阵列加速方法,能够自动加速数据流图中的关键节点,充分利用众核计算资源.

    2)如何设计高效的两级并行策略,充分利用国产异构众核处理器的全片计算资源.sw26010pro采用全片多核组集成的体系结构,每个核组都是同构的众核阵列,多核组之间可以共享全局存储,数据流图中的节点执行以单核组为基本单元. 要充分结合硬件结构特性,实现两级并行策略,通过相应的图优化和调度方法,确保多核组能够并行执行数据流图,提升系统性能.

    对于上层用户而言,传统高性能计算机的软件环境也很难满足HPC+AI领域应用的动态化和智能化需求,并且,用户更希望将重心放在上层算法设计上,而非底层体系结构相关的优化上.

    为此,本文提出了一种面向国产异构众核处理器的数据流计算系统swFLOWpro,支持使用TensorFlow接口构建数据流计算和深度学习典型模型,并实现了对用户透明的众核并行加速,可以支持数据流程序的高效开发,在运行时充分利用国产异构众核处理器的硬件能力.

    本文的主要贡献有4点:

    1)在国产异构众核处理器上构建了功能完备的数据流计算系统swFLOWpro,能够支持以深度学习为代表的数据流应用的开发和运行;

    2)设计并实现了一种专门针对国产异构众核处理器的核心计算加速引擎swHMAE,将之与数据流计算系统松耦合,实现自动化的众核并行加速及算子分析、调试功能;

    3)针对sw26010pro的多核组共享内存结构,设计了一种面向异构融合体系结构的两级并行策略,结合图分裂技术,能充分利用全片核组的计算资源;

    4)基于swFLOWpro进行Alexnet,ResNet,VGG,Inception等典型CNN神经网络模型训练测试,实验结果表明本文设计的数据流计算系统能够获得很好的异构众核并行加速效果.

    传统的冯•诺依曼计算机以控制流为执行模型,而数据流计算则采用了不同的思路,将程序组织成有向图,每个图节点表示一个算子,边则代表节点之间的依赖关系,数据在边上流动,当一个节点的所有输入数据均已就绪时,该节点就会被启动. 数据流计算由程序本身的数据依赖关系来激活计算,更有利于充分发挥其天然的可并行性.
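    为直观说明这种“数据就绪即触发”的执行方式,下面给出一个与具体系统无关的最小化Python示意(其中的Node结构与run_dataflow函数均为本文说明用的假设性实现,并非任何实际系统的代码):

    # 数据流执行模型的最小化模拟:当某节点的全部输入就绪时即触发执行(假设性示例)
    from collections import deque

    class Node:
        def __init__(self, name, fn, deps):
            self.name, self.fn, self.deps = name, fn, deps   # 节点名、计算函数、输入依赖

    def run_dataflow(nodes, inputs):
        values = dict(inputs)                                # 已就绪的数据,相当于在边上流动的数据
        pending = deque(nodes)
        while pending:
            node = pending.popleft()
            if all(d in values for d in node.deps):          # 所有输入均已就绪,触发该节点
                values[node.name] = node.fn(*[values[d] for d in node.deps])
            else:
                pending.append(node)                         # 输入未就绪,放回队列稍后重试
        return values

    # 用法示例:c = a + b,d = 2c,节点的执行顺序完全由数据依赖关系决定
    graph = [Node("d", lambda c: 2 * c, ["c"]),
             Node("c", lambda a, b: a + b, ["a", "b"])]
    print(run_dataflow(graph, {"a": 1, "b": 2})["d"])        # 输出 6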

    20世纪90年代,麻省理工学院提出一种基于数据流思想的处理器设计方案,该方案没有共享存储和寄存器的设计,数据直接在计算部件之间流动. 当一条指令的所有操作数均已就绪时,即可以进入执行状态. 这种体系架构能够充分挖掘程序的指令级可并行性,但也存在着运行开销大、并行粒度过小等实际问题,与传统计算机系统的天渊之别也限制了其进一步发展.

    相关研究[6]还提出了一种硬件集成数据流芯片和冯•诺依曼架构芯片的体系结构设计,程序在编译系统的支持下,可以在运行过程中动态调度到不同的芯片架构上去. 这种处理器架构设计比较新颖,但对编译和硬件实现的环境要求比较高.

    纯硬件的数据流计算系统面临着诸多问题,最本质的问题在于其与传统计算机软硬件生态无法兼容,发展严重受限. 于是,数据流计算机逐渐向与冯•诺依曼架构融合的方向发展,出现了“类数据流”计算机,这一类计算机融合了控制流和数据流的思想,将程序组织成一系列的宏指令或者代码块,每个宏指令或代码块内部采用数据流执行模式,而在宏指令和代码块之间依然采用传统的控制流思想进行组织管理,该类计算机包括TRIPS[7],T3,EVX等. 类数据流计算机将数据流的思想用于最底层的指令层面,在程序层面则保持着和传统架构相同的程序逻辑,例如EDGE[8]架构执行模型,就是将程序编译成由超块组成的控制流图,将超块内部的代码编译成数据流指令,数据直接在计算部件之间流动而不通过寄存器,但EDGE架构必须运行在专门的类数据流处理器架构上,通用性较差. 目前,最常见的数据流系统是在传统冯•诺依曼架构上实现的软件数据流系统.

    Codelet[9-10]执行模型由特拉华大学提出,它是一种针对E级计算机的需求而进行设计的细粒度并行、事件驱动的程序执行模型. Codelet模型从数据流执行模型中得到启发,结合传统的冯•诺依曼体系架构,形成了一种在通用计算机上运行的数据流程序执行系统.

    TensorFlow是一款具有数据流思想的软件计算系统,该系统运行于通用处理器架构上,并对众核GPU和人工智能专用芯片TPU提供后端支持. TensorFlow是人工智能领域非常热门的编程框架,它将人工智能的算法模型组织成数据流图,并通过运行时系统支持数据流图的高效映射和资源分配. TensorFlow给用户提供了丰富的API接口来构建数据流计算,对深度学习的支持也比较完善,不过缺乏对国产异构众核架构的后端支持.

    在国产异构众核处理器上,关于数据流计算系统的研究也一直在进行中.

    SunwayFlow[11]是基于神威太湖之光高性能计算机系统开发的数据流计算系统,该系统将Codelet执行模型移植到国产处理器上,并使用高性能共轭梯度基准测试(HPCG)作为测试数据,获得10.32倍的加速效果. 但SunwayFlow支持的Codelet模型适用范围有限,特别是对深度学习的支持严重不足.

    swCaffe[12-13]是面向国产异构众核处理器的深度学习编程框架,它在底层通过swDNN库支持众核加速,针对VGG-16有4倍的加速效果. 但是Caffe框架[14]的编程接口已逐渐被淘汰,而swCaffe要实现众核加速,对模型的参数也有严格的限制,已无法适应数据流计算和深度学习应用的实际需求.

    swFLOW[15]是2021年推出的针对国产异构众核处理器的数据流计算系统,该系统重构了TensorFlow框架,支持在sw26010处理器上执行数据流计算,针对典型神经网络模型有10.42倍的加速. 不过,swFLOW在功能上只支持TensorFlow的C++接口,在优化设计上重点考虑面向大规模计算资源的分布式训练,缺乏针对单进程的深度优化,也没有针对全片多核组运行模式的优化支持,实际使用效果有待增强.

    针对上述多款数据流计算系统软件的缺陷,本文设计并实现了面向国产异构众核处理器sw26010pro的新一代数据流计算系统swFLOWpro.该系统在编程接口支持上复用了TensorFlow的前端模块,可以完全兼容TensorFlow的Python和C++编程接口,提升系统易用性,在后端则通过独立的核心计算加速引擎模块来提供针对数据流图节点的执行加速,除此之外,还针对sw26010pro的多核组共享内存设计,开发了一种面向异构融合的两级并行方法,从而提升全片计算资源的应用效率.
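    作为编程接口兼容性的一个直观示例,下面给出一段标准的TensorFlow图模式Python代码(假设性示例,仅用于说明swFLOWpro所兼容的“前端构图、后端执行”编程方式,并非swFLOWpro官方用例),其构建的MatMul、BiasAdd、Softmax等图节点与后文图5中的示例图类似:

    # 一段标准的 TensorFlow 图模式程序(假设性示例):前端构建数据流图,后端按依赖关系执行
    import numpy as np
    import tensorflow.compat.v1 as tf

    tf.disable_eager_execution()                      # 使用图模式,显式构建数据流图

    x = tf.placeholder(tf.float32, [None, 784], name="input")
    w = tf.Variable(tf.random_normal([784, 10]), name="weight")
    b = tf.Variable(tf.zeros([10]), name="bias")
    logits = tf.nn.bias_add(tf.matmul(x, w), b)       # 生成 MatMul、BiasAdd 图节点
    probs = tf.nn.softmax(logits)                     # 生成 Softmax 图节点

    with tf.Session() as sess:                        # 运行时系统按数据依赖关系调度图节点
        sess.run(tf.global_variables_initializer())
        out = sess.run(probs, feed_dict={x: np.random.rand(32, 784).astype(np.float32)})
        print(out.shape)                              # (32, 10)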

    Megatron-LM[16]主要讨论如何在大规模GPU集群上通过tensor/pipeline/data等多种并行模式高效实现大模型的训练,通过并行模式的混合能够提升10%的数据吞吐量,在3072个GPU上训练1万亿参数模型,单GPU峰值效率达到52%.

    Gspmd[17]提出一种基于编译器的自动化机器学习并行系统,可以在单节点代码上通过添加编译指示实现自动化并行代码生成,在2048块TPUv3上达到了50%~62%的计算利用率.

    DAPPL[18]面向大模型提出了结合数据并行和流水线并行方法的并行训练框架,主要解决的问题是针对模型结构和硬件配置决策最优并行策略,如何调度数据流计算的不同流水线阶段.

    Alpa[19]针对分布式深度学习训练提出了算子内和算子间并行策略,通过系统化的方式将分布式并行策略的优化空间结构化,并在这个优化空间中寻找最优策略并实现自动化.Alpa以计算图为输入,输出并行方案,主要考虑如何划分子图和计算任务调度.

    目前,面向数据流计算的相关研究大多是针对大模型和大规模并行系统,专注于如何切割数据流图并将其调度到各计算节点上,本文则主要针对sw26010pro的异构众核结构和普通深度学习模型,专注于单处理器内部的计算流程,通过算子内和算子间的两级并行策略,高效利用单处理器计算能力. 在后续工作中,swFLOWpro会在寻找最优并行策略以及调度模型的优化上加强研究.

    本节主要介绍国产异构众核处理器sw26010pro的结构特点,以及swFLOWpro的整体架构和工作流程.

    sw26010pro是一款国产异构众核处理器,它包含6个核组(core group,CG),核组之间通过片上环网互连. 每个核组包含2种异构核心:一种是管理核心(management processing element,MPE),另一种是计算核心(computing processing element,CPE),1个MPE和1个8×8的CPE阵列组成1个核组. 一般而言,MPE主要负责计算任务、全局内存和运算核心的管理;CPE负责计算任务的执行,每个CPE通过一个软件管理的片上便签存储器(LDM)来提升访存效率. sw26010pro结构如图1所示.

    图  1  sw26010pro结构
    Figure  1.  The structure of sw26010pro

    sw26010pro采用SW64自主指令集设计. 其中, MPE具有32 KB L1指令缓存、32 KB L1数据高速缓存和512 KB L2高速缓存;CPE支持512 b的SIMD运算,支持双精度、单精度和半精度浮点及整数等多种数据类型的向量运算,每个CPE具有独立的指令缓存和片上LDM存储,其中LDM可以配置为L1数据缓存,也可以配置为用户管理的局存空间,支持通过DMA方式实现LDM和全局内存之间的数据传输;支持通过RMA方式实现不同CPE之间的LDM存储传输.

    sw26010pro全处理器包含6个同构的核组,6核组之间可以共享全局内存. 通常情况下,1个进程运行在1个核组上,多个核组之间通过MPI消息进行通信,但这样会导致单进程可用的内存空间和计算能力都较小,频繁的MPI通信也会造成性能损失. 事实上,通过多核组的共享全局内存,可以结合多线程管理和核组资源分配,实现全片视角的统一编程,这样能大幅度提升单进程的可用内存空间和计算能力,减少进程间通信造成的性能损失.

    与常规的处理器设计不同,sw26010pro将更多的硬件逻辑用于计算,从而最大程度地提升计算密度. 精简的CPE核心设计使处理器具有很强的计算能力,但单核访存能力较弱. sw26010pro提供了用户可以显式管理的LDM存储来弥补访存与计算能力不匹配的问题,从而支持用户充分挖掘异构众核的计算能力. 不过,这种设计模式就意味着程序的高效运行需要更加复杂的优化策略和更加全面的算法改造.

    对于数据流计算和深度学习领域的编程用户来说,他们更关注的是数据流图的结构、模型的构造以及训练模型的超参数调整等上层算法设计,而非底层硬件细节和体系结构相关优化技术.

    为此,swFLOWpro的主要设计目标就是构建国产异构众核处理器与用户之间的桥梁,提供可移植性强、功能丰富的编程接口,并将底层硬件细节对用户透明,实现自动化的众核并行加速.

    swFLOWpro数据流计算系统的整体架构图如图2所示.swFLOWpro系统可以划分为2个子模块:前端模块和后端模块. 中间层由C-API桥接.

    图  2  swFLOWpro整体架构
    Figure  2.  The overall architecture of swFLOWpro

    前端模块是一个支持多语言的编程环境,它提供基于数据流图的编程模型,方便用户使用TensorFlow 的Python和C++编程接口构造各种复杂的计算图,从而实现各种形态的模型搭建.

    C-API是桥接前端模块和后端模块的中间层次,主要是通过SWIG(simplified wrapper and interface generator)机制支持前端多语言编程环境与C++实现的后端模块之间的通道.

    为保证系统的易用性和提升深度学习程序的可移植性,swFLOWpro框架的前端模块和C-API复用了TensorFlow框架的相应模块,主要是为了保持对TensorFlow编程的兼容性.

    后端模块则是体系结构相关的运行模块,也是swFLOWpro针对sw26010pro架构特点重点开发的模块.swFLOWpro的后端模块主要包括数据流图优化、运行时系统、算子(OP)实现层等子模块. 其中,数据流图优化模块支持面向异构众核处理器的混合精度训练优化和节点融合优化,混合精度训练优化在数据流图中插入数据类型转化节点,将单精度运算转换为效率更高、精度更低的半精度运算,而在更新参数节点等对精度要求更高的节点之前,再将半精度转化为单精度,从而支持混合精度训练;图节点融合优化则将多个图节点融合,形成更大的计算单元,减少内存管理开销,提升运行效率. 运行时系统主要负责计算图节点的管理、调度、执行以及内存分配,根据图中依赖关系依次执行各个节点. OP实现层则是针对sw26010pro的存储层次和结构特点,将OP的定义和执行解耦,通过独立的异构众核加速引擎(swHMAE)实现对关键性能OP的众核加速,这一部分将在2.3节中详细介绍.
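    以其中的混合精度图优化为例,下面用Python给出“插入数据类型转换节点”这一图变换的简化示意(节点的字典表示、FP16_OPS等集合划分均为说明用的假设,实际系统中的图结构与算子划分以实现为准):

    # 混合精度图优化的简化示意:在适合半精度的节点前插入 FP32->FP16 转换,
    # 在精度敏感节点前插回 FP16->FP32 转换(假设性示例,非实际系统代码)
    FP16_OPS = {"Conv2D", "MatMul", "Relu"}           # 假设适合半精度执行的算子集合
    FP32_OPS = {"ApplyGradientDescent"}               # 假设需保持单精度的精度敏感算子

    def insert_casts(graph):
        """graph 为节点列表,节点形如 {'name':…, 'op':…, 'inputs':[…]},返回插入Cast节点后的新图."""
        out = []
        for node in graph:
            new_inputs = []
            for src in node["inputs"]:
                if node["op"] in FP16_OPS:
                    cast = {"name": src + "_fp16", "op": "CastToFP16", "inputs": [src]}
                elif node["op"] in FP32_OPS:
                    cast = {"name": src + "_fp32", "op": "CastToFP32", "inputs": [src]}
                else:
                    new_inputs.append(src)
                    continue
                out.append(cast)                      # 在原输入边上串入类型转换节点
                new_inputs.append(cast["name"])
            out.append({**node, "inputs": new_inputs})
        return out

    g = [{"name": "conv1", "op": "Conv2D", "inputs": ["image", "filter"]},
         {"name": "update", "op": "ApplyGradientDescent", "inputs": ["grad"]}]
    for n in insert_casts(g):
        print(n)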

    swHMAE是一个独立于swFLOWpro系统之外的独立模块,其设计目的是为数据流计算系统提供一个松耦合的、体系结构相关深度优化的核心计算加速框架. 框架整体结构如图3所示.

    图  3  swHMAE 结构
    Figure  3.  The structure of swHMAE

    swHMAE提供了一系列性能关键计算的调用接口,这些接口在swFLOWpro的算子实现层进行调用,而其真正实现则集成于一个独立的动态库中.

    swHMAE提供的这些接口是完全虚拟化的,仅用来描述要完成哪种运算和需要哪些参数,swHMAE可以向上支持不同的人工智能编程框架或数据流系统的图节点实现模块,向下则可以调用多种众核加速算法库,也可以集成用户自定义的众核算法,具有很好的可扩展性.

    在swHMAE中,针对不同的计算类型,主要完成2方面的工作:1)收集核心计算的参数. 2)根据参数类型、参数特性及输入规模,判断是否适合使用众核加速,如不适合,则该API返回失败,swFLOWpro将调用默认的实现算法;否则,swHMAE将根据不同的参数类型和规模自适应地选择最优的异构众核加速算法.

    swHMAE支持的核心计算类型涵盖了数据流计算常见的计算类型,核心计算类型既有深度学习领域的常见计算,例如卷积、矩阵乘、激活、归一化等,这类计算的众核加速主要是通过swDNN,swBLAS,Sw_OPs等第三方库来支持,又有一些更通用的数据流计算节点类型,如批量数据的基础运算、数据的Padding,tile,slice等访存操作,以及其他一些定制的计算类型.

    swHMAE的工作原理算法如算法1所示:

    算法1. swHMAE工作原理算法.

    输入:计算类型OP-type, 张量t1,t2,…,数据类型 data-type,常量参数params

    输出:计算结果张量t-results.

    ① if notSuitforMC(OP-type, params, t1, t2, …)
    ②  return false; /* 如果该OP不适合众核加速,返回false,执行swFLOWpro的默认计算模式 */
    ③ end if
    ④ timing_or_debug_this_op_start();
    ⑤ if OP-type ∈ {SW_Conv, SW_Activate, SW_Pooling}
    ⑥  t-results = swDNN(OP-type, params, t1, t2, …);
    ⑦ else if OP-type ∈ {SW_Matmul}
    ⑧  t-results = swBLAS(OP-type, params, t1, t2, …);
    ⑨ else
    ⑩  t-results = MC_accelete_op(OP-type, params, t1, t2, …);
    ⑪ end if
    ⑫ timing_or_debug_this_op_end();
    ⑬ return true.

    swHMAE是面向国产异构众核处理器的数据流计算后端. 作为一个独立模块,它将关键计算的众核加速与数据流系统的整体框架解耦:既能够高效利用swDNN,swBLAS,sw_OPs等众核计算库,又由于本身集成了一系列众核优化算子,能够对更多的核心计算进行众核加速.
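    为便于理解,下面用Python给出算法1所述分派逻辑的等价示意(其中suit_for_manycore、swdnn_run、mc_accelerate_op等函数名均为虚构占位,这里用简单桩实现使示例可独立运行,实际的判断条件与库调用以swHMAE实现为准):

    # 算法1所述 swHMAE 分派逻辑的 Python 示意(假设性示例,所有函数均为虚构桩实现)
    def suit_for_manycore(op_type, params, *tensors):
        # 桩实现:仅按输入规模粗略判断是否值得使用众核加速
        return sum(len(t) for t in tensors) >= params.get("mc_threshold", 1024)

    def swdnn_run(op_type, params, *tensors):        return ("swDNN", op_type)
    def swblas_run(op_type, params, *tensors):       return ("swBLAS", op_type)
    def mc_accelerate_op(op_type, params, *tensors): return ("MC_OP", op_type)

    def hmae_dispatch(op_type, params, *tensors):
        if not suit_for_manycore(op_type, params, *tensors):
            return None                               # 不适合众核加速,回退 swFLOWpro 默认实现
        if op_type in {"SW_Conv", "SW_Activate", "SW_Pooling"}:
            return swdnn_run(op_type, params, *tensors)       # 卷积/激活/池化类走 swDNN
        if op_type == "SW_Matmul":
            return swblas_run(op_type, params, *tensors)      # 矩阵乘走 swBLAS
        return mc_accelerate_op(op_type, params, *tensors)    # 其余走通用众核加速算子

    print(hmae_dispatch("SW_Conv", {"mc_threshold": 4}, [1.0] * 8))   # ('swDNN', 'SW_Conv')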

    swHMAE针对非计算密集类运算实现了众核加速算法,其主要思想是通过数据分割将运算任务分配到各CPE上执行,通过DMA数据传输机制将具有局部性的数据显式地搬运到CPE的片上内存LDM中,并通过2个数据传输缓冲的动态切换,实现数据传输与数据计算的并行操作,其算法思想如图4所示.

    图  4  CPE双缓冲算法思想
    Figure  4.  Double buffer algorithm idea for CPE
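    结合图4的思想,下面给出双缓冲流水组织方式的Python模拟(假设性示例:真实硬件上“装载下一块”由异步DMA完成并与计算重叠,这里仅用顺序代码示意两个缓冲区的交替使用):

    # CPE 双缓冲思想的简化模拟:两个缓冲区交替承担“装载下一块”与“计算当前块”(假设性示例)
    def double_buffer_run(chunks, load, compute):
        """load 模拟 DMA 把一块数据搬入 LDM,compute 对 LDM 中的数据块进行计算."""
        results = []
        buf = [None, None]
        buf[0] = load(chunks[0])                      # 预取第 0 块
        for i in range(len(chunks)):
            if i + 1 < len(chunks):
                # 真实硬件上此处为异步 DMA,与下面的 compute 并行执行
                buf[(i + 1) % 2] = load(chunks[i + 1])
            results.append(compute(buf[i % 2]))       # 计算当前缓冲区中的数据块
        return results

    # 用法示例:分块求平方和
    data = [list(range(k, k + 4)) for k in range(0, 8, 4)]
    print(double_buffer_run(data, load=list, compute=lambda b: sum(x * x for x in b)))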

    除此之外,swHMAE还可以通过多种方式对关键计算进行调试、错误定位和性能分析,进一步提升易用性.

    swHMAE的松耦合和模块化设计使得用户可以更加方便地集成新的众核计算到swFLOWpro系统中去. 事实上,swHMAE还可以支持其他的数据流计算系统,其仅需要在原始系统中做极少量的修改.

    在异构融合的众核处理器上执行数据流图的基本流程为:MPE负责数据流图的生成、优化和调度管理;在执行过程中,将已满足执行需求的图节点分配到众核阵列上执行.

    有2种任务分配方法可以考虑:1)将每个节点调度到1个CPE上,CPE阵列协同完成整个数据流图的执行过程;2)将CPE阵列视为一个整体部件,所有计算核心共同完成数据流图中的一个节点.

    第1种任务分配方法与异构融合众核架构的适应性并不好,其主要原因有3点:1)单CPE的访存能力有限,其LDM的容量大小也很难承载一个完整的图节点计算逻辑,比如卷积、矩阵乘等常用算子,在单CPE上执行效率较差;2)数据流图的可并行性有限,考虑某些具有强相关性的数据流图,每个节点都依赖于上一个节点的计算结果,则程序在这种模式下执行的效率就会很差,因为大部分时间内CPE可能因为依赖另一个CPE的计算结果而处于等待状态;3)负载均衡问题,由于每个数据流图节点运算量相差较大,保证各计算核心的负载均衡也是个难以解决的问题.

    本文主要采用第2种任务分配方法,也就是将CPE阵列视为整体部件,所有CPE协同完成一个图节点的执行过程,这样每个CPE的计算任务量都在可以接受的范围之内,而在每个图节点内部,主要通过数据分割的方式将输入数据映射到各个CPE上,这样能保证LDM空间够用和保证各计算节点的负载均衡性. 并且,由于并行发生在图节点内部,整体效率不会受限于数据流图本身的可并行性.

    图5是在sw26010pro的单核组上运行一个数据流图的示例.

    图  5  面向单核组的数据流调度示例
    Figure  5.  Dataflow scheduling example for single CG

    输入数据的后继图节点是Reshape,该节点是为了改变输入形状,属于功能类算子,所以将其调度到MPE上执行即可;其后的Matmul,Biasadd,Softmax都是计算密集的图节点,需要调度到CPE阵列上进行众核并行计算,例如,CPE在执行Matmul图节点的时候,首先将矩阵进行分块,每个CPE执行子矩阵乘法运算,再通过CPE阵列内部的RMA操作进行全局通信,获得原始矩阵的乘法运算结果.
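    以Matmul节点为例,下面用NumPy给出“矩阵分块映射到8×8 CPE阵列”这一划分方式的串行模拟(假设性示例:真实执行中各子块由不同CPE并行计算并通过RMA交换数据,这里仅验证分块划分与归并结果的正确性):

    # 矩阵乘分块映射到 8×8 CPE 阵列的划分示意(假设性示例,用 NumPy 串行模拟)
    import numpy as np

    def blocked_matmul(A, B, grid=8):
        """行、列各切 grid 份,子块 (i, j) 对应阵列中第 i 行第 j 列 CPE 的局部任务."""
        n, m = A.shape[0], B.shape[1]
        rs, cs = n // grid, m // grid
        C = np.zeros((n, m))
        for i in range(grid):                         # 行块 -> CPE 行坐标
            for j in range(grid):                     # 列块 -> CPE 列坐标
                r = slice(i * rs, (i + 1) * rs)
                c = slice(j * cs, (j + 1) * cs)
                # 真实硬件上各 CPE 并行计算各自的子块;此处用循环串行模拟
                C[r, c] = A[r, :] @ B[:, c]
        return C

    A, B = np.random.rand(64, 64), np.random.rand(64, 48)
    assert np.allclose(blocked_matmul(A, B), A @ B)   # 分块结果与直接相乘一致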

    sw26010pro异构众核芯片采用多核组设计,处理器内部包含6个同构的核组,每个核组都有1个MPE和1个8×8的CPE阵列. 因此,需要在6核组结构上实现更高层次的并行,以充分利用全片计算资源.

    在单核组内部,我们将1个图节点分配到1个MPE或者1个CPE阵列上执行,实现了低层次的图节点内并行;在基于全片视角的多核组上,利用6个等价队列分别维护由上层图计算过程产生的计算任务;在运行核组选择过程中采用Round-Robin的轮询调度策略;在计算任务选择中采用先入先出(FIFO)方法,进而支持高层次的图节点间并行. 这就是本节提出的两级并行策略,该策略能够充分适应sw26010pro的异构融合架构.
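    下面给出这一“6个等价队列 + Round-Robin选核组 + FIFO取任务”调度组织方式的Python示意(假设性示例,仅表达任务入队与出队的逻辑,未涉及真实的核组绑定与执行):

    # 节点间并行的任务分发示意:Round-Robin 选核组入队,各核组按 FIFO 取任务(假设性示例)
    from collections import deque

    NUM_CG = 6
    queues = [deque() for _ in range(NUM_CG)]         # 每个核组一个等价的就绪任务队列
    _next_cg = 0

    def submit(node):
        """上层图计算产生的就绪节点,按 Round-Robin 轮询选择核组入队."""
        global _next_cg
        queues[_next_cg].append(node)
        _next_cg = (_next_cg + 1) % NUM_CG

    def run_one_round():
        """每个核组按先入先出(FIFO)取出一个任务执行,此处以打印代替真实执行."""
        for cg, q in enumerate(queues):
            if q:
                print(f"CG{cg} 执行图节点 {q.popleft()}")

    for name in ["Conv2D_0", "MatMul_0", "Relu_0", "Conv2D_1", "MatMul_1", "Relu_1", "Conv2D_2"]:
        submit(name)
    run_one_round()       # CG0~CG5 各取一个任务,Conv2D_2 留待下一轮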

    值得注意的是,图节点间的并行要求图节点之间没有数据依赖关系,但实际上一般单输入的数据流计算图可并行性并不高,如果将不同的图节点调度到不同核组上,由于图节点之间的数据依赖关系,会导致部分核组处于空闲状态,需要等待其他核组的计算结果才能开始计算.

    为此,本文设计了一种图分裂优化方法,首先将输入数据进行平均分割,分割之后的每个输入都进行相同的数据流图执行流程,在输出结果时再进行归并,从而生成并行性更好的数据流计算图.

    以图5的数据流图为例,将split值设置为2,经过图分裂之后的数据流图如图6所示.

    图  6  经过图分裂之后的数据流调度示例
    Figure  6.  Dataflow scheduling example after graph splitting

    经过图分裂之后,数据并行输入到不同的数据流子图中,每个子图都是原数据流图的一个复制,各个子图之间没有强相关性,从而具有很好的可并行性,可以映射到不同的处理器分区上执行.

    图分裂是一种与体系结构无关的图变换技术,分裂值split可以调整,以适应不同的硬件体系结构. 如果众核处理器集成更多的核组数,只需要提升分裂值,无需改变整体算法就能充分利用硬件计算资源.
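    下面给出图分裂变换的Python简化实现示意(假设性示例:图的字典表示、Slice/Concat节点的命名均为说明用的假设,实际系统中的切分与归并算子以实现为准):

    # 图分裂变换的简化示意:切分输入、复制子图、归并输出(假设性示例)
    def split_graph(graph, input_name, output_name, split=2):
        new_graph, merged_inputs = [], []
        for s in range(split):
            suffix = f"_split{s}"
            # 为第 s 份输入增加切分节点
            new_graph.append({"name": input_name + suffix, "op": "Slice",
                              "inputs": [input_name]})
            # 复制原数据流子图,节点名与边均加上该份的后缀
            for node in graph:
                new_graph.append({"name": node["name"] + suffix, "op": node["op"],
                                  "inputs": [i + suffix for i in node["inputs"]]})
            merged_inputs.append(output_name + suffix)
        # 各子图的输出在末尾归并
        new_graph.append({"name": output_name + "_merge", "op": "Concat",
                          "inputs": merged_inputs})
        return new_graph

    g = [{"name": "matmul", "op": "MatMul", "inputs": ["x"]},
         {"name": "softmax", "op": "Softmax", "inputs": ["matmul"]}]
    for node in split_graph(g, "x", "softmax", split=2):
        print(node)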

    在具体实现上,本文采用多线程机制来管理图节点的调度,根据核组数来确定线程个数. 在sw26010pro上会启动6个线程来执行数据流图,每个线程绑定在1个核组上运行,这样能保证各线程不存在资源冲突问题.

    调度器将所有图节点组织成任务池,并记录每个节点的前继节点. 在执行过程中,一个图节点处于不可用、可用、执行中、完成这4种状态中的一种. 每种状态对应一个任务池.

    初始情况下,将没有前继节点的图节点状态设置为“可用”,其余节点状态均设置为“不可用”. 线程函数从任务池里通过抢占方式获取一个图节点任务,如果该图节点已处于可用状态(所有前继节点均已完成),则执行该节点,并将该节点状态设置为“执行中”,完成后则将状态设置为“完成”. 值得注意的是,线程选择下一个执行节点时,优先从该节点的后继节点中选取,如果后继节点不可用,则从该节点前继节点的其他后继节点中选择. 这种搜索方法可以使得单个相对独立的数据流子图在一个线程内部完成.

    图节点状态变换关系如图7所示.

    图  7  图节点状态变换图
    Figure  7.  Graph node state transformation diagram
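    结合图7的状态变换关系,下面给出图节点状态管理与取任务逻辑的单线程Python示意(假设性示例,省略了多线程抢占、核组绑定以及“优先选取后继节点”的搜索策略):

    # 图节点四状态(不可用/可用/执行中/完成)管理的单线程模拟(假设性示例)
    UNAVAILABLE, READY, RUNNING, DONE = "不可用", "可用", "执行中", "完成"

    def run_graph(deps):
        """deps: {节点名: 前继节点列表}. 前继全部完成的节点转为“可用”并被取出执行."""
        state = {n: (READY if not d else UNAVAILABLE) for n, d in deps.items()}
        while any(s != DONE for s in state.values()):
            for n, d in deps.items():                            # 刷新状态:前继全部完成 -> 可用
                if state[n] == UNAVAILABLE and all(state[p] == DONE for p in d):
                    state[n] = READY
            ready = [n for n, s in state.items() if s == READY]
            if not ready:
                break                                            # 图中存在环或依赖缺失
            n = ready[0]
            state[n] = RUNNING                                   # 可用 -> 执行中
            print("执行图节点", n)
            state[n] = DONE                                      # 执行中 -> 完成

    run_graph({"input": [], "reshape": ["input"], "matmul": ["reshape"], "softmax": ["matmul"]})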

    本文选择6种典型神经网络模型作为数据流计算的输入,通过TensorFlow编程接口编写数据流程序,实现这些模型的训练过程,这些模型及其变种也是HPC+AI领域应用经常使用的模型. 具体模型信息如表1所示.

    表  1  6种典型神经网络模型
    Table  1.  Six Typical Neural Network Models
    典型模型 输入数据 参数量 计算量/GFLOP
    Alexnet[20] 227×227×3 61.1×10⁶ 0.77
    VGG16[21] 224×224×3 138.36×10⁶ 15.61
    ResNet50[22] 224×224×3 25.56×10⁶ 4.14
    ResNet101 224×224×3 44.55×10⁶ 7.87
    Inception3[23] 299×299×3 21.16×10⁶ 5.75
    Inception4[24] 299×299×3 41.22×10⁶ 10.48

    测试硬件平台为sw26010pro处理器,其包含6个核组,6个MPE和384个CPE,全片主存空间大小为96 GB,每个CPE的片上高速缓存LDM大小为256 KB.

    软件环境为swFLOWpro数据流计算系统、swHMAE核心计算加速引擎,以及swPython编程环境.

    本文选择众核加速比ManyAccRatio作为主要的性能评价指标,其定义为:

    ManyAccRatio = MPE_time / CPE_time × 100%,

    其中MPE_time表示在MPE主核上的运行时间,CPE_time表示在单核组CPE阵列上的运行时间. 由于sw26010pro结构的特殊性,其与GPU,TPU等人工智能专用芯片的性能对比意义不大,而众核加速比可以体现swFLOWpro在sw26010pro独特的异构融合结构上的适配性和优化效果.

    本文使用swFLOWpro构建了6种典型模型,并统计了模型中所有核心计算(数据流图节点)类型,选择7种典型核心计算类型,通过swHMAE引擎进行众核加速. 具体统计信息如表2所示.其中Conv2D, Conv2DBackpropFilter, Conv2DBackpropInput都是卷积类计算,Matmul是矩阵乘计算,Relu是激活类计算,Poolmax是池化类计算,ApplyGradientDescent是训练更新参数计算.

    表  2  典型核心计算
    Table  2.  Typical Core Computing
    核心计算类型 分类
    Conv2D SW_Conv
    Conv2DBackpropFilter SW_Conv
    Conv2DBackpropInput SW_Conv
    Matmul SW_Matmul
    Relu SW_Activate
    Poolmax SW_Pooling
    ApplyGradientDescent SW_OPs

    针对表1中的6种典型模型,统计了各典型核心计算在sw26010pro单核组CPE阵列上的运行时间,并与swFLOWpro未经众核优化的MPE运行时间进行对比,得到众核加速比. 详细测试数据如表3所示,计算得到的各类型典型核心计算众核加速比如图8所示.

    表  3  典型模型中的典型核心计算测试时间 (单位: ms)
    Table  3.  Test Time of Typical Core Computing in Typical Models (unit: ms)
    模型 Conv2D Conv2DBackpropFilter Conv2DBackpropInput Matmul Relu Poolmax ApplyGradientDescent
    Alexnet-MPE 153470 141700 51100 20260 352 340 528
    Alexnet-CPE 611 1320 238 548 11 23 21
    VGG16-MPE 1629210 1114340 718380 37430 5260 2120 1080
    VGG16-CPE 3140 1910 2310 1120 119 172 50
    ResNet50-MPE 250690 285090 152940 181 2170 373 250
    ResNet50-CPE 652 1690 1690 9 56 24 14
    ResNet101-MPE 578400 552900 303010 182 3410 372 449
    ResNet101-CPE 1060 2200 2220 9 90 24 25
    Inception3-MPE 489790 443940 216850 86 2140 3730 215
    Inception3-CPE 1160 1480 1540 3 61 278 16
    Inception4-MPE 1019110 1010510 472480 21 3610 6370 413
    Inception4-CPE 2670 2960 3660 0.8 101 503 26

    卷积类运算是实验所选6种典型模型中的关键运算,也是swHMAE实现众核加速的重点. swHMAE会根据输入规模和相关参数,自适应选择swDNN库中最优的算法实现. 由图8的实验结果可以看出,Conv2D的众核加速比达250~545,Conv2DBackpropFilter的众核加速比达107~583,Conv2DBackpropInput的众核加速比达90~310,加速效果良好.

    图  8  不同典型模型的卷积类核心计算众核加速比
    Figure  8.  The many-core acceleration ratios of convolutional core computing for different typical models

    其他核心计算类型的众核加速比测试数据如图9所示.

    图  9  不同典型模型的其他核心计算众核加速比
    Figure  9.  The many-core acceleration ratios of other core computing for different typical models

    针对矩阵乘类核心计算,swHMAE从swBLAS库中自适应选择众核算法. 测试表明,矩阵乘核心计算的众核加速比仅有26.1~38.7,由表3可以看出,本文选择的典型模型都是卷积类神经网络,矩阵乘的计算量很小,不能充分发挥CPE从核阵列的全部计算能力. 除此之外,swBLAS库中矩阵是按列优先模式存储,在接入模型时还需要先进行矩阵转置. 所以,矩阵乘的实际众核加速比效果远低于卷积类算子,在后续工作中可以针对矩阵转置进行优化.

    针对Relu激活类运算,swHMAE通过swDNN库进行加速,众核加速比达到26.1~38.7.

    除了计算密集类运算之外,模型中也会用到一些其他算子,这类算子虽然计算量小,但如果不进行众核优化,则会成为性能瓶颈. 如本文实验选择的更新参数操作(ApplyGradientDescent),是模型训练中常见的算子类型,但缺乏专属的算法库支持. 本文选择在swHMAE中直接集成其众核优化算法,实验表明众核加速比达13.9~25.2.

    测试结果表明,在sw26010pro上,卷积类运算的众核加速比要远高于其他运算类型,这主要是因为国产异构众核的架构设计对于卷积这类计算密集类运算的适应性更好.

    本文使用swFLOWpro+swHMAE运行6种典型模型的训练过程,单步训练batch大小统一设置为32.

    实验分别测试在sw26010pro的单MPE和单CPE阵列上的单步训练时间,并计算众核加速比. 测试数据如表4所示.

    表  4  典型模型的单步训练测试数据
    Table  4.  Single Step Training Test Data of Typical Models
    典型模型 MPE运行时间/s CPE运行时间/s 众核加速比
    Alexnet 379.8 3.1 123
    VGG16 3525.5 10.2 346
    ResNet50 973.6 8.2 119
    ResNet101 1876.6 12.1 155
    Inception3 1373.1 11.9 115
    Inception4 2996.9 20.1 149

    由图10可见,VGG16模型的众核加速比最高,达到346,其余的模型加速比相差不大,在115~155之间.

    图  10  典型模型的众核加速比
    Figure  10.  Many-core acceleration ratios of typical models

    模型的性能与模型中各类型核心计算的性能紧密相关,由4.1节测试结果可知,在sw26010pro上,卷积类运算的众核加速比要远高于其他运算,所以卷积类运算占比较高的模型,在sw26010pro上的整体加速比也更高.

    本文统计了在6种典型模型中,卷积类和非卷积类核心计算的运行时间占比,如表5所示. 这6种典型模型都属于卷积神经网络,它们的卷积类运算占比为82.5%~97.4%.

    表  5  典型模型的卷积类和非卷积类核心运算占比 (单位: %)
    Table  5.  Core Computing Proportion of Convolutional and Non-Convolutional of Typical Models (unit: %)
    典型模型 Conv2D Conv2DBackpropFilter Conv2DBackpropInput 非卷积类运算
    Alexnet 11.5 27.2 54.8 6.5
    VGG16 11.1 37.2 49.1 2.6
    ResNet50 9.2 25.3 48 17.5
    ResNet101 10.1 23.7 51.3 14.9
    Inception3 11.8 31.6 40.5 16.1
    Inception4 11.6 29.7 45.2 13.5

    表5中,VGG16的卷积类运算占比最高,达到了97.4%(11.1%+37.2%+49.1%),所以这个模型的众核加速比也最高,Alexnet的卷积类运算虽然占比高达93.5%(11.5%+27.2%+54.8%),但由于其卷积类运算的计算量较小,不能充分发挥sw26010pro的计算能力,所以整体众核加速比只有123.

    实验表明,针对典型模型的训练过程,相比仅使用MPE的原始运行模式,swFLOWpro+swHMAE有显著的众核加速效果,对卷积类计算占比较高的模型(如实验中的VGG16)效果尤为明显.

    我们将sw26010pro的单处理器(包含6个核组)作为一个执行单元,测试6种典型模型经过面向全片的两级并行优化之后的加速效果.

    首先,测试不使用图分裂技术的6个核组并行加速效果,在这种模式下,6个核组的利用效率受限于不同模型构建出的数据流图本身的可并行性,测试数据如图11所示.

    图  11  不使用图分裂技术的典型模型全片加速比
    Figure  11.  Full chip acceleration ratios of typical models without graph splitting technology

    加速比最高的是Inception4模型,达1.49;最低的是Alexnet模型,仅为1.19. 这是因为Inception模型本身的数据流图具有不错的可并行性,而结构简单的Alexnet模型可并行性并不好.

    整体而言,在不使用图分裂的情况下,6核组的加速比较低,这是因为计算图的核心计算节点之间存在依赖关系,导致高层次的节点间并行不能同时进行计算,限制了并行效果,这也是本文提出图分裂技术的主要原因.

    然后,使用图分裂技术进行优化,将split值分别设为2,4,6,并测试典型模型在全片6核组上运行对比单核组(split = 1)运行的加速比,测试数据如图12所示.

    图  12  典型模型使用图分裂技术后的全片加速比
    Figure  12.  Full chip acceleration ratios of typical models with graph splitting

    图12中加速效果最好的是ResNet50(split = 6),加速比达4.96,并行效率达到了82.6%. 通过使用图分裂技术,选择合适的参数split,典型模型全片加速比能达到1.78~4.96.

    图分裂技术结合面向异构融合的两级并行策略,在sw26010pro的多核组异构众核结构上取得了很好的并行效果,测试表明,图分裂技术针对典型模型的性能提升效果最高达到246%(ResNet50),最低也能达到50%(AlexNet).

    值得一提的是,从实验数据中也可以看出2个问题:1)sw26010pro的众核结构对模型和核心计算的计算量要求较高,一些轻量级的模型无法充分利用众核资源,所以Alexnet的单核组和6核组加速比都不理想. 2)图分裂技术也会带来图节点数量的大幅度增长,增大内存需求;对于Inception这种本身就具有一定并行性的计算图,会出现图节点膨胀的现象,进而增大节点调度和分配的开销,所以其6核组并行加速比只有2.54~2.88,这也是图分裂技术目前存在的缺陷.

    本文提出了一种面向新一代国产异构众核处理器的数据流计算系统swFLOWpro,该系统通过核心计算加速引擎swHMAE支持在国产异构众核处理器上的并行加速,并提出面向异构融合的两级并行策略,支持面向国产异构众核处理器全芯片视角的调度和并行方法. 实验表明,swHMAE针对卷积类核心计算,众核加速比达90~545,针对其他核心计算,众核加速比达13.9~38.7;swFLOWpro+swHMAE支持典型模型在sw26010pro上的高效执行,VGG16模型众核加速比可达346;通过面向异构融合的数据流调度策略,全片ResNet50加速比达4.96倍,6核组并行效率达到82.6%.

    未来的工作主要包括3个方面:1)继续拓展swHMAE支持的核心计算类型;2)优化面向全片多核组的两级并行策略,优化图分裂算法,探索更高效的数据流调度算法,提升图节点间并行效率;3)完善系统,支持更多种类的神经网络模型高效运行,并引入新的优化算法.

    作者贡献声明:肖谦提出了技术方案,实现系统和撰写论文;赵美佳和李名凡负责核心计算众核优化实现和论文完善;沈莉和陈俊仕负责数据流调度算法实现和优化;周文浩和王飞负责部分实验代码编写;安虹提出指导意见并修改论文.

  • 图  1   论文统计

    Figure  1.   Paper statistics

    图  2   本文的概述

    Figure  2.   The overview of our paper

    图  3   各类型问题示例

    Figure  3.   Examples of various types of questions

    图  4   VideoQA与ImageQA模型对比

    Figure  4.   Comparison of VideoQA and ImageQA models

    图  5   主流的VideoQA模型年历表概览

    Figure  5.   Overview of the mainstream VideoQA model almanacs

    图  6   VideoQA模型处理流程

    Figure  6.   VideoQA model processing flow

    图  7   注意力计算的3个阶段

    Figure  7.   Three stages of attention calculation

    图  8   FVTA和传统注意力的比较[61]

    Figure  8.   Comparison of FVTA and traditional attention[61]

    图  9   PSAC模型结构[67]

    Figure  9.   The structure of PSAC model [67]

    图  10   MSAN模型的关键模块[71]

    Figure  10.   Key modules of MSAN model[71]

    图  11   异构记忆增强多模态注意力模型[84]

    Figure  11.   Heterogeneous memory enhanced multimodal attention model[84]

    图  12   Bridge2Answer方法的图交互部分[94]

    Figure  12.   The graph interaction part of Bridge2Answer method[94]

    图  13   MSPAN网络结构[99]

    Figure  13.   MSPAN network structure[99]

    图  14   (2.5+1)D视频问答推理流程示意图[108]

    Figure  14.   The schematic illustration of (2.5+1)D VideoQA reasoning pipeline[108]

    图  15   流行的视频和语言学习范式和 CLIPBERT之间的比较[120]

    Figure  15.   Comparison between popular video-and-language learning paradigm and CLIPBERT[120]

    图  16   部分数据集示例

    Figure  16.   Some examples of datasets

    图  17   需要视觉和常识知识来回答的问题示例[166]

    Figure  17.   Examples of questions that require visual and common sense knowledge to answer[166]

    表  1   VideoQA综述工作对比

    Table  1   Comparison of VideoQA Survey Works

    综述工作 数据集 方法
    最新年份 个数 最新年份 个数 注意力 记忆网络 图网络 Transformer/BERT 预训练
    Patel等人[8] 2020 18 2020 26 × × ×
    Khurana等人[9] 2019 11 2020 22 × ×
    Sun等人[10] 2019 11 2020 41 × × ×
    本文 2022 30 2022 83
    注:“√”表示使用;“×”表示未使用.

    表  2   各数据集指标对比

    Table  2   Comparison of Indicators of Each Data Set

    数据集 年份 数据源 视频数 片段数 平均长度/s 问答对 问答类型 问答生成
    MovieQA[80] 2016 电影 408 6771 202 14944 选择题 人工
    LSMDC-QA[143] 2017 M-VAD/MPII-MD 202 118081 200 118114 选择题 人工
    MovieFIB[139] 2017 LSMDC2016 180 118507 4.1 348998 填空题 自动
    YouTube2Text-QA[55] 2017 YouTube2Text 1987 1987 40 99421 选择题/开放问题 自动
    TGIF-QA[45] 2017 TGIF 71741 3 165165 选择题/开放问题 人工/自动
    MSRVTT-QA[48] 2017 MSRVTT 7000 10000 15 243000 开放问题 自动
    MSVD-QA[48] 2017 MSVD 1970 1970 10 50505 开放问题 自动
    Video-QA[79] 2017 在线网络视频 18100 18100 90 175076 开放问题 自动
    MarioQA[140] 2017 游戏视频 187757 选择题 自动
    PororoQA[31] 2017 卡通视频 171 16066 8913 选择题 人工
    TVQA[69] 2018 电视剧 925 21793 76 152545 选择题 人工
    SVQA[59] 2018 Unity3D生成 12000 12000 118680 开放问题 自动
    TVQA+[52] 2019 TVQA 279 4198 60~90 29383 选择题 人工
    KnowIT VQA[137] 2019 电视剧 207 12087 20 24000 选择题 人工
    Activitynet-QA[142] 2019 ActivityNet 5800 5800 180 58000 开放问题 人工
    EgoVQA[144] 2019 IU Multiview 16 520 20~100 600 选择题/开放问题 人工
    Social-IQ[145] 2019 YouTube 1250 1250 7500 选择题 人工
    DramaQA[146] 2020 电视剧 18 23928 3.6 17983 选择题 人工
    LifeQA[147] 2020 YouTube 59 275 74 2326 选择题 人工
    Tutorial-VQA[148] 2020 网络教学视频 76 408 6195 开放问题 人工
    How2QA[117] 2020 HowTo100M/TV 9035 22000 60 44007 选择题 人工
    Env-QA[53] 2021 模拟器生成 23261 85072 选择题 自动
    CLEVRER[132] 2021 模拟器生成 20000 20000 5 300000 选择题/开放问题 自动
    TrafficQA[136] 2021 交通视频 10080 10080 62535 选择题 人工
    NExT-QA[149] 2021 YFCC-100M 5440 5440 44 52044 选择题/开放问题 人工
    AGQA[150] 2021 Action genome 9601 9601 30 36000000 选择题/开放问题 自动
    STAR[151] 2021 Charades 22000 30 60000 选择题 自动
    Fill-in-the-Blank[36] 2022 VaTeX 28000 28000 10 28000 填空题 人工
    CRAFT[141] 2022 模拟器生成 9917 57524 10 57524 选择题 自动
    EgoTaskQA[152] 2022 LEMMA 2000 2000 45 40000 选择题/开放问题 自动

    表  3   主流模型在MovieQA上的性能表现

    Table  3   Performance of Mainstream Models on MovieQA %

    模型 验证集准确率 测试集准确率
    DEMN[31] 44.7 30.0
    RWMN[32] 38.7 36.3
    FVTA[61] 41.0 37.3
    LMN[35] 42.5 39.0
    MDAM[68] 41.4
    PAMN[85] 43.3 42.5
    WikiWord Embedding[153] 50.0 47.0

    表  4   主流模型在TVQA上的性能表现

    Table  4   Performance of Mainstream Models on TVQA %

    模型 验证集准确率 测试集准确率
    文献[69] 65.85 66.46
    PAMN[85] 66.38 66.77
    Multi-task[51] 66.22 67.05
    STAGE[52] 70.50 70.23
    MSAN[71] 70.79 71.13
    BERT Rep[102] 72.35 73.06
    DCM&FSG[73] 74.20 74.09
    iPerceive[74] 76.97 75.15
    SP&CRL[115] 76.23 76.15

    表  5   主流模型在TGIF-QA上的性能表现

    Table  5   Performance of Mainstream Models on TGIF-QA

    模型 重复动作/% 状态转换/% 帧问答/% 计数损失
    ST-VQA[45] 60.8 67.1 49.3 4.28
    Co-Mem[34] 68.2 74.3 51.5 4.10
    PSAC[67] 70.4 76.9 55.7 4.27
    LAD-Net[66] 69.9 78.4 57.5 4.32
    STA[65] 72.3 79.0 56.6 4.25
    Jin等人[72] 72.7 80.9 57.1 4.17
    HME[84] 73.9 77.8 53.8 4.02
    L-GCN[91] 74.3 81.1 56.3 3.95
    HGA[92] 75.4 81.0 55.1 4.09
    HCRN[129] 75.0 81.4 55.9 3.82
    HOSTR[130] 75.0 83.0 58.0 3.65
    FAM[81] 75.4 79.2 56.9 3.79
    QueST[60] 75.9 81.0 59.7 4.19
    Bridge2Answer[94] 75.9 82.6 57.5 3.71
    TPT[109] 76.6 81.6 57.8 3.63
    HAIR[101] 77.8 82.3 60.0 3.88
    MSPAN[99] 78.4 83.3 59.7 3.57
    HQGA[131] 76.9 85.6 61.3
    CoCo-BERT[123] 78.3 85.6 61.1 3.78
    SiaSamRea[121] 79.7 85.3 60.2 3.61
    PGAT[100] 80.6 85.7 61.1 3.96
    CLIPBERT[120] 82.8 87.8 60.3
    VGNMN[27] 84.5 88.7 74.7 2.65
    VIOLET[126] 92.5 95.7 68.9
    MERLOT[118] 94.0 96.2 69.5

    表  6   主流模型在MSRVTT-QA和MSVD-QA上的性能表现

    Table  6   Performance of Mainstream Models on MSRVTT-QA and MSVD-QA %

    模型 MSRVTT-QA MSVD-QA
    What Who How When Where All What Who How When Where All
    E-VQA[48] 18.9 38.7 83.5 70.5 29.2 26.4 9.7 42.2 83.8 72.4 53.6 23.3
    E-SA[48] 22.0 41.6 79.6 73.1 33.2 29.3 15.0 45.1 83.8 65.5 32.2 27.6
    E-MN[48] 23.4 41.8 83.7 70.8 27.6 30.4 12.9 46.5 80.3 70.7 50.0 26.7
    AMU[48] 26.2 43.0 80.2 72.5 30.0 32.5 20.6 47.5 83.5 72.4 53.6 32.0
    HRA[56] 35.1 34.4
    HME[84] 26.5 43.6 82.4 76 28.6 33 22.4 50.1 73 70.7 42.9 33.7
    L-GCN[91] 34.3
    Jin等人[72] 29.5 45.0 83.2 74.7 42.4 35.4 24.2 49.5 83.8 74.1 53.6 35.0
    QueST[60] 27.9 45.6 83 75.7 31.6 34.6 24.5 52.9 79.1 72.4 50 36.1
    FAM[81] 26.9 43.9 82.8 70.6 31.1 33.2 23.1 51.6 82.2 71.4 51.9 34.5
    SSML[122] 35.1 35.1
    TSN[33] 27.9 46.1 84.1 77.8 37.6 35.4 25.0 51.3 83.8 78.4 59.1 36.7
    HGA[92] 29.2 45.7 83.5 75.2 34.0 35.5 23.5 50.4 83.0 72.4 46.4 34.7
    MHMAN[86] 28.7 47.1 85.1 77.1 35.2 35.6 23.3 50.7 84.1 72.4 53.6 34.6
    HCRN[129] 35.6 36.1
    ActBERT[37] 29.4 45.6 79.8 76.7 36.4 35.5 28.7 53.8 80.0 70.7 46.4 39.0
    HOSTR[130] 35.9 39.4
    Bridge2Answer[94] 36.9 37.2
    OCRL+LOGNet[97] 36.0 38.2
    HAIR[101] 37.5 36.9
    CLIPBERT[120] 37.4
    PGAT[100] 38.1 39.0
    TPT[109] 38.5 37.7
    MSPAN[99] 31.9 47.2 83.2 77.5 38.4 37.8 31.0 53.8 77.0 72.1 53.6 40.3
    HQGA[131] 32.5 48.9 81.5 78.3 38.4 38.6 30.4 57.2 76.2 75.9 32.1 41.2
    CoMVT[124] 39.5 42.6
    SiaSamRea[121] 41.6 45.5
    VQA-T[116] 41.5 46.3
    VIOLET[126] 43.9 47.9
    LiVLR[98] 50.3 77.1 94.2 81.3 48.4 59.4
  • [1] 俞俊,汪亮,余宙. 视觉问答技术研究[J]. 计算机研究与发展,2018,55(9):1946−1958 doi: 10.7544/issn1000-1239.2018.20180168

    Yu Jun, Wang Liang, Yu Zhou. Research on visual question answering technology[J]. Journal of Computer Research and Development, 2018, 55(9): 1946−1958 (in Chinese) doi: 10.7544/issn1000-1239.2018.20180168

    [2]

    Antol S, Agrawal A, Lu Jiasen, et al. VQA: Visual question answering[C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2015: 2425−2433

    [3]

    Yang Zichao, He Xiaodong, Gao Jianfeng, et al. Stacked attention networks for image question answering[C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 21−29

    [4]

    Qiao Tingting, Dong Jianfeng, Xu Duanqing. Exploring human-like attention supervision in visual question answering[C]//Proc of the 32nd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2018: 7300−7307

    [5]

    Yu Zhou, Yu Jun, Cui Yuhao, et al. Deep modular co-attention networks for visual question answering[C]//Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 6281−6290

    [6]

    Luo Haozheng, Qin Ruiyang. Open-ended multi-modal relational reason for video question answering[J]. arXiv preprint, arXiv: 2012. 00822, 2020

    [7] 包希港,周春来,肖克晶,等. 视觉问答研究综述[J]. 软件学报,2021,32(8):2522−2544 doi: 10.13328/j.cnki.jos.006215

    Bao Xigang, Zhou Chunlai, Xiao Kejing, et al. Review of visual question answering research[J]. Journal of Software, 2021, 32(8): 2522−2544 (in Chinese) doi: 10.13328/j.cnki.jos.006215

    [8]

    Patel D, Parikh R, Shastri Y. Recent advances in video question answering: A review of datasets and methods [C]//Proc of the Int Conf on Pattern Recognition. Berlin: Springer, 2021: 339−356

    [9]

    Khurana K, Deshpande U. Video question-answering techniques, benchmark datasets and evaluation metrics leveraging video captioning: A comprehensive survey [J]. IEEE Access, 2021, 9: 43799−43823

    [10]

    Sun Guanglu, Liang Lili, Li Tianlin, et al. Video question answering: A survey of models and datasets[J]. Mobile Networks and Applications, 2021, 26(5): 1904−1937 doi: 10.1007/s11036-020-01730-0

    [11]

    Mnih V, Heess N, Graves A. Recurrent models of visual attention [C]// Proc of the 27th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2014: 2204−2212

    [12]

    Weston J, Chopra S, Bordes A. Memory networks[J]. arXiv preprint, arXiv: 1410. 3916, 2014

    [13]

    Scarselli F, Gori M, Tsoi A C, et al. The graph neural network model[J]. IEEE Transactions on Neural Networks, 2008, 20(1): 61−80

    [14]

    Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks[J]. arXiv preprint, arXiv: 1609. 02907, 2016

    [15]

    Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need [C]// Proc of the 30th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 5998−6008

    [16]

    Devlin J, Chang Mingwei, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [C]//Proc of the 17th Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2019: 4171−4186

    [17]

    Ren Shaoqing, He Kaiming, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(6): 1137−1149

    [18]

    Deng Jia, Dong Wei, Socher R, et al. ImageNet: A large-scale hierarchical image database [C]//Proc of the 22nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2009: 248−255

    [19]

    Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint, arXiv: 1409. 1556, 2014

    [20]

    Szegedy C, Liu Wei, Jia Yangqing, et al. Going deeper with convolutions [C]//Proc of the 28th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 1−9

    [21]

    He Kaiming, Zhang Xiangyu, Ren Shaoqing, et al. Deep residual learning for image recognition [C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 770−778

    [22]

    Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3D convolutional networks [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2015: 4489−4497

    [23]

    Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset [C]//Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 6299−6308

    [24]

    Xie Saining, Sun Chen, Huang Jonathan, et al. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification [C]//Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 305−321

    [25]

    Hara K, Kataoka H, Satoh Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? [C]//Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 6546−6555

    [26]

    Feichtenhofer C, Fan Haoqi, Malik J, et al. SlowFast networks for video recognition [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 6202−6211

    [27]

    Le H, Chen N F, Hoi S C H. VGNMN: Video-grounded neural module network to video-grounded language tasks[J]. arXiv preprint, arXiv: 2104. 07921, 2021

    [28]

    Shah A, Lin T H, Wu Shijie. Triple attention network architecture for MovieQA[J]. arXiv preprint, arXiv: 2111. 09531, 2021

    [29]

    Aytar Y, Vondrick C, Torralba A. SoundNet: Learning sound representations from unlabeled video [C]//Proc of the 29th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2016: 892−900

    [30]

    Kumar A, Khadkevich M, Fügen C. Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes [C]//Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2018: 326−330

    [31]

    Kim K M, Heo M O, Choi S H, et al. Deepstory: Video story QA by deep embedded memory networks [C]//Proc of the 26th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2017: 2016−2022

    [32]

    Na S, Lee S, Kim J, et al. A read-write memory network for movie story understanding [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 677−685

    [33]

    Yang Tianhao, Zha Zhengjun, Xie Hongtao, et al. Question-aware tube-switch network for video question answering [C]//Proc of the 27th ACM Int Conf on Multimedia. New York: ACM, 2019: 1184−1192

    [34]

    Gao Jiyang, Ge Runzhou, Chen Kan, et al. Motion-appearance co-memory networks for video question answering [C]//Proc of the 31st IEEE CONF on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 6576−6585

    [35]

    Wang Bo, Xu Youjiang, Han Yahong, et al. Movie question answering: Remembering the textual cues for layered visual contents [C]//Proc of the 32nd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2018: 7380−7387

    [36]

    Castro S, Wang Ruoyao, Huang Pingxuan, et al. FIBER: Fill-in-the-blanks as a challenging video understanding evaluation framework [C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2022: 2925−2940

    [37]

    Zhu Linchao, Yang Yi. Actbert: Learning global-local video-text representations [C]//Proc of the 33rd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 8746−8755

    [38]

    Mikolov T, Chen Kai, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint, arXiv: 1301. 3781, 2013

    [39]

    Pennington J, Socher R, Manning C D. GloVe: Global vectors for word representation [C]//Proc of the 2014 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2014: 1532−1543

    [40]

    Kiros R, Zhu Yukun, Salakhutdinov R R, et al. Skip-Thought vectors [C]//Proc of the 28th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2015: 3294−3302

    [41]

    Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735−1780 doi: 10.1162/neco.1997.9.8.1735

    [42]

    Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation [C]//Proc of the 2014 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2014: 1724−1734

    [43]

    Zhao Zhou, Lin Jinghao, Jiang Xinghua, et al. Video question answering via hierarchical dual-level attention network learning [C]//Proc of the 25th ACM Int Conf on Multimedia. New York: ACM, 2017: 1050−1058

    [44]

    Xue Hongyang, Chu Wenqing, Zhao Zhou, et al. A better way to attend: Attention with trees for video question answering[J]. IEEE Transactions on Image Processing, 2018, 27(11): 5563−5574 doi: 10.1109/TIP.2018.2859820

    [45]

    Jang Y, Song Y, Yu Y, et al. TGIF-QA: Toward spatio-temporal reasoning in visual question answering [C]//Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 2758−2766

    [46]

    Falcon A, Lanz O, Serra G. Data augmentation techniques for the video question answering task [C]//Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 511−525

    [47]

    Mazaheri A, Zhang Dong, Shah M. Video fill in the blank using LR/RL LSTMs with spatial-temporal attentions [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 1407−1416

    [48]

    Xu Dejing, Zhao Zhou, Xiao Jun, et al. Video question answering via gradually refined attention over appearance and motion [C]//Proc of the 25th ACM Int Conf on Multimedia. New York: ACM, 2017: 1645−1653

    [49]

    Chao Guanlin, Rastogi A, Yavuz S, et al. Learning question-guided video representation for multi-turn video question answering [C]//Proc of the 20th Annual SIGDIAL Meeting on Discourse and Dialogue. Stroudsburg, PA: ACL, 2019: 215−225

    [50]

    Zhao Zhou, Zhang Zhu, Xiao Shuwen, et al. Open-ended long-form video question answering via adaptive hierarchical reinforced networks [C]//Proc of the 27th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2018: 3683−3689

    [51] Kim J, Ma M, Kim K, et al. Gaining extra supervision via multi-task learning for multi-modal video question answering [C]//Proc of the 2019 Int Joint Conf on Neural Networks. Piscataway, NJ: IEEE, 2019: 1−8


    [52]

    Lei Jie, Yu Licheng, Berg T L, et al. TVQA+: Spatio-temporal grounding for video question answering [C]//Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 8211−8225

    [53]

    Gao Difei, Wang Ruiping, Bai Ziyi, et al. Env-QA: A video question answering benchmark for comprehensive understanding of dynamic environments [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 1675−1685

    [54]

    Yu Y, Kim J, Kim G. A joint sequence fusion model for video question answering and retrieval [C]//Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 471−487

    [55]

    Ye Yunan, Zhao Zhou, Li Yimeng, et al. Video question answering via attribute-augmented attention network learning [C]//Proc of the 40th Annual Int ACM SIGIR Conf on Research and Development in Information Retrieval. New York: ACM, 2017: 829−832

    [56]

    Chowdhury M I H, Nguyen K, Sridharan S, et al. Hierarchical relational attention for video question answering [C]//Proc of the 25th IEEE Int Conf on Image Processing. Piscataway, NJ: IEEE, 2018: 599−603

    [57]

    Zhao Zhou, Jiang Xinghua, Cai Deng, et al. Multi-turn video question answering via multi-stream hierarchical attention context network [C]//Proc of the 27th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2018: 3690−3696

    [58]

    Zhao Zhou, Yang Qifan, Cai Deng, et al. Video question answering via hierarchical spatio-temporal attention networks [C]//Proc of the 26th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2017: 3518−3524

    [59]

    Song Xiaomeng, Shi Yucheng, Chen Xin, et al. Explore multi-step reasoning in video question answering [C]//Proc of the 26th ACM Int Conf on Multimedia. New York: ACM, 2018: 239−247

    [60]

    Jiang Jianwen, Chen Ziqiang, Lin Haojie, et al. Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering [C]//Proc of the 34th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2020: 11101−11108

    [61]

    Liang Junwei, Jiang Lu, Cao Liangliang, et al. Focal visual-text attention for visual question answering [C]//Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 6135−6143

    [62]

    Yu Zhou, Yu Jun, Fan Jianping, et al. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 1821−1830

    [63]

    Xue Hongyang, Zhao Zhou, Cai Deng. Unifying the video and question attentions for open-ended video question answering[J]. IEEE Transactions on Image Processing, 2017, 26(12): 5656−5666 doi: 10.1109/TIP.2017.2746267

    [64]

    Chu Wenqing, Xue Hongyang, Zhao Zhou, et al. The forgettable-watcher model for video question answering[J]. Neurocomputing, 2018, 314: 386−393 doi: 10.1016/j.neucom.2018.06.069

    [65]

    Gao Lianli, Zeng Pengpeng, Song Jingkuan, et al. Structured two-stream attention network for video question answering [C]//Proc of the 33rd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2019: 6391−6398

    [66]

    Li Xiangpeng, Gao Lianli, Wang Xuanhan, et al. Learnable aggregating net with diversity learning for video question answering [C]//Proc of the 27th ACM Int Conf on Multimedia. New York: ACM, 2019: 1166−1174

    [67]

    Li Xiangpeng, Song Jingkuan, Gao Lianli, et al. Beyond RNNs: Positional self-attention with co-attention for video question answering [C]//Proc of the 33rd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2019: 8658−8665

    [68]

    Kim K M, Choi S H, Kim J H, et al. Multimodal dual attention memory for video story question answering [C]//Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 673−688

    [69]

    Lei Jie, Yu Licheng, Bansal M, et al. TVQA: Localized, compositional video question answering [C]//Proc of the 2018 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2018: 1369−1379

    [70]

    Li Fangtao, Bai Ting, Cao Chenyu, et al. Relation-aware hierarchical attention framework for video question answering[C]//Proc of the 2021 Int Conf on Multimedia Retrieval. New York: ACM, 2021: 164−172

    [71]

    Kim J, Ma M, Pham T, et al. Modality shifting attention network for multi-modal video question answering [C]//Proc of the 33rd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10106−10115

    [72]

    Jin Weike, Zhao Zhou, Gu Mao, et al. Multi-interaction network with object relation for video question answering [C]//Proc of the 27th ACM Int Conf on Multimedia. New York: ACM, 2019: 1193−1201

    [73]

    Kim H, Tang Zineng, Bansal M. Dense-caption matching and frame-selection gating for temporal localization in VideoQA [C]//Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 4812−4822

    [74]

    Chadha A, Arora G, Kaloty N. iPerceive: Applying common-sense reasoning to multi-modal dense video captioning and video question answering [C]//Proc of the 2021 IEEE Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2021: 1−13

    [75]

    Seo M, Kembhavi A, Farhadi A, et al. Bidirectional attention flow for machine comprehension [C/OL]//Proc of the Int Conf on Learning Representations. 2017 [2022-01-10]. https://openreview.net/forum?id=HJ0UKP9ge

    [76]

    Yu A W, Dohan D, Luong M T, et al. Qanet: Combining local convolution with global self-attention for reading comprehension [C/OL]//Proc of the Int Conf on Learning Representations. 2018 [2022-01-10].https://openreview.net/forum?id=B14TlG-RW

    [77]

    Veličković P, Cucurull G, Casanova A, et al. Graph attention networks [C/OL]//Proc of the Int Conf on Learning Representations. 2018 [2022-01-12].https://openreview.net/forum?id=rJXMpikCZ

    [78]

    Sukhbaatar S, Weston J, Fergus R. End-to-end memory networks [C]//Proc of the 28th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2015: 2440−2448

    [79]

    Zeng K H, Chen T H, Chuang C Y, et al. Leveraging video descriptions to learn video question answering [C]//Proc of the 31st AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2017: 4334−4340

    [80]

    Tapaswi M, Zhu Yukun, Stiefelhagen R, et al. MovieQA: Understanding stories in movies through question-answering [C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4631−4640

    [81]

    Cai Jiayin, Yuan Chun, Shi Cheng, et al. Feature augmented memory with global attention network for VideoQA [C]//Proc of the 30th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2021: 998−1004

    [82]

    Kumar A, Irsoy O, Ondruska P, et al. Ask me anything: Dynamic memory networks for natural language processing [C]//Proc of the 33rd Int Conf on Machine Learning. New York: ACM, 2016: 1378−1387

    [83]

    Xiong Caiming, Merity S, Socher R. Dynamic memory networks for visual and textual question answering [C]//Proc of the 33rd Int Conf on Machine Learning. New York: ACM, 2016: 2397−2406

    [84]

    Fan Chenyou, Zhang Xiaofan, Zhang Shu, et al. Heterogeneous memory enhanced multimodal attention model for video question answering [C]//Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 1999−2007

    [85]

    Kim J, Ma M, Kim K, et al. Progressive attention memory network for movie story question answering [C]//Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 8337−8346

    [86]

    Yu Ting, Yu Jun, Yu Zhou, et al. Long-term video question answering via multimodal hierarchical memory attentive networks[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 31(3): 931−944

    [87]

    Fukui A, Park D H, Yang D, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding [C]//Proc of the 2020 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2016: 457−468

    [88]

    Kim J H, On K W, Lim W, et al. Hadamard product for low-rank bilinear pooling [C/OL]//Proc of the Int Conf on Learning Representations. 2017 [2022-01-20]. https://openreview.net/forum?id=r1rhWnZkg

    [89]

    Wang Zhichun, Lv Qingsong, Lan Xiaohan, et al. Cross-lingual knowledge graph alignment via graph convolutional networks [C]//Proc of the 2018 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2018: 349−357

    [90]

    Fan Wenqi, Ma Yao, Li Qing, et al. Graph neural networks for social recommendation [C]//Proc of the 2019 World Wide Web Conf. New York: ACM, 2019: 417−426

    [91]

    Huang Deng, Chen Peihao, Zeng Runhao, et al. Location-aware graph convolutional networks for video question answering [C]//Proc of the 34th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2020: 11021−11028

    [92]

    Jiang Pin, Han Yahong. Reasoning with heterogeneous graph alignment for video question answering [C]//Proc of the 34th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2020: 11109−11116

    [93]

    Seo A, Kang G C, Park J, et al. Attend what you need: Motion-appearance synergistic networks for video question answering [C]//Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2021: 6167–6177

    [94]

    Park J, Lee J, Sohn K. Bridge to Answer: Structure-aware graph interaction network for video question answering [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 15526−15535

    [95]

    Wang Jianyu, Bao Bingkun, Xu Changsheng. DualVGR: A dual-visual graph reasoning unit for video question answering [J]. IEEE Transactions on Multimedia, 2021, 24: 3369−3380

    [96]

    Wang Xiao, Zhu Meiqi, Bo Deyu, et al. AM-GCN: Adaptive multi-channel graph convolutional networks [C]//Proc of the 26th ACM SIGKDD Int Conf on Knowledge Discovery & Data Mining. New York: ACM, 2020: 1243−1253

    [97]

    Dang L H, Le T M, Le V, et al. Object-centric representation learning for video question answering [C/OL]//Proc of the 2021 Int Joint Conf on Neural Networks. Piscataway, NJ: IEEE, 2021 [2022-01-22].https://arxiv.org/abs/2104.05166

    [98]

    Jiang Jingjing, Liu Ziyi, Zheng Nanning, et al. LiVLR: A lightweight visual-linguistic reasoning framework for video question answering [J/OL]. IEEE Transactions on Multimedia, 2022 [2022-02-22].https://github.com/jingjing12110/LiVLR-VideoQA

    [99]

    Guo Zhicheng, Zhao Jiaxuan, Jiao Licheng, et al. Multi-scale progressive attention network for video question answering [C]//Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing (Volume 2: Short Papers). Stroudsburg, PA: ACL, 2021: 973−978

    [100]

    Peng Liang, Yang Shuangji, Bin Yi, et al. Progressive graph attention network for video question answering [C]//Proc of the 29th ACM Int Conf on Multimedia. New York: ACM, 2021: 2871−2879

    [101]

    Liu Fei, Liu Jing, Wang Weining, et al. HAIR: Hierarchical visual-semantic relational reasoning for video question answering [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 1698−1707

    [102]

    Yang Zekun, Garcia N, Chu Chenhui, et al. Bert representations for video question answering [C]//Proc of the 2020 IEEE Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2020: 1556−1565

    [103]

    Urooj Khan A, Mazaheri A, da Vitoria Lobo N, et al. MMFT-BERT: Multimodal fusion transformer with BERT encodings for visual question answering [C]//Proc of the 2020 Conf on Empirical Methods in Natural Language Processing(Findings). Stroudsburg, PA: ACL, 2020: 4648−4660

    [104]

    Garcia N, Nakashima Y. Knowledge-based video question answering with unsupervised scene descriptions [C]//Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 581−598

    [105]

    Engin D, Schnitzler F, Duong N Q K, et al. On the hidden treasure of dialog in video question answering [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 2064−2073

    [106]

    Sadhu A, Chen Kan, Nevatia R. Video question answering with phrases via semantic roles [C]//Proc of the 19th Int Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2021: 2460−2478

    [107]

    Ganesan A, Pal D, Muthuraman K, et al. Video based contextual question answering[J]. arXiv preprint, arXiv: 1804. 07399, 2018

    [108]

    Cherian A, Hori C, Marks T K, et al. (2.5+ 1) D spatio-temporal scene graphs for video question answering [C]//Proc of the 36th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2022: 444−453

    [109]

    Peng Min, Wang Chongyang, Gao Yuan, et al. Temporal pyramid transformer with multimodal interaction for video question answering[J]. arXiv preprint, arXiv: 2109. 04735, 2021

    [110]

    Tan Hao, Bansal M. Lxmert: Learning cross-modality encoder representations from transformers [C]//Proc of the 2019 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2019: 5100−5111

    [111]

    Sun Chen, Myers A, Vondrick C, et al. Videobert: A joint model for video and language representation learning [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 7464−7473

    [112]

    Chen Xinlei, Fang Hao, Lin T Y, et al. Microsoft COCO captions: Data collection and evaluation server[J]. arXiv preprint, arXiv: 1504. 00325, 2015

    [113]

    Krishna R, Zhu Yuke, Groth O, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123(1): 32−73 doi: 10.1007/s11263-016-0981-7

    [114]

    Miech A, Zhukov D, Alayrac J B, et al. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 2630−2640

    [115]

    Kim S, Jeong S, Kim E, et al. Self-supervised pre-training and contrastive representation learning for multiple-choice video QA[J]. arXiv preprint, arXiv: 2009. 08043, 2020

    [116]

    Yang A, Miech A, Sivic J, et al. Just ask: Learning to answer questions from millions of narrated videos [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 1686−1697

    [117]

    Li Linjie, Chen Y C, Cheng Yu, et al. HERO: Hierarchical encoder for video+ language omni-representation pre-training [C]//Proc of the 2020 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 2046−2065

    [118]

    Zellers R, Lu Ximing, Hessel J, et al. MERLOT: Multimodal neural script knowledge models [C]//Proc of the 34th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2021: 23634−23651

    [119]

    Liu Yinhan, Ott M, Goyal N, et al. RoBERTa: A robustly optimized BERT pretraining approach [C/OL]//Proc of the Int Conf on Learning Representations. 2020 [2022-01-25]. https://openreview.net/forum?id=SyxS0T4tvS

    [120]

    Lei Jie, Li Linjie, Zhou Luowei, et al. Less is more: Clipbert for video-and-language learning via sparse sampling [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 7331−7341

    [121]

    Yu Weijiang, Zheng Haoteng, Li Mengfei, et al. Learning from Inside: Self-driven siamese sampling and reasoning for video question answering [C]//Proc of the 34th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2021: 26462−26474

    [122]

    Amrani E, Ben-Ari R, Rotman D, et al. Noise estimation using density estimation for self-supervised multimodal learning [C]//Proc of the 35th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2021: 6644−6652

    [123]

    Luo Jianjie, Li Yehao, Pan Yingwei, et al. CoCo-BERT: Improving video-language pre-training with contrastive cross-modal matching and denoising [C]//Proc of the 29th ACM Int Conf on Multimedia. New York: ACM, 2021: 5600−5608

    [124]

    Seo P H, Nagrani A, Schmid C. Look before you speak: Visually contextualized utterances [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 16877−16887

    [125]

    Lu Jiasen, Batra D, Parikh D, et al. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks [C]//Proc of the 32nd Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2019: 13−23

    [126]

    Fu T J, Li Linjie, Gan Zhe, et al. VIOLET: End-to-end video-language transformers with masked visual-token modeling[J]. arXiv preprint, arXiv: 2111. 12681, 2021

    [127]

    Liu Ze, Ning Jia, Cao Yue, et al. Video swin transformer [C]//Proc of the 35th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 3202−3211

    [128]

    Zhou Luowei, Liu Jingjing, Cheng Yu, et al. Cupid: Adaptive curation of pre-training data for video-and-language representation learning[J]. arXiv preprint, arXiv: 2104. 00285, 2021

    [129]

    Le T M, Le V, Venkatesh S, et al. Hierarchical conditional relation networks for video question answering [C]//Proc of the 33rd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 9972−9981

    [130]

    Dang L H, Le T M, Le V, et al. Hierarchical object-oriented spatio-temporal reasoning for video question answering [C]//Proc of the 30th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2021: 636−642

    [131]

    Xiao Junbin, Yao A, Liu Zhiyuan, et al. Video as conditional graph hierarchy for multi-granular question answering [C]//Proc of the 36th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2022: 2804−2812

    [132]

    Yi Kexin, Gan Chuang, Li Yunzhu, et al. CLEVRER: Collision events for video representation and reasoning [C/OL]//Proc of the Int Conf on Learning Representations. 2020 [2022-09-03].https://openreview.net /forum?id=HkxYzANYDB

    [133]

    Chen Zhenfang, Mao Jiayuan, Wu Jiajun, et al. Grounding physical concepts of objects and events through dynamic visual reasoning [C/OL]//Proc of the Int Conf on Learning Representations. 2021 [2022-09-03].https://openreview.net/pdf?id=bhCDO_cEGCz

    [134]

    Ding Mingyu, Chen Zhenfang, Du Tao, et al. Dynamic visual reasoning by learning differentiable physics models from video and language [C]//Proc of the 34th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2021, 34: 887−899

    [135]

    Mao Jiayuan, Gan Chuang, Kohli P, et al. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision [C/OL]//Proc of the Int Conf on Learning Representations. 2019 [2022-09-04].https://research.ibm.com/publications/the-neuro-symbolic-concept-learner-interpreting-scenes-words-and-sentences-from-natural-supervision

    [136]

    Xu Li, Huang He, Liu Jun. SUTD-TrafficQA: A question answering benchmark and an efficient network for video reasoning over traffic events [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 9878−9888

    [137]

    Garcia N, Otani M, Chu Chenhui, et al. KnowIT VQA: Answering knowledge-based questions about videos [C]//Proc of the 34th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2020: 10826−10834

    [138]

    Han Yahong, Wang Bo, Hong Richang, et al. Movie question answering via textual memory and plot graph[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 30(3): 875−887

    [139]

    Maharaj T, Ballas N, Rohrbach A, et al. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering [C]//Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 6884−6893

    [140]

    Mun J, Hongsuck Seo P, Jung I, et al. MarioQA: Answering questions by watching gameplay videos [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 2867−2875

    [141]

    Ates T, Atesoglu M S, Yigit C, et al. CRAFT: A benchmark for causal reasoning about forces and interactions [C]//Proc of the 2022 Findings of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 2602–2627

    [142]

    Yu Zhou, Xu Dejing, Yu Jun, et al. Activitynet-QA: A dataset for understanding complex web videos via question answering [C]//Proc of the 33rd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2019: 9127−9134

    [143]

    Torabi A, Tandon N, Sigal L. Learning language-visual embedding for movie understanding with natural-language[J]. arXiv preprint, arXiv: 1609. 08124, 2016

    [144]

    Fan Chenyou. EgoVQA-an egocentric video question answering benchmark dataset [C]//Proc of the 2019 IEEE Int Conf on Computer Vision Workshops. Piscataway, NJ: IEEE, 2019: 4359−4366

    [145]

    Zadeh A, Chan M, Liang P P, et al. Social-IQ: A question answering benchmark for artificial social intelligence [C]//Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 8807−8817

    [146]

    Choi S, On K W, Heo Y J, et al. DramaQA: Character-centered video story understanding with hierarchical QA [C]//Proc of the 35th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI 2021: 1166−1174

    [147]

    Castro S, Azab M, Stroud J, et al. LifeQA: A real-life dataset for video question answering [C]//Proc of the 12th Language Resources and Evaluation Conf. Marseille: European Language Resources Association (ELRA), 2020: 4352−4358

    [148]

    Colas A, Kim S, Dernoncourt F, et al. Tutorial-VQA: Question answering dataset for tutorial videos [C]//Proc of the 12th Language Resources and Evaluation Conf. Marseille: European Language Resources Association (ELRA), 2020: 5450–5455

    [149]

    Xiao Junbin, Shang Xindi, Yao A, et al. NExT-QA: Next phase of question-answering to explaining temporal actions [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 9777−9786

    [150]

    Grunde-McLaughlin M, Krishna R, Agrawala M. AGQA: A benchmark for compositional spatio-temporal reasoning [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 11287−11297

    [151]

    Wu Bo, Yu Shoubin, Chen Zhenfang, et al. STAR: A benchmark for situated reasoning in real-world videos [C/OL]//Proc of the 35th Conf on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). Cambridge, MA: MIT Press, 2021 [2022-01-25].https://openreview.net/forum?id=EfgNF5-ZAjM

    [152]

    Jia Baoxiong, Lei Ting, Zhu Songchun, et al. EgoTaskQA: Understanding human tasks in egocentric videos [C/OL]//Proc of the 36th Advances in Neural Information Processing Systems Datasets and Benchmarks Track. Cambridge, MA: MIT, 2022 [2022-01-25].https://openreview.net/forum?id=ttxAvIQA4i_

    [153]

    Jasani B, Girdhar R, Ramanan D. Are we asking the right questions in MovieQA? [C]//Proc of the IEEE Int Conf on Computer Vision Workshops. Piscataway, NJ: IEEE, 2019: 1879−1882

    [154]

    Rohrbach A, Torabi A, Rohrbach M, et al. Movie description[J]. International Journal of Computer Vision, 2017, 123(1): 94−120 doi: 10.1007/s11263-016-0987-1

    [155]

    Kolve E, Mottaghi R, Han W, et al. AI2-THOR: An interactive 3D environment for visual AI[J]. arXiv preprint, arXiv: 1712. 05474, 2017

    [156]

    Li Yuncheng, Song Yale, Cao Liangliang, et al. TGIF: A new dataset and benchmark on animated Gif description [C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4641−4650

    [157]

    Guadarrama S, Krishnamoorthy N, Malkarnenkar G, et al. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2013: 2712−2719

    [158]

    Ji Jingwei, Krishna R, Fei-Fei L, et al. Action genome: Actions as compositions of spatio-temporal scene graphs [C]//Proc of the 33rd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10236−10247

    [159]

    Thomee B, Shamma D A, Friedland G, et al. YFCC-100M: The new data in multimedia research[J]. Communications of the ACM, 2016, 59(2): 64−73 doi: 10.1145/2812802

    [160]

    Sigurdsson G A, Russakovsky O, Gupta A. What actions are needed for understanding human actions in videos? [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 2137−2146

    [161]

    Wang Xin, Wu Jiawei, Chen Junkun, et al. VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 4581−4591

    [162]

    Jia Baoxiong, Chen Yixin, Huang Siyuan, et al. LEMMA: A multi-view dataset for learning multi-agent multi-task activities [C]//Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 767−786

    [163]

    Wang Xinyu, Liu Yuliang, Shen Chunhua, et al. On the general value of evidence, and bilingual scene-text visual question answering [C]//Proc of the 33rd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10126−10135

    [164]

    Marino K, Chen Xinlei, Parikh D, et al. Krisp: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 14111−14121

    [165]

    Zhang Yifeng, Jiang Ming, Zhao Qi. Explicit knowledge incorporation for visual reasoning [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 1356−1365

    [166]

    Wang Peng, Wu Qi, Shen Chunhua, et al. FVQA: Fact-based visual question answering[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(10): 2413−2427

    [167]

    Wu Qi, Wang Peng, Shen Chunhua, et al. Ask me anything: Free-form visual question answering based on knowledge from external sources [C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4622−4630

    [168]

    Marino K, Rastegari M, Farhadi A, et al. Ok-VQA: A visual question answering benchmark requiring external knowledge [C]//Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 3195−3204

    [169]

    Shah S, Mishra A, Yadati N, et al. KVQA: Knowledge-aware visual question answering [C]//Proc of the 33rd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2019: 8876−8884

    [170]

    Chen Zhou, Chen Jiaoyan, Geng Yuxia, et al. Zero-shot visual question answering using knowledge graph [C]//Proc of the 20th Int Semantic Web Conf. Berlin: Springer, 2021: 146−162

    [171]

    Wu Jialin, Lu Jiasen, Sabharwal A, et al. Multi-modal answer validation for knowledge-based VQA [C]//Proc of the 36th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2022: 2712−2721

    [172]

    Wu Qi, Shen Chunhua, Wang Peng, et al. Image captioning and visual question answering based on attributes and external knowledge[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(6): 1367−1381

    [173]

    Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision [C]//Proc of the 38th Int Conf on Machine Learning. New York: ACM, 2021: 8748−8763

    [174]

    Ju Chen, Han Tengda, Zheng Kunhao, et al. Prompting visual-language models for efficient video understanding [C]//Proc of the 17th European Conf on Computer Vision. Berlin: Springer, 2022: 105−124

Publication history
  • Received: 2022-04-11
  • Revised: 2023-06-06
  • Available online: 2023-12-07
  • Issue date: 2024-03-01
