车联网[1-3]作为一种多领域交叉的新兴网络,涉及信息通信、交通、汽车等领域,引起了国内外工业界与学术界的广泛关注. 蜂窝车联网(cellular vehicle-to-everything,C-V2X)技术是实现车联网中车与车(vehicle-to-vehicle,V2V)、车与基础设施(vehicle-to-roadside infrastructure,V2I)、车与网络(vehicle-to-network,V2N)以及车与人(vehicle-to-pedestrian,V2P)等全方位连接和通信的新一代信息通信技术,如图1所示,其中V2N是指车辆通过接入网或核心网与云平台连接,云平台与车辆之间进行数据交互,提供车辆所需要的各类应用服务[4],如车辆导航、车辆远程监控、紧急救援、信息娱乐服务等,而V2V通信侧重于车辆之间提供低延迟、高可靠、严时效的实时信息传输服务[5-6].
与传统的无线蜂窝网络相比,车联网具有高动态、时空关联、不确定特性以及严格的服务质量(quality of service,QoS)要求,这使得C-V2X通信面临着诸多特有的挑战. 从空间维度来看,多个用户同时存在于同一网络中,并相互竞争有限的无线通信资源, 适当协调用户的传输行为,处理随机噪声、衰落和干扰的联合影响是必要的;从时间维度来看,由于车联网的动态不确定特性,难以获得准确的信道状态信息(channel state information,CSI),需要自适应、快速准确地作出传输决策;从业务类型来看,不同类型链路需要支持不同QoS要求的应用. 因此,针对车联网动态不确定特性、业务类型的多元化以及无线通信资源稀缺的特点,研究V2N和V2V链路资源协同共享以保证C-V2X车联网业务的多指标需求和无线资源的有效利用,是当前车联网资源分配亟需解决的问题[7-8].
鉴于上述问题,从多目标优化(multi-objective optimization,MO)的角度研究了典型多用户C-V2X通信网络的高效传输设计. 特别地,信息年龄(age of information,AoI)是一种有效的度量信息新鲜度的方法,AoI 描述了自最新状态更新被生成以来所经过的时间[9],与传统网络性能指标(如延迟、可靠性或传输速率)不同,AoI被视为一种时效性性能指标. 与现有的大多数研究工作通常仅优化单一目标不同,进一步提出了以V2V链路的AoI为目标之一的MO问题 [10]. 在动态复杂环境下,确定跨多个时隙的V2V信道和功率,以保证V2N与V2V通信的QoS要求. 主要贡献有3个方面:
1)考虑到车联网中V2N与V2V通信的 QoS需求差异,提出了一个新的多目标优化无线资源分配(multi-objective optimization for wireless resource allocation,MO-WRA)问题,在V2N和V2V链路共存且共享频谱的复杂蜂窝车联网情况下确定信道选择和功率控制,以实现不同链路优化目标之间的权衡,同时保证V2V通信链路的AoI.
2)由于MO-WRA问题涉及动态不确定环境下信道状态信息不准确、时间相关的非凸目标和约束以及相互影响的目标,这导致了极大的决策空间. 结合进化学习,进一步设计了基于多目标深度强化学习的V2V资源分配算法,通过训练好的神经网络模型可以得到MO-WRA问题的帕累托前沿.
3)为了应对大规模V2V通信,加速决策网络提取关键环境状态信息,引入注意力机制以优化深度神经网络,提升神经网络训练速度,增强其实时决策能力.
1. 相关工作
近年来,蜂窝车联网资源分配问题得到了广泛研究. 文献[11]设计了资源分配算法,该算法在传输可靠性和排队时延的约束下,使V2V链路的总吞吐量最大化. 文献[12]提出了用于V2I和V2V链路共存的网络调度和功率控制算法,以提高系统吞吐量. 文献[13]提出了一种双时间尺度资源分配算法,该算法基于大时间尺度道路交通信息减小V2V链路传输的最大时延. 文献[14]也提出了一种基于V2V和V2I通信的随机模型,该模型结合车辆移动性、信道争用和衰落的影响,提高了通信和计算的可靠性. 尽管上述资源分配策略通过有效的资源分配来提升网络性能,但它们主要关注传统性能指标,如吞吐量、可靠性和时延等,而无法准确衡量接收端的信息新鲜度.
目前学术界提出了“信息年龄AoI”的概念[9]. AoI是一种有效度量信息新鲜度的性能指标,被定义为接收端获取的最新数据包自产生时刻到当前接收时刻所经过的时间. 通过资源分配以优化车联网中的AoI性能已成为当前的研究热点. 现有研究主要通过控制信息发送频率避免网络拥塞和降低网络传输时延以最小化系统平均AoI. 文献[15]采用李雅普诺夫(Lyapunov)优化方法设计了一种分布式年龄感知数据收集算法,该算法包括基于阈值的源车辆采样策略,可以更加及时地收集状态更新. 文献[16]结合平均场理论来分析虚拟传感器网络的网络AoI,充分考虑了车辆网络的社会特征和潜在的无线通信过程,进一步联合优化源节点处信息更新速率和传感器处的传输概率以最小化平均AoI. 文献[17]利用极值理论和Lyapunov优化方法,考虑到AoI极端事件发生概率极低的情况,提出了一种感知AoI的资源分配算法,以保证超可靠的低时延通信. 但是该算法假设每个车辆用户对中发射机和接收机之间的关联是固定的,这种假设可能会简化模型,但同时也可能忽略了车联网环境中的动态变化,从而限定了对车联网动态性的真实反映. 现有的AoI相关的研究大多采用传统的排队论建立理论模型,这种方法在处理大规模动态复杂场景存在局限. 而利用机器学习进行数据驱动建模以更准确地捕捉和模拟车联网中的动态行为,尚有很大的探索空间.
当前研究大多基于一个理想化的假设,即能够获得全局信道状态信息,然而,在车辆高速移动的场景下,信道条件的快速变化使得获取精确的信道状态信息变得极为困难[18];此外,尽管通过传统信息论方法获得较为准确的信道状态信息[19-20],但由于计算成本高,仍然难以满足动态车联网环境对于实时应用的迫切需求. 深度强化学习(deep reinforcement learning,DRL)融合了深度学习的感知和强化学习的决策2种特性, 既可从高维原始数据中直接获取动态环境特征, 又具有传统动态规划和马尔可夫决策过程的理论保障以使得网络能够通过智能体与环境的交互来学习动态资源分配策略.
文献[21]提出了一种基于信赖域策略优化的车联网联合频谱和功率分配算法,重点研究了系统平均AoI的最小化问题. 在参考文献[22]中,设计了基于多智能体强化学习的分布式资源分配算法,以最小化系统平均AoI. 文献[23]研究了一种考虑车辆数据包传输模式选择的资源分配问题,并将双时间尺度深度强化学习与谱聚类相结合以提高模型的鲁棒性. 文献[24]针对V2V复用V2I链路的频谱问题提出了基于深度Q网络(deep Q-network,DQN)的算法. 虽然该算法考虑了多个优化目标,但是仅考虑各目标权重参数给定的情况,实质上仍是单目标优化. 实际情况下,不同通信链路可能具有不同的传输能力和传输需求,存在多个不一致甚至冲突的优化目标. 在没有预先设定目标权重参数的情况下,对任何目标的优化往往不可避免地会以至少1个其他目标的性能下降为代价. 这种现象是多目标优化领域的典型特征,其中需要在多个通常相互冲突的目标之间寻找平衡点[10]. MO理论已应用于无线通信网络[25]、移动边缘计算系统[26]、车联网[27]. 同样是车联网场景,文献[27]提出了一种基于多目标优化理论的顺序传输决策算法,该算法采用了Lyapunov优化理论与加权切比雪夫(Chebyshev)方法,在保证不同消息的QoS的同时最大化链路的能量有效性. 针对现有文献的分析表明,MO理论与深度强化学习结合的研究工作尚未得到充分的探索.
与已有工作不同,本文针对蜂窝车联网复杂、动态且不确定场景,进行了深入的探讨. 除了涵盖传统网络性能指标,本文还创新性地将AoI纳入优化目标,提出了多目标优化的资源分配问题,并结合多目标进化学习设计了多目标深度强化学习算法,该算法能够实时进行决策,优化V2V无线资源的分配,以满足多样化的QoS要求.
2. 系统模型与问题描述
本节所涉及的主要符号如表1所示.
表 1 主要符号汇总
Table 1. Main Notations Summary
符号	解释
M	V2N链路集合
K	V2V链路集合
V	车辆集合
\gamma _m^{\text{c}}	第m条V2N链路的信干噪比
g_m^{\text{c}}	第m条V2N链路的信道功率增益
\tilde g_{k,m}^{\text{v}}	第k条V2V链路对第m条V2N链路的干扰功率增益
P_m^{\text{c}}	第m条V2N链路的传输功率
P_k^{\text{v}}	第k条V2V链路的传输功率
\rho _{m,k}	第m条V2N链路和第k条V2V链路是否共用信道
C_m^{\text{c}}	第m条V2N链路的传输速率
C_k^{\text{v}}	第k条V2V链路的传输速率
\gamma _k^{\text{v}}	第k条V2V链路的信干噪比
I_k^{\text{c}}	第k条V2V链路受到V2N链路的干扰
I_k^{\text{v}}	第k条V2V链路受到其他V2V链路的干扰
g_k^{\text{v}}	第k条V2V链路的信道功率增益
B_k^t	在时隙t第k条V2V链路的剩余负载量
U_k^t	在时隙t第k条链路的传输延迟容限
L_k^t	在时隙t第k条V2V链路的数据包延迟
A_{i,j}^t	在时隙t车辆i发送的数据在车辆j接收处的AoI
2.1 网络模型
本文研究的网络场景如图2所示,由1个基站(base station,BS)和位于基站通信覆盖范围内的车辆组成,车辆与基站、车辆与车辆之间可以相互通信. 假设基站具有计算和缓存能力. 具体而言,车辆网络包括{m_{\max }}条V2N链路,以集合 M = \{ 1,2, … ,m, … ,{m_{\max }}\} 表示V2N链路序号,以及{k_{\max }}条V2V链路,以集合K = \{ 1,2, … ,k, … ,{k_{\max }}\} 表示V2V链路序号.
在设计的系统中,可以将BS通信范围建模为2维欧氏空间\psi ,在该范围内包含n个车辆,以集合 V = \{ 1,2, … ,n\} 表示车辆序号. 每个车辆 i \in V 由\{ {x_i},{y_i},{o_i}, {v_i},{V_i{\rm ^N}},{V_i{\rm ^T}},{A_i}\} 表示. 其中 {x_i} 和 {y_i} 为欧氏空间坐标,{o_i}为车辆行驶的方向,{v_i}为车辆速度,车辆邻居集合表示为 {V_i{\rm ^N}} ,车辆i的目标通信车辆集合表示为{V_i{\rm ^T}},车辆i发送的数据在其他车辆接收处的AoI表示为{A_i} = \{ {A_{i,1}},{A_{i,2}}, … ,{A_{i,n}}\} ,用以表征车辆之间所传输信息的新鲜程度.
2.2 通信模型
假设每条V2N链路已被预先分配不同的正交子信道以消除网络中V2N链路之间的干扰,即第m条V2N链路占用第m个子信道,保证了链路之间的无干扰. 为提高频谱利用率,假设V2N子信道可以被V2V链路共享,车辆的收发机采用单天线,当第k条V2V链路共享第m条V2N链路的子信道时,这条V2V链路的接收端可能会受到来自相同子信道的其他V2V链路以及V2N链路的发射端的干扰. 因此系统中可能出现3种干扰,分别为:V2N占用的子信道对共享该子信道的V2V的干扰,简称C2V(cellular user-to-vehicle)干扰;V2V占用的子信道对使用该子信道的V2N的干扰,简称V2C(vehicle-to-cellular user)干扰;V2V占用的子信道对占用相同子信道的其他V2V用户对的干扰,简称V2V干扰.
为了便于建模,将连续时间离散化,用t来表示离散化后的时隙,其中每个时隙的持续时间为{t_0}. 在每个时隙t,基站需要为车辆用户对分配传输信道和发射功率,传输信道集合O和发射功率集合P分别表示为O = \left\{ {{O_1},{O_2}, … ,{O_{\max }}} \right\}和P = \left\{ {{P_1},{P_2}, … ,{P_{\max }}} \right\}.
进一步地,定义g_k^{\text{v}}为第k条V2V链路的信道功率增益,g_m^{\text{c}}为第m条V2N链路的信道功率增益,\tilde g_{m,k}^{\text{c}}表示第m条V2N链路对复用该链路的第k条V2V链路的干扰功率增益;\tilde g_{k,k'}^{\text{v}}表示复用相同V2N链路的第k条V2V链路对第k'条V2V链路的干扰功率增益. 上述信道功率增益和干扰功率增益均由快衰落和慢衰落组成. 快衰落部分的主要成因是多径效应,慢衰落部分的主要成因包括路径损耗和阴影衰落.
以第m条V2N链路的信道功率增益g_m^{\text{c}}为例,其计算公式可表示为
g_m^{\text{c}} = h_m^{\text{c}}\alpha _m^{\text{c}}{\text{ = }}h_m^{\text{c}}{\beta _m}d_{m,{\text{B}}}^{ - \chi }, (1) 其中h_m^{\text{c}}是快衰落部分,其服从瑞利(Rayleigh)分布,不同的信道下的快衰落是独立同分布的,\alpha _m^{\text{c}}是慢衰落部分,{\beta _m}是具有标准差\xi 的对数正态阴影衰落, {d_{m,{\text{B}}}} 是信号发射机和接收机之间的欧氏距离, \chi 是路径损耗分量的衰减指数.
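作为示意,式(1)的信道功率增益采样可写成如下Python片段(其中路径损耗指数chi的取值仅为示例假设,阴影衰落标准差xi_db参考表3取3 dB;函数名与接口均为说明用途而设):

```python
import random

def channel_power_gain(d, chi=3.0, xi_db=3.0, rng=random):
    """按式(1)采样信道功率增益 g = h * beta * d^(-chi) 的示意实现.
    h 为快衰落功率(瑞利幅度对应的功率服从指数分布), beta 为标准差为
    xi_db(dB)的对数正态阴影衰落, d 为收发端欧氏距离, chi 为路径损耗指数."""
    h = rng.expovariate(1.0)                       # 瑞利衰落: 功率服从单位均值指数分布
    beta = 10.0 ** (rng.gauss(0.0, xi_db) / 10.0)  # dB 域高斯 -> 线性域对数正态
    return h * beta * d ** (-chi)
```

在同一随机种子下,距离扩大10倍时增益恰好按 d^{-\chi} 缩小1000倍,可用于核对路径损耗项的实现.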
以V2N链路占用的子信道对共享该子信道的V2V链路的干扰为例,其干扰功率的计算公式可表示为
I_k^{\text{c}} = \displaystyle\sum\limits_{m = 1}^{{m_{\max }}} {{\rho _{m,k}}P_m^{\text{c}}\tilde g_{m,k}^{\text{c}}} , (2) 其中P_m^{\text{c}}是第m条V2N链路的发射功率,{\rho _{m,k}}表示第m条V2N链路和第k条V2V链路是否共用信道,满足式(3):
{\rho }_{m,k}=\left\{\begin{aligned} &1,\;\; 第k条\text{V}2\text{V}链路重用第m条\text{V}2\text{N}链路的子信道, \\ &0,\;\; 其他. \end{aligned}\right. (3) 对于第m条V2N链路而言,信干噪比\gamma _m^{\text{c}}(signal-to-interference-plus-noise ratio,SINR)可表示为
\gamma _m^{\text{c}} = \dfrac{{P_m^{\text{c}}g_m^{\text{c}}}}{{{\sigma ^2} + \displaystyle\sum\limits_{k = 1}^{{k_{\max }}} {{\rho _{m,k}}P_k^{\text{v}}\tilde g_{k,m}^{\text{v}}} }}, (4) 其中{\sigma ^2}表示加性高斯白噪声功率, P_m^{\text{c}} 和 P_k^{\text{v}} 分别表示第m条V2N链路和第k条V2V链路的发射功率.
对于第k条V2V链路,其SINR可表示为
\gamma _k^{\text{v}} = \dfrac{{P_k^{\text{v}}g_k^{\text{v}}}}{{{\sigma ^2} + I_k^{\text{c}} + I_k^{\text{v}}}}. (5) 根据上述V2N链路和V2V链路的SINR,可以得出V2V链路复用V2N链路时,第m条V2N链路的传输速率C_m^{\text{c}}和第k条V2V链路的传输速率C_k^{\text{v}}的表达式分别为
C_m^{\text{c}} = W \times {\text {lb}}\left( {1 + \gamma _m^{\text{c}}} \right), (6) C_k^{\text{v}} = W \times {\text {lb}}\left( {1 + \gamma _k^{\text{v}}} \right), (7) 其中W表示信道带宽.
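式(4)(6)(7)的SINR与速率计算可用如下示意代码核对(interferers的数据结构为说明用途而设,lb即以2为底的对数):

```python
import math

def v2n_sinr(P_c, g_c, noise, interferers=()):
    """按式(4)计算第 m 条 V2N 链路的 SINR.
    interferers 为复用该子信道的 V2V 链路干扰项列表, 每项为 (rho, P_v, g_interf),
    分别对应共用信道指示 rho_{m,k}、V2V 发射功率与干扰功率增益."""
    interference = sum(rho * P_v * g for rho, P_v, g in interferers)
    return (P_c * g_c) / (noise + interference)

def link_rate(W, sinr):
    """按式(6)(7)计算传输速率, 其中 lb 为以 2 为底的对数."""
    return W * math.log2(1.0 + sinr)
```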
因此,第k条V2V链路的有效传输概率{p_k}为
{p_k} = Pr\left\{ {R_k^{\text{v}} \geqslant {R_{\text{T}}}} \right\} = Pr\left\{ {\dfrac{1}{{{T_{\text{d}}}}}\displaystyle\sum\limits_{t = 1}^{{T_{\text{d}}}} {C_k^{{\text{v}},t} \geqslant \dfrac{{{B_0}}}{{{U_0}}}} } \right\}, (8) 其中R_k^{\text{v}}表示第k条V2V链路的有效传输速率,{R_{\text{T}}}表示有效传输速率阈值,C_k^{{\text{v}},t}表示在时隙t第k条V2V链路传输速率,{B_0}表示数据包的大小,{U_0}表示传输时延约束,时长为{T_{\rm d}}个时隙.
对于V2V链路而言,还需要关注其延迟性能. 假设只考虑数据包的传输延迟. 定义{U_0}为数据包生成时的延迟容限,U_k^t表示在时隙t第k条V2V链路的延迟容限,更新公式为
U_k^{t + 1} = U_k^t - {t_0}, (9) 其中{t_0}为1个时隙的长度. 如果在时隙t的前一个时隙内,第k条V2V链路完成了数据包传输,那么该数据包的延迟L_k^t可表示为
L_k^t{\text{ = }}{U_0} - U_k^t. (10) 如果在时隙t有U_k^t \leqslant 0,则说明该数据包的传输时间超过了延迟容限,视为传输失败,不再继续传输该数据包.
2.3 AoI演进模型
本节介绍AoI以及V2V链路平均AoI的定义及其计算方式.
定义B_k^t为时隙t第k条V2V链路传输数据包的剩余负载量,B_k^t的更新公式为
{B}_{k}^{t+1}=\left\{\begin{aligned} &{B}_{0},\;\;\;\;\quad\quad\quad {B}_{k}^{t} < {C}_{k}^{\text{v},t}{t}_{0},\\ &{B}_{k}^{t}-{C}_{k}^{\text{v},t}{t}_{0},\;\; 其他,\end{aligned} \right. (11) 其中{B_0}为初始负载量. 若在当前时隙t内完成了数据包的传输,则在下一个时隙t + 1将该用户的负载重置为初始负载量{B_0};否则,在B_k^t的基础上减去时隙t内传输的数据量 C_k^{{\text{v}},t}{t_0} .
集合{A_i}表示车辆i发送的数据在其他车辆接收处的AoI,所以集合{A_i}中元素的个数即为车辆数量n,其中 {A_{i,j}} 表示V2V链路中车辆i发送的数据在车辆j接收处的AoI. A_{i,j}^t表示在时隙t车辆i发送的数据在车辆j接收处的AoI,A_{i,j}^{t + 1}的计算如式(12)所示:
A_{i,j}^{t + 1} = \left\{ {\begin{aligned} & L_{i,j}^{t + 1}, \;\;\;\quad B_{i,j}^t \leqslant C_{i,j}^{{\text{v}},t}{t_0},\;\;d_{i,j}^{t + 1} \leqslant {d_{\text{c}}}, \\ & A_{i,j}^t + {t_0}, \;\; B_{i,j}^t > C_{i,j}^{{\text{v}},t}{t_0},\;\;d_{i,j}^{t + 1} \leqslant {d_{\text{c}}}, \\ & 0, \;\;\;\quad\quad d_{i,j}^{t + 1} > {d_{\text{c}}}, \end{aligned}} \right. (12) 其中 L_{i,j}^{t + 1}表示数据包的延迟, d_{i,j}^{t+1} 表示车辆i和车辆j之间的欧氏距离,{d_{\text{c}}}表示车辆的通信距离. 当车辆i和车辆j位于通信距离内,如果前一个时隙内有数据包传输成功, A_{i,j}^{t + 1} 为该数据包的延迟,否则A_{i,j}^{t + 1}随时隙数不断累加;当车辆i和车辆j位于通信距离外,A_{i,j}^{t + 1} = 0.
V2V链路平均AoI可表示为
{{\bar A}^t} = \dfrac{1}{{cnt}}\displaystyle\sum\limits_{i = 1}^n {\displaystyle\sum\limits_{\begin{subarray}{l} j = 1 \\ j \ne i \end{subarray}} ^n {\left( {A_{i,j}^t + A_{j,i}^t} \right)} } , (13) 其中 cnt 表示AoI非0值的个数,可由式(14)计算得到:
cnt = \displaystyle\sum\limits_{i = 1}^n {\displaystyle\sum\limits_{j = 1}^n {\alpha _{i,j}^t} } , (14) 其中\alpha _{i,j}^t \in \left\{ {0,1} \right\}表示车辆i和车辆j是否位于通信距离内,若 d_{i,j}^t \leqslant {d_{\text{c}}} ,则\alpha _{i,j}^t = 1,否则\alpha _{i,j}^t = 0.
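式(12)~(14)的AoI演进与平均值计算可用如下示意代码实现(逐链路更新与按式(13)(14)求平均;此处假设 i=j 的项不计入统计,函数接口为说明用途而设):

```python
def update_aoi(A, L_next, done, d_next, d_c, t0):
    """按式(12)更新单条 (i,j) 链路的 AoI.
    A: 当前 A_{i,j}^t; L_next: 上一时隙完成传输的数据包延迟; done: 上一
    时隙是否完成传输; d_next: t+1 时刻车辆间距; d_c: 通信距离; t0: 时隙长度."""
    if d_next > d_c:
        return 0.0            # 超出通信距离, AoI 置 0, 不参与平均值统计
    return L_next if done else A + t0

def average_aoi(A, in_range):
    """按式(13)(14)计算 V2V 链路平均 AoI.
    A[i][j] 为车辆 i 的数据在车辆 j 处的 AoI, in_range[i][j] 为两车是否
    位于通信距离内."""
    n = len(A)
    cnt = sum(1 for i in range(n) for j in range(n) if j != i and in_range[i][j])
    total = sum(A[i][j] + A[j][i] for i in range(n) for j in range(n) if j != i)
    return total / cnt if cnt else 0.0
```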
2.4 MO-WRA问题描述
本节提出多目标优化的无线资源分配MO-WRA问题,基站根据车辆传输的状态信息,给该车辆分配信道和发射功率,以实现优化目标,如式(15)所示:
\left\{\begin{aligned} & \mathop {{\text{maximize }}}\limits_{\left\{ {P_k^{\rm v},\;{\rho _{m,k}}} \right\}} \left( {\displaystyle\sum\limits_{t = 1}^\infty {{V^t}} ,\displaystyle\sum\limits_{t = 1}^\infty {{Y^t}} ,\displaystyle\sum\limits_{t = 1}^\infty {Z_k^t} ,\displaystyle\sum\limits_{t = 1}^\infty {{S^t}} } \right) \\ & {\text{s.t.}} \left\{\begin{aligned} &{{\text{C}}_{\text{1}}}:L_k^t \leqslant {U_0},\forall k \in K,\forall t \in {\mathbb{N}_ + }; \\ & {{\text{C}}_{\text{2}}}:{p_0} \leqslant {p_k},\forall k \in K; \\ & {{\text{C}}_3}:P_k^{\rm v} \in P,\forall k \in K; \\ & {{\text{C}}_4}:{\rho _{m,k}} \in \left\{ {0,1} \right\},\forall m \in M,\forall k \in K. \end{aligned}\right. \end{aligned} \right. (15) 约束{{\rm C}_1}表示V2V链路延迟约束;约束{\rm C_2}表示V2V链路的有效传输概率约束;约束{\rm C_3}表示V2V链路发射功率约束;约束{\rm C_4}表示V2V链路是否复用V2N链路子信道.
式(15)中 P_k^{\rm v},{\rho _{m,k}} 分别表示第k条V2V链路的发射功率和频谱分配. 针对优化目标中的V2N链路传输速率、V2V链路传输速率、V2V链路延迟以及V2V链路平均AoI这4个指标,分别定义{V^t}为时隙t的V2N链路传输速率效用函数,{Y^t}为时隙t的V2V链路传输速率效用函数, Z_k^t 为时隙t的第k条V2V链路的延迟效用函数,{S^t}为时隙t的V2V链路平均AoI效用函数. {V^t},{Y^t},Z_k^t,{S^t}分别表示为
{V^t} {\buildrel \Delta \over = } \displaystyle\sum\limits_{m = 1}^{{m_{\max }}} {\dfrac{{C_m^{{\text{c}},t} - C_{\min }^{\text{c}}}}{{C_{\max }^{\text{c}} - C_{\min }^{\text{c}}}}} , (16) {Y^t} {\buildrel \Delta \over = } \displaystyle\sum\limits_{k = 1}^{{k_{\max }}} {\dfrac{{C_k^{{\text{v}},t} - C_{\min }^{\text{v}}}}{{C_{\max }^{\text{v}} - C_{\min }^{\text{v}}}}} , (17) Z_k^t {\buildrel \Delta \over = } \exp \left( { - L_k^t} \right), (18) {S^t} {\buildrel \Delta \over = } \exp \left( { - \left( {{{\bar A}^t} - {{\bar A}^{t - 1}}} \right)} \right). (19)
3. 多目标深度强化学习算法
针对多目标优化的无线资源分配MO-WRA问题,本节进一步设计了基于注意力机制的多目标近端策略优化算法(multi-objective proximal policy optimization algorithm based on attention mechanism,MOPPO-AM),如3.1节所述,该算法包含2个阶段. 在第1阶段,将MO-WRA问题划分为N个单目标优化的子问题,并将该子问题表述为马尔可夫决策过程,如3.2节所述. 通过基于注意力机制的近端策略优化算法(proximal policy optimization algorithm based on attention mechanism,PPO-AM)进行训练,使用的神经网络模型如3.3节所述. 将所有训练好的子问题神经网络模型集作为第2阶段的初始种群,如3.4节所述. 在第2阶段,使用进化学习找到多目标主问题的帕累托前沿,如3.5节所述.
3.1 MOPPO-AM 算法
MOPPO-AM算法训练过程如算法1所示,该算法通过近端策略优化(proximal policy optimization,PPO)[28]算法来训练基于卷积块注意力模块(convolutional block attention module,CBAM)[29-30]的模型. 然后,将训练好的子问题模型集作为进化学习的初始种群. 进化学习使用近端变异[31]产生后代,基于非支配排序遗传算法Ⅱ(nondominated sorting genetic algorithm Ⅱ,NSGA-Ⅱ)[32]的非支配水平和拥挤距离排序选择模型,最终得到多目标优化问题的一组非支配解,也称为帕累托前沿.
算法1. MOPPO-AM算法.
输入:子问题模型集 \varPi = \left\{ {{\pi _1},{\pi _2}, … ,{\pi _N}} \right\} ,权重向量集合 \varLambda = \left\{ {{{\boldsymbol \lambda} _1},{{\boldsymbol \lambda} _2}, … ,{{\boldsymbol \lambda} _N}} \right\} ,进化学习中的最大迭代次数{I_{\max }},权重向量个数为N;
输出:优化后的模型集 {\varPi ^ * } = \left\{ {\pi _1^ * ,\pi _2^ * , … ,\pi _N^ * } \right\} .
① 随机初始化 \varPi 中的模型参数;
② for i = 1,2, … ,N do
③ 使用PPO-AM算法训练策略 {\pi _i} (3.4节);
④ end for
⑤ 使用PPO-AM算法输出的策略作为进化学习 的初始种群;
⑥ for i = 1,2, … ,{I_{\max }} do
⑦ 使用进化学习进化子问题模型(3.5节);
⑧ end for
⑨ return {\varPi ^ * } = \left\{ {\pi _1^ * ,\pi _2^ * , … ,\pi _N^ * } \right\} .
3.2 马尔可夫决策过程建模
单目标优化子问题表述为马尔可夫决策过程,通常由如下所示的五元组来表征:
(S,A,R,\gamma ,p) , (20) 其中S是状态空间,A是动作空间,R是奖励函数,\gamma \in \left( {0,1} \right)为折扣因子,其体现了智能体对即时奖励和未来奖励的权衡,转移概率p是智能体执行某个动作后从一个状态转移到下一个状态的概率.
状态 s_k^t \in S 表示在时隙t的第k条链路的状态,状态总共包含3类信息. 其中:第1类信息为整体的信道和干扰信息,包含V2V链路的信道信息{G^t}、V2N链路的信道信息{H^t}、前一个时隙的干扰功率信息 {I^{t - 1}} ,以及前一个时隙邻居车辆选择的信道信息N_k^{t - 1};第2类信息为与资源分配相关的数据包状态信息,包含待发送的信息的剩余负载 B_k^t 以及延迟容限U_k^t;第3类信息是V2V链路关联的AoI A_k^t . 综上所述,s_k^t表达式为
s_k^t{\text{ = }}\left\{ {{G^t},{H^t},{I^{t - 1}},N_k^{t - 1},B_k^t,U_k^t,A_k^t} \right\}. (21) 动作a_k^t \in A 包含信道选择和发射功率选择2个维度.
a_k^t \in \left\{ {1,2, … ,3 \times \left| O \right|} \right\}. (22) 第1个维度信道数量 \left| O \right| 是有限的,第2个维度发射功率包含3个等级,因此动作空间大小为3 \times \left| O \right|. 在具体的实现过程中,使用 \left[ {1,3 \times \left| O \right|} \right] 范围内的整数对动作进行编号. 对于具体的动作a_k^t,使用a_k^t\% \left| O \right|表示选择的信道编号,使用 \left\lfloor {{{a_k^t} / {\left| O \right|}}} \right\rfloor 表示选择的发射功率在发射功率列表中的序号.
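上述动作解码可用如下示意代码实现(正文采用从1开始的编号,此处按工程实现中更常见的从0开始的编号 a ∈ [0, 3|O|−1] 给出;power_levels 的取值(dBm)仅为示例假设):

```python
def decode_action(a, num_channels, power_levels=(23, 10, 5)):
    """将扁平动作编号 a 解码为 (信道编号, 发射功率) 的示意实现.
    a % |O| 给出信道编号, a // |O| 给出功率等级在功率列表中的序号."""
    channel = a % num_channels
    power_idx = a // num_channels
    return channel, power_levels[power_idx]
```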
奖励函数 R\left( {s_k^t,a_k^t} \right) 表示在状态s_k^t且采取动作a_k^t的情况下,即时奖励的期望值. 在时隙t的第k条链路的即时奖励r_k^t的设计为
r_k^t = {\lambda _V}{V^t} + {\lambda _Y}{Y^t} + {\lambda _Z}Z_k^t + {\lambda _S}{S^t}. (23) 在奖励函数中,使用链路传输速率作为正向奖励,以此衡量当前决策对其余链路的干扰影响程度. V2V链路延迟计算由式(10)给出. 对于V2V链路平均AoI,将其进行差分处理,即将相邻时隙V2V链路平均AoI的变化量 {\bar A^t} - {\bar A^{t - 1}} 作为惩罚项参与奖励函数,由式(19)给出. 通过差分处理,可以减少即时奖励的方差,从而提高训练的稳定性. 与此同时,差分处理实际上将本工作的优化目标从直接最小化V2V链路平均AoI转换为最小化平均AoI的增长,这种转换可以更好地引导智能体的训练过程.
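将式(16)~(19)的效用项按式(23)加权组合,可写成如下示意代码(权重向量 lam 的取值仅为示例假设,实际训练中由子问题的权重向量给定):

```python
import math

def instant_reward(V_t, Y_t, L_k, aoi_diff, lam=(0.25, 0.25, 0.25, 0.25)):
    """按式(23)计算第 k 条链路的即时奖励 r_k^t 的示意实现.
    V_t, Y_t: 已按式(16)(17)归一化的 V2N/V2V 速率效用; L_k: 式(10)的
    数据包延迟; aoi_diff: 相邻时隙平均 AoI 之差(差分处理后的惩罚项)."""
    Z = math.exp(-L_k)          # 式(18): 延迟效用
    S = math.exp(-aoi_diff)     # 式(19): 平均 AoI 差分效用
    lv, ly, lz, ls = lam
    return lv * V_t + ly * Y_t + lz * Z + ls * S
```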
3.3 神经网络模型
为了提升算法实时决策能力,算法所采用的神经网络模型引入注意力机制,这有利于网络提取重要的状态信息. CBAM具有空间和通道注意力,可以关注关键特征,忽略无用特征,使网络更加灵活地适应不同任务和场景,以实现更迅速、更灵活的决策. 卷积神经网络生成特征图之后,CBAM通过2个独立的注意力机制,分别从通道和空间维度对特征图进行加权,以实现自适应特征强化. 这个轻量级的通用模块可以集成到各种卷积神经网络中进行端到端训练,如图3所示.
给定输入特征图{\boldsymbol{F}},通道注意力模块推断通道注意力向量,衡量每个通道的重要性. 空间注意力模块推导出2维空间注意力图,帮助模型更好地理解和利用输入中的空间信息,从而提高了网络的性能和效率. 加权后的特征图为
{\boldsymbol{F'}} = {M_{\text{c}}}\left( {\boldsymbol{F}} \right) \otimes {\boldsymbol{F}}, (24) {\boldsymbol{F''}} = {M_{\text{s}}}\left( {{\boldsymbol{F'}}} \right) \otimes {\boldsymbol{F'}}, (25) 其中 {M_{\text{c}}} 和 {M_{\text{s}}} 分别表示基于通道的和基于空间的注意力, \otimes 表示逐元素乘法, {\boldsymbol{F'}} 和 {\boldsymbol{F''}} 分别表示进行了通道注意力和空间注意力后的输出特征图. 通道注意力模块关注输入数据中有意义的内容,表示为
{M_{\text{c}}}\left( {\boldsymbol{F}} \right) = \sigma \left( {MLP\left( {avgpool\left( {\boldsymbol{F}} \right)} \right) + MLP\left( {maxpool\left( {\boldsymbol{F}} \right)} \right)} \right), (26) 其中 \sigma (\cdot ) 表示Sigmoid函数, maxpool(\cdot ) 表示最大池化, avgpool(\cdot ) 表示平均池化, MLP(\cdot ) 表示多层感知器.
空间注意力模块关注输入数据中更有意义的位置,是对通道注意力的补充,表示为
{M_{\text{s}}}\left( {\boldsymbol{F}} \right) = \sigma \left( {conv\left( {\left[ {avgpool\left( {\boldsymbol{F}} \right);maxpool\left( {\boldsymbol{F}} \right)} \right]} \right)} \right), (27) 其中 conv(\cdot ) 表示卷积层.
3.4 基于PPO-AM的单目标子问题训练算法
在第1阶段,利用PPO-AM算法训练单目标子问题模型. PPO-AM是一种在线深度强化学习算法. 在线学习意味着智能体通过与环境的交互更新策略,而不是像批量学习那样积累一些经验,然后进行一次性的更新. 更新策略的过程是连续进行的,智能体通过不断地与环境交互和从经验池中获得轨迹信息以逐步地改进其策略. 在PPO-AM算法的每次迭代中,基站作为智能体,具有计算和缓存能力,能够使用当前的策略与环境交互,收集经验数据. 然后,利用这些数据,不断逼近状态价值函数和动作价值函数以寻找最优的资源分配策略,如图3所示.
PPO-AM算法是基于策略优化的PPO算法实现的. 其中PPO算法是一种基于策略优化的深度强化学习算法,是在Actor-Critic框架的基础上发展起来的,主要用于训练智能体在环境中采取最优动作策略. PPO算法由2个策略网络和1个价值网络组成,2个策略网络即新策略和旧策略,比较新策略和旧策略之间的差异,并根据这种差异来确定策略参数的更新方向,这有助于限制策略更新的大小,维持训练的稳定性. PPO算法旨在最大化策略的累积奖励,如式(28)所示,该算法引入状态价值函数、动作价值函数和优势函数更新Actor网络和Critic网络,以找到最优资源分配策略.
\begin{split} & J({\boldsymbol \theta} ) = \\ & {E_t}\left[ {\min \left( {\dfrac{{{\pi _{\boldsymbol \theta} }({a_t}|{s_t})}}{{{\pi _{{\boldsymbol \theta} _{\text{old}}}}({a_t}|{s_t})}}{{\hat A}_t},clip\left( {\dfrac{{{\pi _{\boldsymbol \theta} }({a_t}|{s_t})}}{{{\pi _{{\boldsymbol \theta} _{\text{old}}}}({a_t}|{s_t})}},1 - {\boldsymbol \varepsilon} ,1 + {\boldsymbol \varepsilon} } \right){{\hat A}_t}} \right)} \right],\\ \end{split} (28) 其中J({\boldsymbol \theta} )是目标函数,表示期望累计奖励. {\pi _{\boldsymbol \theta} }({a_t}|{s_t}) 是在状态{s_t}下采取动作{a_t}的策略函数. {\pi _{{\boldsymbol \theta} _{\text{old}}}}({a_t}|{s_t}) 是旧策略函数,即在更新前的策略. {\hat A_t} 是优势函数,表示在状态{s_t}下采取动作{a_t}的优势. clip函数确保新策略和旧策略之间的比值在预定义的范围\left[ {1 - { \varepsilon} ,1 + { \varepsilon} } \right]内. 这有助于防止策略更新过大,从而提高训练的稳定性, { \varepsilon} 是用于限制策略更新幅度的超参数.
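式(28)中单个样本的裁剪目标可用如下示意代码核对(eps=0.2为常用的裁剪超参数示例取值,并非论文给定参数):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """式(28)单样本裁剪目标的示意实现:
    min(r * A, clip(r, 1-eps, 1+eps) * A), 其中 r 为新旧策略的概率比."""
    ratio = math.exp(logp_new - logp_old)            # r = pi_theta / pi_theta_old
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(r, 1-eps, 1+eps)
    return min(ratio * advantage, clipped * advantage)
```

当优势为正且概率比超过 1+eps 时目标被截断,不再提供额外激励;当优势为负时取更小的一项,同样抑制过大的策略更新.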
状态价值函数是指在状态{s_t}下,遵循策略 {\pi _{\boldsymbol \theta} } 能够获得的期望奖励,如式(29)所示:
V_{{\pi _{\boldsymbol \theta} }}^{\boldsymbol \phi} \left( {{s_t}} \right) = {E_{{\pi _{\boldsymbol \theta} }}}\left[ {\displaystyle\sum\limits_{i = 0}^\infty {{\gamma ^i}{r_{t + i + 1}}|s = {s_t}} } \right], (29) 其中 {r_{t + i + 1}} = \displaystyle\sum\limits_{k = 1}^{{k_{\max }}} {r_k^{t + i + 1}} . \gamma 是折扣因子,用于计算累计奖励,将未来的奖励进行折现. 折扣因子决定了未来奖励的重要性,较小的折扣因子会降低对未来奖励的重视,使智能体更倾向于采取即时奖励更高的动作.
动作价值函数是指在状态{s_t}下,执行动作{a_t}之后,遵循策略 {\pi _{\boldsymbol \theta} } 能够获得的期望奖励,如式(30)所示:
{Q_{{\pi _{\boldsymbol \theta} }}}\left( {{s_t},{a_t}} \right) = {E_{{\pi _{\boldsymbol \theta} }}}\left[ {\displaystyle\sum\limits_{i = 0}^\infty {{\gamma ^i}} {r_{t + i + 1}}|s = {s_t},a = {a_t}} \right]. (30) 优势函数是指当前状态下采取某个动作相对于采取平均策略的优势,计算方式如式(31)所示:
{\hat A_t} = {Q_{{\pi _{\boldsymbol \theta} }}}\left( {{s_t},{a_t}} \right) - V_{{\pi _{\boldsymbol \theta} }}^{\boldsymbol \phi} \left( {{s_t}} \right). (31) 优势函数 {\hat A_t} 越大,意味着当前动作相对于平均水平更好. 在训练过程中,智能体倾向于更频繁地选择具有更大优势的动作. 这可能表明在当前状态下采取该动作更有可能获得更高的奖励或更好的长期奖励.
网络参数{\boldsymbol \theta} 通过式(32)进行更新.
{\boldsymbol \theta} = {{\boldsymbol \theta} _{{\text{old}}}} + \delta {{\hat {\boldsymbol g}}_{\text{actor}}}, (32) 其中{\boldsymbol \theta} 和{{\boldsymbol \theta} _{{\text{old}}}}分别代表新旧策略的参数. \delta 代表参数学习率,表示参数更新的快慢. {\hat {\boldsymbol g}_{\text{actor}}} 是策略梯度,用以更新参数的依据,如式(33)所示:
\begin{split} {{\hat {\boldsymbol g}}_{\text{actor}}} = & {\nabla _{\boldsymbol \theta} }{L^{clip}}({\boldsymbol \theta} ) = \\ & {E_t}\Bigg[ {\nabla _{\boldsymbol \theta} }\Bigg( \min \Bigg( \dfrac{{{\pi _{\boldsymbol \theta} }({a_t}|{s_t})}}{{{\pi _{{\boldsymbol \theta} _{\text{old}}}}({a_t}|{s_t})}}{{\hat A}_t},clip\Bigg( \dfrac{{{\pi _{\boldsymbol \theta} }({a_t}|{s_t})}}{{{\pi _{{\boldsymbol \theta} _{\text{old}}}}({a_t}|{s_t})}},1 -\\ &{\varepsilon} ,1 + {\varepsilon} \Bigg){{\hat A}_t} \Bigg) \Bigg) \Bigg]. \\[-1pt] \end{split} (33) 函数clip的作用是将新旧策略的比值限制在区间\left[ {1 - { \varepsilon} ,1 + { \varepsilon} } \right]内,避免了更新步长过大引起的不稳定性,增强了算法的收敛性. 具体地,当优势函数{\hat A_t}为正值时,需要增大新旧策略的比值,而比值大于1 + { \varepsilon} 时,将不提供额外的激励;当优势函数{\hat A_t}为负值时,需要减少新旧策略的比值,而比值小于1 - { \varepsilon} 时,将不再提供额外的激励.
对于Critic网络,网络参数{\boldsymbol \phi} 通过式(34)更新:
{\boldsymbol \phi} ' = {\boldsymbol \phi} - \eta {\hat g_{\text{critic}}}, (34) 其中{\boldsymbol \phi} 和{\boldsymbol \phi} '分别代表Critic网络更新前后的网络参数,\eta 代表参数学习率. 关于策略梯度 {\hat {\boldsymbol g}_{\text{critic}}} ,利用均方误差进行计算,如式(35)所示:
\begin{split} {{\hat {\boldsymbol g}}_{\text{critic}}} =& {\nabla _{\boldsymbol \phi} }{L^{\rm BL}}\left( {\boldsymbol \phi} \right) = \\ & {E_t}{\nabla _{\boldsymbol \phi} }\left( {{{\left( {\displaystyle\sum\limits_{i = 0}^\infty {{\gamma ^i}} {r_{t + i + 1}} - {V_{\boldsymbol \phi} }\left( {{s_t}} \right)} \right)}^2}} \right) = \\ & {E_t}\left[ {2\left( {\displaystyle\sum\limits_{i = 0}^\infty {{\gamma ^i}} {r_{t + i + 1}} - {V_{\boldsymbol \phi} }\left( {{s_t}} \right)} \right){\nabla _{\boldsymbol \phi} }{V_{\boldsymbol \phi} }\left( {{s_t}} \right)} \right]. \end{split} (35) PPO-AM算法伪代码如下所示:
算法2. PPO-AM算法.
输入:车辆状态和信道状态的集合 S ,Actor网络参数{\boldsymbol \theta} ,Critic网络参数{\boldsymbol \phi} ,最大训练次数J,每局的时隙数 T ,V2V链路数{k_{\max }},奖励折扣因子\gamma ,学习率\alpha ,经验回放缓冲区D,经验回放缓冲区大小 {D_{\max }} ,网络参数更新次数 {L_{{\text{epoch}}}} ;
输出:训练完成的策略{\pi _{\boldsymbol \theta} }.
① 随机初始化Actor网络及其参数、Critic网络 及其参数;
② for j = 1 to J do
③ 初始化环境;
④ {\pi _{{\boldsymbol \theta} _{\text{old}}}} \leftarrow {\pi _{\boldsymbol \theta} } ;
⑤ for t = 1 to T do
⑥ for k = 1 to {k_{\max }} do
⑦ s_k^t \leftarrow getState(k);
⑧ 通过Actor网络得到策略{\pi _{\boldsymbol \theta} },进而得到 动作a_k^t;
⑨ 当前V2V链路执行动作a_k^t,计算奖励r_k^t;
⑩ 将数据 \left\{ {s_k^t,a_k^t,r_k^t,s_k^{t{\text{ + 1}}}} \right\} 存入D;
⑪ end for
⑫ end for
⑬ if \left| D \right| = {D_{\max }}
⑭ for l = 1 to {L_{{\text{epoch}}}} do
⑮ 计算{Q_{{\pi _{\boldsymbol \theta} }}};/*式(30)*/
⑯ 计算 V_{{\pi _{\boldsymbol \theta} }}^{\boldsymbol \phi} ({s_t}) , {\hat A_t} ;/*式(29)(31)*/
⑰ 更新Actor网络参数{\boldsymbol \theta} 和Critic网络参 数{\boldsymbol \phi} ;/*式(32)~(35)*/
⑱ end for
⑲ end if
⑳ end for
3.5 基于进化学习的多目标主问题训练算法
在第2阶段,首先将第1阶段训练的子问题模型集作为初始种群. 在每一代,每个单独的模型通过近端变异生成1个后代,称为子模型,该后代采用SM-G-SUM(safe mutation through gradients-summed gradient variant)变异算子[33]将缩放的高斯扰动添加到模型参数 {\boldsymbol \theta} 中,如式(36)所示:
{\boldsymbol \theta} ' = {\boldsymbol \theta} + \dfrac{x}{{\boldsymbol \tau } },x \sim {{\mathcal{N}}}\left( {0,\mu } \right) , (36) 其中 \mu 是变异幅度超参数,{\boldsymbol \tau } 是子模型网络参数的敏感度向量[33].
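式(36)的缩放高斯扰动可用如下示意代码实现(mu 的取值与敏感度向量 tau 的具体计算方式均为示例假设,完整的 SM-G-SUM 算子见文献[33]):

```python
import random

def safe_mutate(theta, sensitivity, mu=0.1, rng=random):
    """按式(36)对参数施加缩放高斯扰动的示意实现:
    theta' = theta + x / tau, x ~ N(0, mu).
    敏感度 tau 越大的参数, 扰动幅度越小, 从而避免破坏已学到的策略."""
    return [t + rng.gauss(0.0, mu) / max(tau, 1e-8)
            for t, tau in zip(theta, sensitivity)]
```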
具体来说,子问题模型集\varPi 作为初始种群输入到算法3中,\varPi 中每个模型作为进化学习中的“个体”. 然后,计算种群中个体的适应度向量(目标向量),并用适应度向量标记个体. 最后,执行基于多目标的选择. 具体地,采用NSGA-Ⅱ根据适应度向量的非支配水平和拥挤距离对所有个体进行排序,并保留N个个体. 算法3是进化学习的伪代码. 基于进化学习的训练算法过程如图4所示.
算法3. 进化学习.
输入:子问题模型集 \varPi = \left\{ {{\pi _1},{\pi _2}, … ,{\pi _N}} \right\} ,其中模型 {\pi _i} 有参数 {{\boldsymbol \theta} _i} ,训练批大小为 {B_{\rm e}} ,最大迭代次数为 {I_{\max }} ,权重向量个数为N,非支配策略集合EP;
输出:优化后的模型集 {\varPi ^ * } = \left\{ {\pi _1^ * ,\pi _2^ * , … ,\pi _N^ * } \right\} 和非支配策略集合EP.
① 初始化非支配策略集合 EP = \varnothing ;
② for i = 1,2, … ,N do
③ {{\boldsymbol{f}}_{{\pi _i}}} = Evaluate\left( {{\pi _i}} \right) ;
④ end for
⑤ for g = 1,2, … ,{I_{\max }} do
⑥ for i = 1,2, … ,N do
⑦ 对训练数据随机采样 {G_j} = RandomInstance(\; ) ,\forall j \in \left\{ {1,2, … ,{B_{\rm e}}} \right\};
⑧ 计算敏感度向量{\boldsymbol \tau} ;
⑨ {{\boldsymbol \theta} '_i} = {{\boldsymbol \theta} _i} + \dfrac{x}{{\boldsymbol \tau} } ;
⑩ 产生带有参数{{\boldsymbol \theta} '_i} 的后代{\pi'_i} ,并且 {{\boldsymbol f}_{{\pi_i'}}} = Evaluate\left( {{\pi_i'}} \right) ;
⑪ end for
⑫ 根据适应度向量的非支配水平和拥挤距离 对个体进行排序,并选择N个个体作为 下一代;
⑬ 根据 {\pi '_i} 更新非支配策略集合EP;
⑭ end for
⑮ return {\varPi ^ * } = \left\{ {\pi _1^ * ,\pi _2^ * , … ,\pi _N^ * } \right\} .
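算法3中对非支配策略集合EP的更新可用如下示意代码说明(仅给出非支配解筛选,NSGA-Ⅱ完整的非支配排序与拥挤距离计算此处从略;约定各目标均为最大化):

```python
def dominates(f1, f2):
    """判断目标向量 f1 是否支配 f2: 各目标均不劣且至少一个严格更优."""
    return all(a >= b for a, b in zip(f1, f2)) and any(a > b for a, b in zip(f1, f2))

def pareto_front(population):
    """从 (个体, 适应度向量) 列表中筛选非支配解集(帕累托前沿)的示意实现."""
    return [(ind, f) for ind, f in population
            if not any(dominates(g, f) for _, g in population)]
```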
4. 实验设置与结果分析
本节包括实验环境设置、实验对比算法、实验评估指标和实验结果分析.
4.1 实验环境设置
实验基于TensorFlow 2.7框架和Python 3.9环境,采用NVIDIA GeForce GTX 1650 GPU和Intel Core i5-10400 CPU. 仿真实验中的系统道路模型参考3GPP TR 36.885[34]的城市案例设计,如表2所示. 环境中相关参数的设定参考3GPP TR 36.885中曼哈顿参数,如表3所示. 参考该模型,设计得到的道路模型如图5所示.
表 2 道路模型参数
Table 2. Road Model Parameters
参数	城市案例
车道数	每个方向2个车道,每条街道共计4个车道
车辆数量	60
车道宽度/m	3.5
道路网格大小	433\; {\text{m}} \times 250 \;{\text{m}}
仿真区域大小	1\;299\;{\text{m}} \times 750\;{\text{m}}
平均车速/(m·s−1)	15
车辆直行概率	0.5
车辆左转概率	0.25
车辆右转概率	0.25
表 3 仿真系统参数
Table 3. Simulation System Parameters
参数	取值
载波频率/GHz	2
信道带宽/MHz	1.5
V2V链路负载/Kb	30
快衰落模型	Rayleigh衰落
阴影衰落标准差/dB	3
阴影衰落分布	对数正态分布
基站天线高度/m	25
基站天线增益/dBi	8
基站接收噪声/dB	5
车辆天线高度/m	1.5
车辆天线增益/dBi	3
车辆接收噪声/dB	9
V2N传输功率/dBm	23,10,5
4.2 对比算法
本文设置如下的对比算法以验证所提出算法的性能:
1)随机化资源分配算法RP(random policy). 在每个时隙,智能体为车辆用户对随机分配传输信道和发射功率.
2)低延时高可靠性优化的深度强化学习资源分配算法LHP(low-latency high-reliability policy)[24]. 该算法利用深度强化学习DQN算法,在V2V与V2N共享资源的前提下,通过资源分配实现 V2V链路的延迟、可靠性以及V2N链路传输速率的优化.
3)基于AoI的低延时高可靠性优化的深度强化学习资源分配算法LHP-A(low-latency high-reliability policy based on AoI). 该算法是在文献[24]原有目标基础上添加了AoI的优化目标,使得智能体在决策时额外考虑了链路传输信息的时效性.
4)动态邻近感知资源分配算法DPP(dynamic proximity-aware policy)[35]. 该算法利用多对1匹配博弈算法实现V2V链路资源分配以最小化V2V链路的延迟.
4.3 评估指标
仿真实验所采用的性能指标如下所示:
1)V2N传输速率{C^{\rm c}}. V2N传输速率描述了V2N链路在单位时间内传输的数据量. V2N传输速率越高,表示系统在单位时间内能够传输的信息就越多. 这一指标可由式(6)计算得到.
2)V2V链路平均AoI {\bar A^t} . V2V链路平均AoI表征车辆获取数据的时效性. V2V链路平均AoI值越低,表示链路传输的信息的时效性越强,即传输的信息越新鲜. 这一指标可由式(13)计算得到.
3)V2V链路传输成功率{p_{{\text{success}}}}. V2V链路传输成功率是指V2V链路传输过程中,满足延迟约束的数据包所占的比例. V2V链路数据包传输的成功率越高,表示传输越可靠、满足延迟约束的效果越好. 计算方式如式(37)所示:
{p_{{\text{success}}}} = \dfrac{1}{{{k_{\max }}}}\displaystyle\sum\limits_{k = 1}^{{k_{\max }}} {{p_k}} . (37) 4)决策延迟时间{t_{\text{d}}}[36]. 决策延迟时间是衡量智能体响应速度的性能指标,是指从发起请求到决策所经历的平均等待时间,计算方式如式(38)所示:
{t_{\text{d}}} = \dfrac{1}{{{n_{\text{d}}}}}\displaystyle\sum\limits_{i = 1}^{{n_{\text{d}}}} {\left( {t_i^{\text{e}} - t_i^{\text{s}}} \right)} , (38) 其中 t_i^{\text{s}} 和 t_i^{\text{e}} 分别代表第i次决策的发起时刻和决策完成时刻, {n_{\text{d}}} 表示决策的总次数.
4.4 算法性能对比
1)收敛性
图6展示了子问题各算法训练过程. 随着训练轮数的增加,各算法的累计奖励均逐渐增大. 其中LHP-A算法在400轮左右达到收敛,PPO算法在350轮左右达到收敛,PPO-AM算法在200轮左右就已收敛,较前二者分别提速约50.0%和42.9%,收敛后的平均累计奖励分别优化约5.82%和19.41%. 这主要归功于PPO-AM算法引入了注意力机制,该机制使得模型在训练过程中能够更加精准地聚焦重要的状态信息,从而加速了对环境关键特征的识别与学习. 注意力机制有效地促进了模型快速捕捉到对累积奖励优化有正面影响的特征,进而加快了整个算法的收敛性.
2)不同车辆数量下各算法性能对比
车辆数量设置为20,40,60,80,100,120,基于表2和表3中的参数,以探究在不同交通密度条件下各算法的性能表现,如图7~9所示.
图7展示了不同车辆数量下各算法的V2V链路平均AoI. 就整体而言,AoI会随着车辆数量的增多呈上升趋势,原因是车辆数量增大会不可避免地导致资源块的竞争变得更加激烈,数据包的传输时间变长,导致链路的AoI会变大. 就不同车辆数量而言,当车辆数量为20辆时,3种算法的AoI性能差别不大,这是因为车辆数量较少时,资源竞争并不明显. 当车辆数为40辆时,资源竞争开始显现,导致RP算法的性能显著下降,其AoI开始急剧上升. 相比之下,LHP-A算法虽然也表现出AoI的增长,但其增长速度相对较慢,而MOPPO-AM算法显示出在资源受限环境下更为稳健的性能. 当车辆数量继续增加时,MOPPO-AM算法仍处于较优水平,AoI增长速度最小. 就平均性能而言,MOPPO-AM算法的AoI较RP算法减少54.4%,较LHP-A算法减少12%,说明MOPPO-AM算法在满足V2V链路传输信息时效性方面存在优势.
图8展示了不同车辆数量下各算法的V2V链路传输成功率. 随着车辆数量的增加,各算法的V2V链路传输成功率均出现了不同程度的下降趋势,MOPPO-AM算法下降趋势与其他算法相比较为平稳,但在车辆数为80时,成功率出现了较大幅度的下降,原因是车辆数量过多导致不能满足时间约束的数据传输增多. 当车辆数量达到40及以上时,MOPPO-AM算法相较对比算法的优势更为明显,与LHP-A算法相比,V2V链路传输成功率平均高出2.1个百分点,最多可达5.1个百分点. 此外,与LHP算法和DPP算法相比,MOPPO-AM的V2V链路传输成功率分别高出3.1个百分点和4.6个百分点. 这表明在对比算法中,MOPPO-AM算法在V2V链路传输成功率方面处于较优水平,在保证V2V链路传输稳定性方面存在优势.
图9展示了不同车辆数量下各算法的V2N链路传输速率. 在不同车辆数量的环境模拟中,MOPPO-AM算法均处于较优水平:较RP算法平均提高约103%,较DPP算法提高约16.7%,较LHP算法和LHP-A算法分别提高了12.1%和16.4%.
综合图7~9可知,在环境车辆数量不断增加的情况下,MOPPO-AM算法可以将车辆数量对性能数据的影响降到最低,由此说明MOPPO-AM算法能够更好地适应不同车辆数量的交通场景.
图10展示了不同车辆数量下各算法决策延迟时间. 由图10(a)可知,在决策延迟时间方面,传统算法DPP,RP和深度强化学习算法之间存在明显差异. 随着车辆数量的不断增加,DPP算法的决策延迟时间呈现出显著增长的趋势,而深度强化学习算法的决策延迟时间则表现出较好的稳定性. 这表明深度强化学习算法在处理大规模动态场景,尤其是车辆密集的交通环境时,相较于DPP算法具有明显的优势. 深度强化学习算法能够更加迅速且准确地做出决策,这不仅提升了决策的响应速度,也有助于实现高效、智能的交通管理. 图10(b)展示的是除DPP算法以外的其他算法决策延迟时间在不同车辆数量下的变化曲线. RP算法所做出的决策都是随机的,无需与环境交互,故决策时间最短. 其他3种算法的决策延迟时间均呈现稳定态势. LHP算法和LHP-A算法的决策延迟时间稳定在8.53 ms和9.57 ms左右;MOPPO-AM算法的决策延迟时间稳定在7.63 ms左右,在对比算法中处于最优水平,较LHP算法和LHP-A算法决策时间分别缩短10.6%和20.3%左右.
3)不同链路负载下各算法性能对比
V2V链路负载设置为26 Kb,28 Kb,30 Kb,32 Kb,34 Kb,基于表2和表3中的参数评估在不同链路负载下各算法的性能表现,如图11~13所示.
随着V2V链路负载持续增加,各算法的性能指标均呈现出变差的趋势,这是因为V2V链路负载的增大意味着链路传输任务消耗的资源也随之增多,在相同传输速度情况下,完成V2V链路传输任务所耗费的时间更多. 此外,V2V链路负载的增加还加剧了链路间对有限资源的竞争,导致性能指标的进一步恶化.
图11展示了不同链路负载下各算法的V2V链路平均AoI. 2种深度强化学习算法在不同的V2V链路负载的情况下均优于RP算法. 在链路负载为26 Kb和28 Kb时,MOPPO-AM算法和LHP-A算法的AoI性能接近. 但是,随着V2V链路负载不断增加,LHP-A算法的AoI有缓慢增大的趋势,而MOPPO-AM算法的AoI始终维持在0.12左右. 在AoI方面,MOPPO-AM算法较LHP-A算法平均降低15%,较RP算法平均降低58%.
图12展示了不同链路负载下各算法V2V链路传输成功率. 随着V2V链路负载的不断增加,5种算法的V2V链路传输成功率均呈现下降趋势. 其中RP算法的表现最差,平均V2V链路传输成功率只有0.83;LHP-A算法、LHP算法和DPP算法次优,分别为0.94,0.93,0.91; MOPPO-AM算法的表现最好,V2V链路传输成功率的下降速度最慢,平均水平可以达到0.96.
图13展示了不同链路负载下各算法V2N链路传输速率. 随着V2V链路负载的不断增加,V2V链路所分配的资源逐渐增多,这加剧了对V2N链路的干扰,使得V2N链路传输速率也随之降低. 但MOPPO-AM算法具有更高的V2N链路传输速率,平均值为140.16 Mbps,较次优的LHP算法平均提高12.79%.
综合图11~13可知,在V2V链路负载逐渐增加的情况下,MOPPO-AM算法仍具有较低的AoI、较高的V2V链路传输成功率和V2N链路传输速率,说明MOPPO-AM算法可以更好地完成多种负载大小的传输任务,且在完成传输任务的同时最小化对V2N链路的干扰.
4)不同信道带宽下各算法性能对比
信道带宽设置为1.00 MHz,1.25 MHz,1.50 MHz,1.75 MHz,2.00 MHz,基于表2和表3中的参数,以评估在不同信道带宽下各算法的性能表现,如图14~16所示.
随着信道带宽的不断增加,各算法的性能指标呈现出稳步提升的趋势,这是因为信道带宽的增加意味着可分配的传输资源变多,在V2V链路负载和延迟约束保持不变的情况下,V2V链路可分配到的传输资源变多,链路传输速率会有所提升,完成传输任务所需要的时间相应变短,所以性能数据呈现好转趋势.
图14展示了不同信道带宽下各算法的V2V链路平均AoI. 虽然MOPPO-AM算法的AoI下降趋势不比RP算法的明显,但MOPPO-AM算法的AoI在各种信道带宽的情况下都处于最优水平,平均值为0.122,MOPPO-AM算法较RP算法平均减少58.83%,较LHP-A算法平均减少17.72%.
图15展示了不同信道带宽下各算法的V2V链路传输成功率. 随着信道带宽的不断增加,各算法的V2V链路传输成功率均出现了不同程度的上升,其中LHP算法的涨幅最大,上涨了13.91个百分点,MOPPO-AM算法涨幅最小,上涨5.44个百分点. 但MOPPO-AM算法平均表现仍处于较优水平,平均V2V链路传输成功率为96.16%,较涨幅最大的LHP平均提高7.5个百分点.
图16展示了不同信道带宽下各算法的V2N链路传输速率. 在信道带宽较低的情况下,其他对比算法的V2N链路传输速率普遍偏低,但MOPPO-AM算法仍具有较高且稳定的V2N链路传输速率. MOPPO-AM算法在所有信道带宽下V2N链路传输速率的平均值为139.01 Mbps,在所有对比算法中处于最优水平,较RP算法平均提高118.89%,较DPP算法平均提高21.57%,较LHP算法和LHP-A算法分别平均提高13.2%和11.4%.
综合图14~16可知,在信道带宽较高、可分配的传输资源较充足的情况下,各算法的性能指标之间的差距不大. 在传输资源相对匮乏的情况下,MOPPO-AM算法的性能也出现了下降的趋势,与其他算法相比,MOPPO-AM算法性能下降幅度较小且比较稳定,在信道带宽为1.00 MHz和1.25 MHz时尤为明显. 由此可见,MOPPO-AM算法对信道带宽的变化具有更强的鲁棒性和适应性.
综上所述,实验结果充分证明了MOPPO-AM算法在学习能力和环境感知能力方面的出色表现. 该算法不仅能够有效地处理多目标优化问题,还能通过注意力机制处理与任务最相关的状态信息,从而在资源受限或竞争激烈的环境中展现显著优势,并得出3个结论:
1)注意力机制加速收敛. 基于注意力机制的PPO-AM算法能够更精准地聚焦关键状态特征,减少模型处理不相关信息的复杂度,提高数据利用效率,从而显著加快了收敛速度,有效缩短了训练周期. 同时,PPO通过截断函数clip限制策略更新的幅度,保证了训练过程的稳定性.
2)多目标优化均衡模型训练效果. 相较于传统的单目标优化算法,基于进化学习的多目标优化MOPPO-AM算法在多个关键性能指标上展现出显著优势,包括V2N链路传输速率、V2V链路传输速率、V2V链路延迟、V2V链路平均AoI以及决策延迟时间. 这是因为MOPPO-AM算法可以均衡V2V链路和V2N链路不同的优化目标,能够更好地满足车联网场景中业务类型的多样化需求.
3)大规模动态复杂场景的高效决策能力. MOPPO-AM算法结合了多目标优化和深度强化学习的优势,显著提升了智能体的探索能力和快速响应能力. 特别是在通信资源竞争加剧的情况下,MOPPO-AM算法依然展现出良好的鲁棒性和适应性,确保了决策过程的高效性和稳定性.
5. 结束语
针对蜂窝车联网环境中V2V 链路和 V2N 链路共享无线资源以满足不同性能指标的问题,建立了多目标优化无线资源分配数学规划模型,设计了一种基于进化学习的多目标深度强化学习决策框架求解该问题. 仿真结果表明,该算法保证了智能体在与环境不断交互的过程中快速学习V2V无线资源分配策略,有效解决了动态不确定蜂窝车联网环境下的资源分配问题,实现了V2V链路性能(即信息年龄、延迟以及传输速率)与V2N链路传输速率之间的权衡. 研究成果不仅提高了蜂窝车联网管控的自动化与自主化效率,而且简化了管控流程并降低了人员管理成本,也适用于其他大规模动态复杂网络部署与管理. 后续将研究基于多目标联邦强化学习算法,以解决车联网数据隐私保护场景下的无线资源分配问题.
作者贡献声明:李可负责指导选题、问题建模、算法设计、撰写与修改论文;马赛负责搜集文献资料、实现论文算法、整理实验数据、撰写论文;戴朋林负责指导实验实施、修改论文;任婧和范平志负责网络架构设计和修改论文.
-
表 1 VideoQA综述工作对比
Table 1 Comparison of VideoQA Survey Works
表 2 各数据集指标对比
Table 2 Comparison of Indicators of Each Data Set
数据集 年份 数据源 视频数 片段数 平均长度/s 问答对 问答类型 问答生成 MovieQA[80] 2016 电影 408 6771 202 14944 选择题 人工 LSMDC-QA[143] 2017 M-VAD/MPII-MD 202 118081 200 118114 选择题 人工 MovieFIB[139] 2017 LSMDC2016 180 118507 4.1 348998 填空题 自动 YouTube2Text-QA[55] 2017 YouTube2Text 1987 1987 40 99421 选择题/开放问题 自动 TGIF-QA[45] 2017 TGIF 71741 3 165165 选择题/开放问题 人工/自动 MSRVTT-QA[48] 2017 MSRVTT 7000 10000 15 243000 开放问题 自动 MSVD-QA[48] 2017 MSVD 1970 1970 10 50505 开放问题 自动 Video-QA[79] 2017 在线网络视频 18100 18100 90 175076 开放问题 自动 MarioQA[140] 2017 游戏视频 187757 选择题 自动 PororoQA[31] 2017 卡通视频 171 16066 8913 选择题 人工 TVQA[69] 2018 电视剧 925 21793 76 152545 选择题 人工 SVQA[59] 2018 Unity3D生成 12000 12000 118680 开放问题 自动 TVQA+[52] 2019 TVQA 279 4198 60~90 29383 选择题 人工 KnowIT VQA[137] 2019 电视剧 207 12087 20 24000 选择题 人工 Activitynet-QA[142] 2019 ActivityNet 5800 5800 180 58000 开放问题 人工 EgoVQA[144] 2019 IU Multiview 16 520 20~100 600 选择题/开放问题 人工 Social-IQ[145] 2019 YouTube 1250 1250 7500 选择题 人工 DramaQA[146] 2020 电视剧 18 23928 3.6 17983 选择题 人工 LifeQA[147] 2020 YouTube 59 275 74 2326 选择题 人工 Tutorial-VQA[148] 2020 网络教学视频 76 408 6195 开放问题 人工 How2QA[117] 2020 HowTo100M/TV 9035 22000 60 44007 选择题 人工 Env-QA[53] 2021 模拟器生成 23261 85072 选择题 自动 CLEVRER[132] 2021 模拟器生成 20000 20000 5 300000 选择题/开放问题 自动 TrafficQA[136] 2021 交通视频 10080 10080 62535 选择题 人工 NExT-QA[149] 2021 YFCC-100M 5440 5440 44 52044 选择题/开放问题 人工 AGQA[150] 2021 Action genome 9601 9601 30 36000000 选择题/开放问题 自动 STAR[151] 2021 Charades 22000 30 60000 选择题 自动 Fill-in-the-Blank[36] 2022 VaTeX 28000 28000 10 28000 填空题 人工 CRAFT[141] 2022 模拟器生成 9917 57524 10 57524 选择题 自动 EgoTaskQA[152] 2022 LEMMA 2000 2000 45 40000 选择题/开放问题 自动 表 3 主流模型在MovieQA上的性能表现
Table 3 Performance of Mainstream Models on MovieQA
表 4 主流模型在TVQA上的性能表现
Table 4 Performance of Mainstream Models on TVQA
表 5 主流模型在TGIF-QA上的性能表现
Table 5 Performance of Mainstream Models on TGIF-QA
| 模型 | 重复动作/% | 状态转换/% | 帧问答/% | 计数损失 |
| --- | --- | --- | --- | --- |
| ST-VQA[45] | 60.8 | 67.1 | 49.3 | 4.28 |
| Co-Mem[34] | 68.2 | 74.3 | 51.5 | 4.10 |
| PSAC[67] | 70.4 | 76.9 | 55.7 | 4.27 |
| LAD-Net[66] | 69.9 | 78.4 | 57.5 | 4.32 |
| STA[65] | 72.3 | 79.0 | 56.6 | 4.25 |
| Jin等人[72] | 72.7 | 80.9 | 57.1 | 4.17 |
| HME[84] | 73.9 | 77.8 | 53.8 | 4.02 |
| L-GCN[91] | 74.3 | 81.1 | 56.3 | 3.95 |
| HGA[92] | 75.4 | 81.0 | 55.1 | 4.09 |
| HCRN[129] | 75.0 | 81.4 | 55.9 | 3.82 |
| HOSTR[130] | 75.0 | 83.0 | 58.0 | 3.65 |
| FAM[81] | 75.4 | 79.2 | 56.9 | 3.79 |
| QueST[60] | 75.9 | 81.0 | 59.7 | 4.19 |
| Bridge2Answer[94] | 75.9 | 82.6 | 57.5 | 3.71 |
| TPT[109] | 76.6 | 81.6 | 57.8 | 3.63 |
| HAIR[101] | 77.8 | 82.3 | 60.0 | 3.88 |
| MSPAN[99] | 78.4 | 83.3 | 59.7 | 3.57 |
| HQGA[131] | 76.9 | 85.6 | 61.3 | |
| CoCo-BERT[123] | 78.3 | 85.6 | 61.1 | 3.78 |
| SiaSamRea[121] | 79.7 | 85.3 | 60.2 | 3.61 |
| PGAT[100] | 80.6 | 85.7 | 61.1 | 3.96 |
| CLIPBERT[120] | 82.8 | 87.8 | 60.3 | |
| VGNMN[27] | 84.5 | 88.7 | 74.7 | 2.65 |
| VIOLET[126] | 92.5 | 95.7 | 68.9 | |
| MERLOT[118] | 94.0 | 96.2 | 69.5 | |

表 6 主流模型在MSRVTT-QA和MSVD-QA上的性能表现
Table 6 Performance of Mainstream Models on MSRVTT-QA and MSVD-QA
单位:%

| 模型 | MSRVTT-QA What | Who | How | When | Where | All | MSVD-QA What | Who | How | When | Where | All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E-VQA[48] | 18.9 | 38.7 | 83.5 | 70.5 | 29.2 | 26.4 | 9.7 | 42.2 | 83.8 | 72.4 | 53.6 | 23.3 |
| E-SA[48] | 22.0 | 41.6 | 79.6 | 73.1 | 33.2 | 29.3 | 15.0 | 45.1 | 83.8 | 65.5 | 32.2 | 27.6 |
| E-MN[48] | 23.4 | 41.8 | 83.7 | 70.8 | 27.6 | 30.4 | 12.9 | 46.5 | 80.3 | 70.7 | 50.0 | 26.7 |
| AMU[48] | 26.2 | 43.0 | 80.2 | 72.5 | 30.0 | 32.5 | 20.6 | 47.5 | 83.5 | 72.4 | 53.6 | 32.0 |
| HRA[56] | | | | | | 35.1 | | | | | | 34.4 |
| HME[84] | 26.5 | 43.6 | 82.4 | 76 | 28.6 | 33 | 22.4 | 50.1 | 73 | 70.7 | 42.9 | 33.7 |
| L-GCN[91] | | | | | | | | | | | | 34.3 |
| Jin等人[72] | 29.5 | 45.0 | 83.2 | 74.7 | 42.4 | 35.4 | 24.2 | 49.5 | 83.8 | 74.1 | 53.6 | 35.0 |
| QueST[60] | 27.9 | 45.6 | 83 | 75.7 | 31.6 | 34.6 | 24.5 | 52.9 | 79.1 | 72.4 | 50 | 36.1 |
| FAM[81] | 26.9 | 43.9 | 82.8 | 70.6 | 31.1 | 33.2 | 23.1 | 51.6 | 82.2 | 71.4 | 51.9 | 34.5 |
| SSML[122] | | | | | | 35.1 | | | | | | 35.1 |
| TSN[33] | 27.9 | 46.1 | 84.1 | 77.8 | 37.6 | 35.4 | 25.0 | 51.3 | 83.8 | 78.4 | 59.1 | 36.7 |
| HGA[92] | 29.2 | 45.7 | 83.5 | 75.2 | 34.0 | 35.5 | 23.5 | 50.4 | 83.0 | 72.4 | 46.4 | 34.7 |
| MHMAN[86] | 28.7 | 47.1 | 85.1 | 77.1 | 35.2 | 35.6 | 23.3 | 50.7 | 84.1 | 72.4 | 53.6 | 34.6 |
| HCRN[129] | | | | | | 35.6 | | | | | | 36.1 |
| ActBERT[37] | 29.4 | 45.6 | 79.8 | 76.7 | 36.4 | 35.5 | 28.7 | 53.8 | 80.0 | 70.7 | 46.4 | 39.0 |
| HOSTR[130] | | | | | | 35.9 | | | | | | 39.4 |
| Bridge2Answer[94] | | | | | | 36.9 | | | | | | 37.2 |
| OCRL+LOGNet[97] | | | | | | 36.0 | | | | | | 38.2 |
| HAIR[101] | | | | | | 37.5 | | | | | | 36.9 |
| CLIPBERT[120] | | | | | | 37.4 | | | | | | |
| PGAT[100] | | | | | | 38.1 | | | | | | 39.0 |
| TPT[109] | | | | | | 38.5 | | | | | | 37.7 |
| MSPAN[99] | 31.9 | 47.2 | 83.2 | 77.5 | 38.4 | 37.8 | 31.0 | 53.8 | 77.0 | 72.1 | 53.6 | 40.3 |
| HQGA[131] | 32.5 | 48.9 | 81.5 | 78.3 | 38.4 | 38.6 | 30.4 | 57.2 | 76.2 | 75.9 | 32.1 | 41.2 |
| CoMVT[124] | | | | | | 39.5 | | | | | | 42.6 |
| SiaSamRea[121] | | | | | | 41.6 | | | | | | 45.5 |
| VQA-T[116] | | | | | | 41.5 | | | | | | 46.3 |
| VIOLET[126] | | | | | | 43.9 | | | | | | 47.9 |
| LiVLR[98] | 50.3 | 77.1 | 94.2 | 81.3 | 48.4 | 59.4 | | | | | | |

-
[1] 俞俊,汪亮,余宙. 视觉问答技术研究[J]. 计算机研究与发展,2018,55(9):1946−1958 doi: 10.7544/issn1000-1239.2018.20180168 Yu Jun, Wang Liang, Yu Zhou. Research on visual question answering technology[J]. Journal of Computer Research and Development, 2018, 55(9): 1946−1958 (in Chinese) doi: 10.7544/issn1000-1239.2018.20180168
[2] Antol S, Agrawal A, Lu Jiasen, et al. VQA: Visual question answering[C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2015: 2425−2433
[3] Yang Zichao, He Xiaodong, Gao Jianfeng, et al. Stacked attention networks for image question answering[C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 21−29
[4] Qiao Tingting, Dong Jianfeng, Xu Duanqing. Exploring human-like attention supervision in visual question answering[C]//Proc of the 32nd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2018: 7300−7307
[5] Yu Zhou, Yu Jun, Cui Yuhao, et al. Deep modular co-attention networks for visual question answering[C]//Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 6281−6290
[6] Luo Haozheng, Qin Ruiyang. Open-ended multi-modal relational reason for video question answering[J]. arXiv preprint, arXiv: 2012.00822, 2020
[7] 包希港,周春来,肖克晶,等. 视觉问答研究综述[J]. 软件学报,2021,32(8):2522−2544 doi: 10.13328/j.cnki.jos.006215 Bao Xigang, Zhou Chunlai, Xiao Kejing, et al. Review of visual question answering research[J]. Journal of Software, 2021, 32(8): 2522−2544 (in Chinese) doi: 10.13328/j.cnki.jos.006215
[8] Patel D, Parikh R, Shastri Y. Recent advances in video question answering: A review of datasets and methods [C]//Proc of the Int Conf on Pattern Recognition. Berlin: Springer, 2021: 339−356
[9] Khurana K, Deshpande U. Video question-answering techniques, benchmark datasets and evaluation metrics leveraging video captioning: A comprehensive survey [J]. IEEE Access, 2021, 9: 43799−43823
[10] Sun Guanglu, Liang Lili, Li Tianlin, et al. Video question answering: A survey of models and datasets[J]. Mobile Networks and Applications, 2021, 26(5): 1904−1937 doi: 10.1007/s11036-020-01730-0
[11] Mnih V, Heess N, Graves A. Recurrent models of visual attention [C]// Proc of the 27th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2014: 2204−2212
[12] Weston J, Chopra S, Bordes A. Memory networks[J]. arXiv preprint, arXiv: 1410.3916, 2014
[13] Scarselli F, Gori M, Tsoi A C, et al. The graph neural network model[J]. IEEE Transactions on Neural Networks, 2008, 20(1): 61−80
[14] Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks[J]. arXiv preprint, arXiv: 1609.02907, 2016
[15] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need [C]// Proc of the 30th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 5998−6008
[16] Devlin J, Chang Mingwei, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [C]//Proc of the 17th Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2019: 4171−4186
[17] Ren Shaoqing, He Kaiming, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(6): 1137−1149
[18] Deng Jia, Dong Wei, Socher R, et al. ImageNet: A large-scale hierarchical image database [C]//Proc of the 22nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2009: 248−255
[19] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint, arXiv: 1409.1556, 2014
[20] Szegedy C, Liu Wei, Jia Yangqing, et al. Going deeper with convolutions [C]//Proc of the 28th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 1−9
[21] He Kaiming, Zhang Xiangyu, Ren Shaoqing, et al. Deep residual learning for image recognition [C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 770−778
[22] Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3D convolutional networks [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2015: 4489−4497
[23] Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset [C]//Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 6299−6308
[24] Xie Saining, Sun Chen, Huang Jonathan, et al. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification [C]//Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 305−321
[25] Hara K, Kataoka H, Satoh Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? [C]//Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 6546−6555
[26] Feichtenhofer C, Fan Haoqi, Malik J, et al. SlowFast networks for video recognition [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 6202−6211
[27] Le H, Chen N F, Hoi S C H. VGNMN: Video-grounded neural module network to video-grounded language tasks[J]. arXiv preprint, arXiv: 2104.07921, 2021
[28] Shah A, Lin T H, Wu Shijie. Triple attention network architecture for MovieQA[J]. arXiv preprint, arXiv: 2111.09531, 2021
[29] Aytar Y, Vondrick C, Torralba A. SoundNet: Learning sound representations from unlabeled video [C]//Proc of the 29th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2016: 892−900
[30] Kumar A, Khadkevich M, Fügen C. Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes [C]//Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2018: 326−330
[31] Kim K M, Heo M O, Choi S H, et al. Deepstory: Video story QA by deep embedded memory networks [C]//Proc of the 26th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2017: 2016−2022
[32] Na S, Lee S, Kim J, et al. A read-write memory network for movie story understanding [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 677−685
[33] Yang Tianhao, Zha Zhengjun, Xie Hongtao, et al. Question-aware tube-switch network for video question answering [C]//Proc of the 27th ACM Int Conf on Multimedia. New York: ACM, 2019: 1184−1192
[34] Gao Jiyang, Ge Runzhou, Chen Kan, et al. Motion-appearance co-memory networks for video question answering [C]//Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 6576−6585
[35] Wang Bo, Xu Youjiang, Han Yahong, et al. Movie question answering: Remembering the textual cues for layered visual contents [C]//Proc of the 32nd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2018: 7380−7387
[36] Castro S, Wang Ruoyao, Huang Pingxuan, et al. FIBER: Fill-in-the-blanks as a challenging video understanding evaluation framework [C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2022: 2925−2940
[37] Zhu Linchao, Yang Yi. Actbert: Learning global-local video-text representations [C]//Proc of the 33rd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 8746−8755
[38] Mikolov T, Chen Kai, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint, arXiv: 1301.3781, 2013
[39] Pennington J, Socher R, Manning C D. GloVe: Global vectors for word representation [C]//Proc of the 2014 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2014: 1532−1543
[40] Kiros R, Zhu Yukun, Salakhutdinov R R, et al. Skip-Thought vectors [C]//Proc of the 28th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2015: 3294−3302
[41] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735−1780 doi: 10.1162/neco.1997.9.8.1735
[42] Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation [C]//Proc of the 2014 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2014: 1724−1734
[43] Zhao Zhou, Lin Jinghao, Jiang Xinghua, et al. Video question answering via hierarchical dual-level attention network learning [C]//Proc of the 25th ACM Int Conf on Multimedia. New York: ACM, 2017: 1050−1058
[44] Xue Hongyang, Chu Wenqing, Zhao Zhou, et al. A better way to attend: Attention with trees for video question answering[J]. IEEE Transactions on Image Processing, 2018, 27(11): 5563−5574 doi: 10.1109/TIP.2018.2859820
[45] Jang Y, Song Y, Yu Y, et al. TGIF-QA: Toward spatio-temporal reasoning in visual question answering [C]//Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 2758−2766
[46] Falcon A, Lanz O, Serra G. Data augmentation techniques for the video question answering task [C]//Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 511−525
[47] Mazaheri A, Zhang Dong, Shah M. Video fill in the blank using LR/RL LSTMs with spatial-temporal attentions [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 1407−1416
[48] Xu Dejing, Zhao Zhou, Xiao Jun, et al. Video question answering via gradually refined attention over appearance and motion [C]//Proc of the 25th ACM Int Conf on Multimedia. New York: ACM, 2017: 1645−1653
[49] Chao Guanlin, Rastogi A, Yavuz S, et al. Learning question-guided video representation for multi-turn video question answering [C]//Proc of the 20th Annual SIGDIAL Meeting on Discourse and Dialogue. Stroudsburg, PA: ACL, 2019: 215−225
[50] Zhao Zhou, Zhang Zhu, Xiao Shuwen, et al. Open-ended long-form video question answering via adaptive hierarchical reinforced networks [C]//Proc of the 27th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2018: 3683−3689
[51] Kim J, Ma M, Kim K, et al. Gaining extra supervision via multi-task learning for multi-modal video question answering [C]//Proc of the 2019 Int Joint Conf on Neural Networks. Piscataway, NJ: IEEE, 2019: 1−8
[52] Lei Jie, Yu Licheng, Berg T L, et al. TVQA+: Spatio-temporal grounding for video question answering [C]//Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 8211−8225
[53] Gao Difei, Wang Ruiping, Bai Ziyi, et al. Env-QA: A video question answering benchmark for comprehensive understanding of dynamic environments [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 1675−1685
[54] Yu Y, Kim J, Kim G. A joint sequence fusion model for video question answering and retrieval [C]//Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 471−487
[55] Ye Yunan, Zhao Zhou, Li Yimeng, et al. Video question answering via attribute-augmented attention network learning [C]//Proc of the 40th Annual Int ACM SIGIR Conf on Research and Development in Information Retrieval. New York: ACM, 2017: 829−832
[56] Chowdhury M I H, Nguyen K, Sridharan S, et al. Hierarchical relational attention for video question answering [C]//Proc of the 25th IEEE Int Conf on Image Processing. Piscataway, NJ: IEEE, 2018: 599−603
[57] Zhao Zhou, Jiang Xinghua, Cai Deng, et al. Multi-turn video question answering via multi-stream hierarchical attention context network [C]//Proc of the 27th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2018: 3690−3696
[58] Zhao Zhou, Yang Qifan, Cai Deng, et al. Video question answering via hierarchical spatio-temporal attention networks [C]//Proc of the 26th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2017: 3518−3524
[59] Song Xiaomeng, Shi Yucheng, Chen Xin, et al. Explore multi-step reasoning in video question answering [C]//Proc of the 26th ACM Int Conf on Multimedia. New York: ACM, 2018: 239−247
[60] Jiang Jianwen, Chen Ziqiang, Lin Haojie, et al. Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering [C]//Proc of the 34th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2020: 11101−11108
[61] Liang Junwei, Jiang Lu, Cao Liangliang, et al. Focal visual-text attention for visual question answering [C]//Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 6135−6143
[62] Yu Zhou, Yu Jun, Fan Jianping, et al. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 1821−1830
[63] Xue Hongyang, Zhao Zhou, Cai Deng. Unifying the video and question attentions for open-ended video question answering[J]. IEEE Transactions on Image Processing, 2017, 26(12): 5656−5666 doi: 10.1109/TIP.2017.2746267
[64] Chu Wenqing, Xue Hongyang, Zhao Zhou, et al. The forgettable-watcher model for video question answering[J]. Neurocomputing, 2018, 314: 386−393 doi: 10.1016/j.neucom.2018.06.069
[65] Gao Lianli, Zeng Pengpeng, Song Jingkuan, et al. Structured two-stream attention network for video question answering [C]//Proc of the 33rd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2019: 6391−6398
[66] Li Xiangpeng, Gao Lianli, Wang Xuanhan, et al. Learnable aggregating net with diversity learning for video question answering [C]//Proc of the 27th ACM Int Conf on Multimedia. New York: ACM, 2019: 1166−1174
[67] Li Xiangpeng, Song Jingkuan, Gao Lianli, et al. Beyond RNNs: Positional self-attention with co-attention for video question answering [C]//Proc of the 33rd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2019: 8658−8665
[68] Kim K M, Choi S H, Kim J H, et al. Multimodal dual attention memory for video story question answering [C]//Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 673−688
[69] Lei Jie, Yu Licheng, Bansal M, et al. TVQA: Localized, compositional video question answering [C]//Proc of the 2018 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2018: 1369−1379
[70] Li Fangtao, Bai Ting, Cao Chenyu, et al. Relation-aware hierarchical attention framework for video question answering[C]//Proc of the 2021 Int Conf on Multimedia Retrieval. New York: ACM, 2021: 164−172
[71] Kim J, Ma M, Pham T, et al. Modality shifting attention network for multi-modal video question answering [C]//Proc of the 33rd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10106−10115
[72] Jin Weike, Zhao Zhou, Gu Mao, et al. Multi-interaction network with object relation for video question answering [C]//Proc of the 27th ACM Int Conf on Multimedia. New York: ACM, 2019: 1193−1201
[73] Kim H, Tang Zineng, Bansal M. Dense-caption matching and frame-selection gating for temporal localization in VideoQA [C]//Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 4812−4822
[74] Chadha A, Arora G, Kaloty N. iPerceive: Applying common-sense reasoning to multi-modal dense video captioning and video question answering [C]//Proc of the 2021 IEEE Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2021: 1−13
[75] Seo M, Kembhavi A, Farhadi A, et al. Bidirectional attention flow for machine comprehension [C/OL]//Proc of the Int Conf on Learning Representations. 2017 [2022-01-10].https://openreview.net/forum?id=HJ0UKP9ge
[76] Yu A W, Dohan D, Luong M T, et al. Qanet: Combining local convolution with global self-attention for reading comprehension [C/OL]//Proc of the Int Conf on Learning Representations. 2018 [2022-01-10].https://openreview.net/forum?id=B14TlG-RW
[77] Veličković P, Cucurull G, Casanova A, et al. Graph attention networks [C/OL]//Proc of the Int Conf on Learning Representations. 2018 [2022-01-12].https://openreview.net/forum?id=rJXMpikCZ
[78] Sukhbaatar S, Weston J, Fergus R. End-to-end memory networks [C]//Proc of the 28th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2015: 2440−2448
[79] Zeng K H, Chen T H, Chuang C Y, et al. Leveraging video descriptions to learn video question answering [C]//Proc of the 31st AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2017: 4334−4340
[80] Tapaswi M, Zhu Yukun, Stiefelhagen R, et al. MovieQA: Understanding stories in movies through question-answering [C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4631−4640
[81] Cai Jiayin, Yuan Chun, Shi Cheng, et al. Feature augmented memory with global attention network for VideoQA [C]//Proc of the 30th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2021: 998−1004
[82] Kumar A, Irsoy O, Ondruska P, et al. Ask me anything: Dynamic memory networks for natural language processing [C]//Proc of the 33rd Int Conf on Machine Learning. New York: ACM, 2016: 1378−1387
[83] Xiong Caiming, Merity S, Socher R. Dynamic memory networks for visual and textual question answering [C]//Proc of the 33rd Int Conf on Machine Learning. New York: ACM, 2016: 2397−2406
[84] Fan Chenyou, Zhang Xiaofan, Zhang Shu, et al. Heterogeneous memory enhanced multimodal attention model for video question answering [C]//Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 1999−2007
[85] Kim J, Ma M, Kim K, et al. Progressive attention memory network for movie story question answering [C]//Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 8337−8346
[86] Yu Ting, Yu Jun, Yu Zhou, et al. Long-term video question answering via multimodal hierarchical memory attentive networks[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 31(3): 931−944
[87] Fukui A, Park D H, Yang D, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding [C]//Proc of the 2016 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2016: 457−468
[88] Kim J H, On K W, Lim W, et al. Hadamard product for low-rank bilinear pooling [C/OL]//Proc of the Int Conf on Learning Representations. 2017 [2022-01-20].https://openreview.net/forum?id=r1rhWnZkg
[89] Wang Zhichun, Lv Qingsong, Lan Xiaohan, et al. Cross-lingual knowledge graph alignment via graph convolutional networks [C]//Proc of the 2018 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2018: 349−357
[90] Fan Wenqi, Ma Yao, Li Qing, et al. Graph neural networks for social recommendation [C]//Proc of the 2019 World Wide Web Conf. New York: ACM, 2019: 417−426
[91] Huang Deng, Chen Peihao, Zeng Runhao, et al. Location-aware graph convolutional networks for video question answering [C]//Proc of the 34th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2020: 11021−11028
[92] Jiang Pin, Han Yahong. Reasoning with heterogeneous graph alignment for video question answering [C]//Proc of the 34th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2020: 11109−11116
[93] Seo A, Kang G C, Park J, et al. Attend what you need: Motion-appearance synergistic networks for video question answering [C]//Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2021: 6167–6177
[94] Park J, Lee J, Sohn K. Bridge to Answer: Structure-aware graph interaction network for video question answering [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 15526−15535
[95] Wang Jianyu, Bao Bingkun, Xu Changsheng. DualVGR: A dual-visual graph reasoning unit for video question answering [J]. IEEE Transactions on Multimedia, 2021, 24: 3369−3380
[96] Wang Xiao, Zhu Meiqi, Bo Deyu, et al. AM-GCN: Adaptive multi-channel graph convolutional networks [C]//Proc of the 26th ACM SIGKDD Int Conf on Knowledge Discovery & Data Mining. New York: ACM, 2020: 1243−1253
[97] Dang L H, Le T M, Le V, et al. Object-centric representation learning for video question answering [C/OL]//Proc of the 2021 Int Joint Conf on Neural Networks. Piscataway, NJ: IEEE, 2021 [2022-01-22].https://arxiv.org/abs/2104.05166
[98] Jiang Jingjing, Liu Ziyi, Zheng Nanning, et al. LiVLR: A lightweight visual-linguistic reasoning framework for video question answering [J/OL]. IEEE Transactions on Multimedia, 2022 [2022-02-22].https://github.com/jingjing12110/LiVLR-VideoQA
[99] Guo Zhicheng, Zhao Jiaxuan, Jiao Licheng, et al. Multi-scale progressive attention network for video question answering [C]//Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing (Volume 2: Short Papers). Stroudsburg, PA: ACL, 2021: 973−978
[100] Peng Liang, Yang Shuangji, Bin Yi, et al. Progressive graph attention network for video question answering [C]//Proc of the 29th ACM Int Conf on Multimedia. New York: ACM, 2021: 2871−2879
[101] Liu Fei, Liu Jing, Wang Weining, et al. HAIR: Hierarchical visual-semantic relational reasoning for video question answering [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 1698−1707
[102] Yang Zekun, Garcia N, Chu Chenhui, et al. Bert representations for video question answering [C]//Proc of the 2020 IEEE Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2020: 1556−1565
[103] Urooj Khan A, Mazaheri A, da Vitoria Lobo N, et al. MMFT-BERT: Multimodal fusion transformer with BERT encodings for visual question answering [C]//Proc of the 2020 Conf on Empirical Methods in Natural Language Processing (Findings). Stroudsburg, PA: ACL, 2020: 4648−4660
[104] Garcia N, Nakashima Y. Knowledge-based video question answering with unsupervised scene descriptions [C]//Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 581−598
[105] Engin D, Schnitzler F, Duong N Q K, et al. On the hidden treasure of dialog in video question answering [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 2064−2073
[106] Sadhu A, Chen Kan, Nevatia R. Video question answering with phrases via semantic roles [C]//Proc of the 19th Int Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2021: 2460−2478
[107] Ganesan A, Pal D, Muthuraman K, et al. Video based contextual question answering[J]. arXiv preprint, arXiv: 1804.07399, 2018
[108] Cherian A, Hori C, Marks T K, et al. (2.5+1)D spatio-temporal scene graphs for video question answering [C]//Proc of the 36th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2022: 444−453
[109] Peng Min, Wang Chongyang, Gao Yuan, et al. Temporal pyramid transformer with multimodal interaction for video question answering[J]. arXiv preprint, arXiv: 2109.04735, 2021
[110] Tan Hao, Bansal M. Lxmert: Learning cross-modality encoder representations from transformers [C]//Proc of the 2019 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2019: 5100−5111
[111] Sun Chen, Myers A, Vondrick C, et al. Videobert: A joint model for video and language representation learning [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 7464−7473
[112] Chen Xinlei, Fang Hao, Lin T Y, et al. Microsoft COCO captions: Data collection and evaluation server[J]. arXiv preprint, arXiv: 1504.00325, 2015
[113] Krishna R, Zhu Yuke, Groth O, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123(1): 32−73 doi: 10.1007/s11263-016-0981-7
[114] Miech A, Zhukov D, Alayrac J B, et al. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 2630−2640
[115] Kim S, Jeong S, Kim E, et al. Self-supervised pre-training and contrastive representation learning for multiple-choice video QA[J]. arXiv preprint, arXiv: 2009.08043, 2020
[116] Yang A, Miech A, Sivic J, et al. Just ask: Learning to answer questions from millions of narrated videos [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 1686−1697
[117] Li Linjie, Chen Y C, Cheng Yu, et al. HERO: Hierarchical encoder for video+language omni-representation pre-training [C]//Proc of the 2020 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 2046−2065
[118] Zellers R, Lu Ximing, Hessel J, et al. MERLOT: Multimodal neural script knowledge models [C]//Proc of the 34th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2021: 23634−23651
[119] Liu Yinhan, Ott M, Goyal N, et al. RoBERTa: A robustly optimized bert pretraining approach [C/OL]//Proc of the Int Conf on Learning Representations. 2020 [2022-01-25].https://openreview.net/forum?id=SyxS0T4tvS
[120] Lei Jie, Li Linjie, Zhou Luowei, et al. Less is more: Clipbert for video-and-language learning via sparse sampling [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 7331−7341
[121] Yu Weijiang, Zheng Haoteng, Li Mengfei, et al. Learning from Inside: Self-driven siamese sampling and reasoning for video question answering [C]//Proc of the 34th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2021: 26462−26474
[122] Amrani E, Ben-Ari R, Rotman D, et al. Noise estimation using density estimation for self-supervised multimodal learning [C]//Proc of the 35th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2021: 6644−6652
[123] Luo Jianjie, Li Yehao, Pan Yingwei, et al. CoCo-BERT: Improving video-language pre-training with contrastive cross-modal matching and denoising [C]//Proc of the 29th ACM Int Conf on Multimedia. New York: ACM, 2021: 5600−5608
[124] Seo P H, Nagrani A, Schmid C. Look before you speak: Visually contextualized utterances [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 16877−16887
[125] Lu Jiasen, Batra D, Parikh D, et al. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks [C]//Proc of the 32nd Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2019: 13−23
[126] Fu T J, Li Linjie, Gan Zhe, et al. VIOLET: End-to-end video-language transformers with masked visual-token modeling[J]. arXiv preprint, arXiv: 2111.12681, 2021
[127] Liu Ze, Ning Jia, Cao Yue, et al. Video swin transformer [C]//Proc of the 35th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 3202−3211
[128] Zhou Luowei, Liu Jingjing, Cheng Yu, et al. Cupid: Adaptive curation of pre-training data for video-and-language representation learning[J]. arXiv preprint, arXiv: 2104.00285, 2021
[129] Le T M, Le V, Venkatesh S, et al. Hierarchical conditional relation networks for video question answering [C]//Proc of the 33rd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 9972−9981
[130] Dang L H, Le T M, Le V, et al. Hierarchical object-oriented spatio-temporal reasoning for video question answering [C]//Proc of the 30th Int Joint Conf on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 2021: 636−642
[131] Xiao Junbin, Yao A, Liu Zhiyuan, et al. Video as conditional graph hierarchy for multi-granular question answering [C]//Proc of the 36th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2022: 2804−2812
[132] Yi Kexin, Gan Chuang, Li Yunzhu, et al. CLEVRER: Collision events for video representation and reasoning [C/OL]//Proc of the Int Conf on Learning Representations. 2020 [2022-09-03].https://openreview.net/forum?id=HkxYzANYDB
[133] Chen Zhenfang, Mao Jiayuan, Wu Jiajun, et al. Grounding physical concepts of objects and events through dynamic visual reasoning [C/OL]//Proc of the Int Conf on Learning Representations. 2021 [2022-09-03].https://openreview.net/pdf?id=bhCDO_cEGCz
[134] Ding Mingyu, Chen Zhenfang, Du Tao, et al. Dynamic visual reasoning by learning differentiable physics models from video and language [C]//Proc of the 34th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2021, 34: 887−899
[135] Mao Jiayuan, Gan Chuang, Kohli P, et al. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision [C/OL]//Proc of the Int Conf on Learning Representations. 2019 [2022-09-04].https://research.ibm.com/publications/the-neuro-symbolic-concept-learner-interpreting-scenes-words-and-sentences-from-natural-supervision
[136] Xu Li, Huang He, Liu Jun. SUTD-TrafficQA: A question answering benchmark and an efficient network for video reasoning over traffic events [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 9878−9888
[137] Garcia N, Otani M, Chu Chenhui, et al. KnowIT VQA: Answering knowledge-based questions about videos [C]//Proc of the 34th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2020: 10826−10834
[138] Han Yahong, Wang Bo, Hong Richang, et al. Movie question answering via textual memory and plot graph[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 30(3): 875−887
[139] Maharaj T, Ballas N, Rohrbach A, et al. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering [C]//Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 6884−6893
[140] Mun J, Hongsuck Seo P, Jung I, et al. MarioQA: Answering questions by watching gameplay videos [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 2867−2875
[141] Ates T, Atesoglu M S, Yigit C, et al. CRAFT: A benchmark for causal reasoning about forces and interactions [C]//Proc of the 2022 Findings of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 2602−2627
[142] Yu Zhou, Xu Dejing, Yu Jun, et al. Activitynet-QA: A dataset for understanding complex web videos via question answering [C]//Proc of the 33rd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2019: 9127−9134
[143] Torabi A, Tandon N, Sigal L. Learning language-visual embedding for movie understanding with natural-language[J]. arXiv preprint, arXiv:1609.08124, 2016
[144] Fan Chenyou. EgoVQA: An egocentric video question answering benchmark dataset [C]//Proc of the 2019 IEEE Int Conf on Computer Vision Workshops. Piscataway, NJ: IEEE, 2019: 4359−4366
[145] Zadeh A, Chan M, Liang P P, et al. Social-IQ: A question answering benchmark for artificial social intelligence [C]//Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 8807−8817
[146] Choi S, On K W, Heo Y J, et al. DramaQA: Character-centered video story understanding with hierarchical QA [C]//Proc of the 35th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2021: 1166−1174
[147] Castro S, Azab M, Stroud J, et al. LifeQA: A real-life dataset for video question answering [C]//Proc of the 12th Language Resources and Evaluation Conf. Marseille: European Language Resources Association (ELRA), 2020: 4352−4358
[148] Colas A, Kim S, Dernoncourt F, et al. Tutorial-VQA: Question answering dataset for tutorial videos [C]//Proc of the 12th Language Resources and Evaluation Conf. Marseille: European Language Resources Association (ELRA), 2020: 5450−5455
[149] Xiao Junbin, Shang Xindi, Yao A, et al. NExT-QA: Next phase of question-answering to explaining temporal actions [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 9777−9786
[150] Grunde-McLaughlin M, Krishna R, Agrawala M. AGQA: A benchmark for compositional spatio-temporal reasoning [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 11287−11297
[151] Wu Bo, Yu Shoubin, Chen Zhenfang, et al. STAR: A benchmark for situated reasoning in real-world videos [C/OL]//Proc of the 35th Conf on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). Cambridge, MA: MIT, 2021 [2022-01-25]. https://openreview.net/forum?id=EfgNF5-ZAjM
[152] Jia Baoxiong, Lei Ting, Zhu Songchun, et al. EgoTaskQA: Understanding human tasks in egocentric videos [C/OL]//Proc of the 36th Advances in Neural Information Processing Systems Datasets and Benchmarks Track. Cambridge, MA: MIT, 2022 [2022-01-25]. https://openreview.net/forum?id=ttxAvIQA4i_
[153] Jasani B, Girdhar R, Ramanan D. Are we asking the right questions in MovieQA? [C]//Proc of the IEEE Int Conf on Computer Vision Workshops. Piscataway, NJ: IEEE, 2019: 1879−1882
[154] Rohrbach A, Torabi A, Rohrbach M, et al. Movie description[J]. International Journal of Computer Vision, 2017, 123(1): 94−120 doi: 10.1007/s11263-016-0987-1
[155] Kolve E, Mottaghi R, Han W, et al. AI2-THOR: An interactive 3D environment for visual AI[J]. arXiv preprint, arXiv:1712.05474, 2017
[156] Li Yuncheng, Song Yale, Cao Liangliang, et al. TGIF: A new dataset and benchmark on animated Gif description [C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4641−4650
[157] Guadarrama S, Krishnamoorthy N, Malkarnenkar G, et al. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2013: 2712−2719
[158] Ji Jingwei, Krishna R, Fei-Fei L, et al. Action genome: Actions as compositions of spatio-temporal scene graphs [C]//Proc of the 33rd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10236−10247
[159] Thomee B, Shamma D A, Friedland G, et al. YFCC-100M: The new data in multimedia research[J]. Communications of the ACM, 2016, 59(2): 64−73 doi: 10.1145/2812802
[160] Sigurdsson G A, Russakovsky O, Gupta A. What actions are needed for understanding human actions in videos? [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 2137−2146
[161] Wang Xin, Wu Jiawei, Chen Junkun, et al. VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research [C]//Proc of the IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 4581−4591
[162] Jia Baoxiong, Chen Yixin, Huang Siyuan, et al. LEMMA: A multi-view dataset for learning multi-agent multi-task activities [C]//Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 767−786
[163] Wang Xinyu, Liu Yuliang, Shen Chunhua, et al. On the general value of evidence, and bilingual scene-text visual question answering [C]//Proc of the 33rd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 10126−10135
[164] Marino K, Chen Xinlei, Parikh D, et al. KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 14111−14121
[165] Zhang Yifeng, Jiang Ming, Zhao Qi. Explicit knowledge incorporation for visual reasoning [C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 1356−1365
[166] Wang Peng, Wu Qi, Shen Chunhua, et al. FVQA: Fact-based visual question answering[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(10): 2413−2427
[167] Wu Qi, Wang Peng, Shen Chunhua, et al. Ask me anything: Free-form visual question answering based on knowledge from external sources [C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4622−4630
[168] Marino K, Rastegari M, Farhadi A, et al. OK-VQA: A visual question answering benchmark requiring external knowledge [C]//Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 3195−3204
[169] Shah S, Mishra A, Yadati N, et al. KVQA: Knowledge-aware visual question answering [C]//Proc of the 33rd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2019: 8876−8884
[170] Chen Zhou, Chen Jiaoyan, Geng Yuxia, et al. Zero-shot visual question answering using knowledge graph [C]//Proc of the 20th Int Semantic Web Conf. Berlin: Springer, 2021: 146−162
[171] Wu Jialin, Lu Jiasen, Sabharwal A, et al. Multi-modal answer validation for knowledge-based VQA [C]//Proc of the 36th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2022: 2712−2721
[172] Wu Qi, Shen Chunhua, Wang Peng, et al. Image captioning and visual question answering based on attributes and external knowledge[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(6): 1367−1381
[173] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision [C]//Proc of the 38th Int Conf on Machine Learning. New York: ACM, 2021: 8748−8763
[174] Ju Chen, Han Tengda, Zheng Kunhao, et al. Prompting visual-language models for efficient video understanding [C]//Proc of the 17th European Conf on Computer Vision. Berlin: Springer, 2022: 105−124