A Unified Framework Based on Multimodal Aspect-Term Extraction and Aspect-Level Sentiment Classification
Keywords:
- aspect-term extraction (AE)
- aspect-level sentiment classification (ALSC)
- unified framework
- shared feature representation
- sequence tagging
Abstract: Aspect-term extraction (AE) and aspect-level sentiment classification (ALSC) extract aspect-sentiment pairs from a sentence, which helps social media platforms such as Twitter and Facebook mine users' sentiments toward different aspects and is of great significance for personalized recommendation. In the multimodal setting, existing methods use two independent models to complete the two subtasks: aspect-term extraction identifies entities such as goods and important people, or aspects of those entities, in the sentence, and aspect-level sentiment classification predicts the user's sentiment orientation toward a given aspect term. These methods have two problems. First, using two independent models loses the continuity of the underlying features between the two tasks and cannot model the latent semantic associations within sentences. Second, aspect-level sentiment classification predicts the sentiment of only one aspect at a time, which does not match the throughput of aspect-term extraction, which extracts multiple aspects simultaneously; moreover, the serial execution of the two models makes extracting aspect-sentiment pairs inefficient. To solve these problems, a unified framework based on multimodal aspect-term extraction and aspect-level sentiment classification, called UMAS, is proposed in this paper. First, a shared feature module is built to model the latent semantic associations between the tasks; the shared representation layer also lets each subtask attend only to its own upper-layer network, which reduces model complexity. Second, the model uses sequence tagging to output multiple aspects and their corresponding sentiment categories in a sentence simultaneously, improving the efficiency of aspect-sentiment pair extraction. In addition, part of speech is introduced into both subtasks: its grammatical information improves the performance of aspect-term extraction, and opinion-word information obtained through part of speech improves the performance of aspect-level sentiment classification. Experimental results show that the unified framework outperforms multiple baseline models on the two benchmark datasets Twitter2015 and Restaurant2014.
Pairwise learning plays an important role in data mining and machine learning. In data mining, its main application scenarios include traditional operational industries, the Internet, and the Internet of Things [1]; in machine learning, they include ranking [2-5], computing the area under the receiver operating characteristic curve [6], and metric learning [7].
Regarding the generalization of online pairwise learning, Wang et al. [8] carried out a generalization analysis under the condition that the loss function is uniformly bounded and gave a regret bound of O(T^{-1}). Ying et al. [9-10] studied online pairwise learning algorithms based on an unregularized reproducing kernel Hilbert space (RKHS) and, without assuming strong convexity or boundedness of the loss function, obtained a regret bound of O(T^{-1/3}\log T). Chen et al. [11] assumed that the iterate sequence satisfies a tighter uniform constraint, gave a regret bound of O(T^{-1/2}), and improved the convergence rate of the algorithm's last iterate. Guo et al. [12], working in a regularized RKHS, obtained an O(T^{-1/4}{\log ^{1/2}}T) convergence rate for a specific hinge loss. Wang et al. [13] analyzed the last-iterate convergence of online pairwise learning with polynomially decaying step sizes and multiple regularization parameters, giving a regret bound of O(T^{-2/3}). Reference [14] presented an error analysis of distributed online pairwise learning with multiple regularization terms based on a divide-and-conquer strategy.
As one of the important tools for analyzing generalization, stability analysis has been widely applied to the study of generalization bounds for pointwise learning algorithms [15-16]. However, apart from the work of Shen et al. [17] and Lei et al. [18], there is little research on the stability of pairwise learning. Shen et al. [17] established stability results for stochastic gradient descent (SGD) with convex and strongly convex loss functions, derived generalization bounds from them, and traded off the generalization error and optimization error under SGD. Lei et al. [18] provided a more refined stability analysis and improved the generalization bound to O(\gamma \log n).
All of the work above assumes that the loss function is strongly convex or convex; a theoretical analysis of online pairwise learning with non-convex loss functions is lacking. To address this gap, this paper derives, via stability analysis, regret bounds for online pairwise learning with non-convex losses. The contributions are twofold: 1) a generalized online pairwise learning framework that extends to non-convex loss functions; 2) a relationship between stability and regret under this framework, together with the framework's regret bound and its theoretical analysis.
1. Generalized Online Pairwise Learning Framework
1.1 Generalized Online Pairwise Learning
Traditional machine learning differs from the online learning paradigm. In traditional machine learning, all training samples are taken at once for learning, and the samples are split into training and test sets. Online learning makes no such distinction: the learner acquires samples in real time and adjusts its hypothesis once for every instance it receives.
Assume a sample space Z = X \times Y, where X \subset {\mathbb{R}^d} is the input space and Y \subset \mathbb{R} is the output space, and {{\boldsymbol{x}}} \in X and y \in Y are the input and output, respectively. Online pairwise learning proceeds for T rounds: in each round the learner receives two instances ({{\boldsymbol{x}}_t},{y_t}),({\boldsymbol{x}}_t^\prime ,y_t^\prime ), makes a prediction, then receives the labels, and updates the hypothesis {{\boldsymbol{w}}_{t + 1}} \in W according to the loss incurred under the loss function {\ell _t} \in L.
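The round structure just described can be sketched as a short loop. This is a minimal illustration only: the squared pairwise loss and the online gradient-descent update below are our own illustrative choices, not the paper's loss or algorithm, and `online_pairwise_learning` is a hypothetical name.

```python
import numpy as np

def online_pairwise_learning(stream, d, lr=0.03):
    """Minimal sketch of the online pairwise learning loop: each round the
    learner receives a pair of instances (x_t, y_t), (x'_t, y'_t), suffers a
    pairwise loss, and updates the hypothesis w_t (an illustrative OGD step)."""
    w = np.zeros(d)
    losses = []
    for (x, y), (xp, yp) in stream:
        err = (w @ x - w @ xp) - (y - yp)   # pairwise prediction error
        losses.append(err ** 2)             # suffered pairwise loss
        w = w - lr * 2.0 * err * (x - xp)   # online gradient step on the pair
    return w, losses

# Usage: pairs generated from a ground-truth w*; the suffered losses shrink
# as w_t approaches w*.
rng = np.random.default_rng(0)
w_star = np.array([1.0, -1.0])
stream = []
for _ in range(200):
    x, xp = rng.normal(size=2), rng.normal(size=2)
    stream.append(((x, w_star @ x), (xp, w_star @ xp)))
w, losses = online_pairwise_learning(stream, d=2)
```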
Generalized online pairwise learning is formulated as
{{\boldsymbol{w}}_t} \in {{\rm{arg\;min}}} \left\{ {\sum\limits_{i = 1}^{t - 1} {{\ell _i}} ({{\boldsymbol{w}}},{z},{{z}^\prime }) + {{{\boldsymbol{\sigma}} }^{\text{T}}}{{\boldsymbol{w}}}:{{\boldsymbol{w}}} \in W} \right\}.
Clearly, compared with plain online pairwise learning, the generalized model is a more robust and more general framework. It contains a {{\boldsymbol{\sigma }}} term: random noise sampled from {\text{exp}}(\eta ) (the exponential distribution with parameter \eta ). Introducing the random noise term {{\boldsymbol{\sigma }}} avoids overfitting and thereby improves the generalization ability of online pairwise learning. FTRL (follow the regularized leader) is a highly effective algorithm widely used for convex hypotheses [19]. In fact, when the random noise in generalized online pairwise learning is taken to be \mu {{\boldsymbol{w}}}, FTRL becomes a special case of the framework:
\begin{gathered} {{\boldsymbol{w}}_t} \in {{\rm{arg\;min}}} \left\{ {\sum\limits_{i = 1}^{t - 1} {{\ell _i}} ({\boldsymbol{w}},{z},{{z}^\prime }) + \mu {{\left\| {\boldsymbol{w}}\right\|}^2}} \right\}, \\ \mu \approx {T^\rho },\rho \in (0,1). \\ \end{gathered}
1.2 The Offline Oracle Model
Intuitively, the quality of online learning is measured by the cumulative sum of the losses the learner suffers, and the learner's goal is to make this cumulative loss as small as possible; this is also how performance is measured in the usual learning paradigm. However, data in online learning are often generated adversarially: the adversary knows everything about the learner, including its prediction function and loss function. The adversary can therefore always output labels opposite to the learner's predictions, so that the learner is always wrong and suffers maximal loss. In such an adversarial environment the cumulative-loss measure is ineffective, and the notion of an expert is introduced to resolve this. An expert is a mapping h:X \to Y that outputs a prediction h({\boldsymbol{x}}) for an input {\boldsymbol{x}} \in X. Unlike the learner, an expert is fixed: the learner updates over time according to its losses, while an expert's predictions are unaffected by the adversary. The experts' losses serve as a reference to calibrate the learner's predictions: when choosing among experts, the learner simply picks the expert with the smallest cumulative loss on the input instances, i.e., the best expert. With experts in play, online learning performance is measured by the cumulative difference between the learner's loss and the best expert's loss, which neutralizes the adversary's interference.
Online pairwise learning with expert advice is a repeated game between the learner and an adversary. The learner chooses from a finite set H of N experts, the adversary chooses from a decision set Y, and there is a loss function {\ell _t} \in L. Before the game begins, the adversary chooses an arbitrary decision sequence {y_1},{y_2}, … from Y. In each round t = 1,2, …, the learner must choose (possibly at random) an expert {h_t} \in H ; the adversary then reveals its decision {y_t} \in Y, and the learner suffers a loss. After receiving each pair of instances {z},{{z}^\prime } \in Z, the learner's goal is to make the cumulative difference between its loss and the loss of the best hypothesis {{\boldsymbol{w}}^*} \in W as small as possible. The regret bound is therefore the performance criterion for online pairwise learning; for an algorithm \mathcal{A} it is defined as
{\mathcal{R}}_{\mathcal{A}}(T)={\displaystyle \sum _{t=1}^{T}\left[{\ell }_{t}\left({{\boldsymbol{w}}}_{t},{z},{{z}}^{\prime }\right)-{\ell }_{t}\left({{\boldsymbol{w}}}^{*},{z},{{z}}^{\prime }\right)\right]}.
The oracle-based learning model may be called "optimizable experts". This model is essentially the classical online learning model: the learner predicts using expert advice and an offline oracle. In the "optimizable experts" model, the learner is initially assumed to know nothing about the loss function \ell and is allowed to access \ell through the oracle. The offline oracle model takes a sequence of loss functions submitted by the learner and returns the hypothesis minimizing the cumulative loss. Definition 1 gives an approximate version of the offline oracle model.
Definition 1 [20]. An offline oracle takes as input a loss function \ell :W \to \mathbb{R} and a d-dimensional vector {{\boldsymbol{\sigma}} }, and outputs an approximate minimizer of {{\boldsymbol{w}}} \to \ell ({\boldsymbol{w}}, {z},{{z}^\prime }) - \langle {{\boldsymbol{\sigma}} },{\boldsymbol{w}}\rangle. If the {{\boldsymbol{w}}^*} \in W it returns satisfies
\begin{split}&\ell \left({{\boldsymbol{w}}}^{*},{z},{{z}}^{\prime }\right)-\langle {\boldsymbol{\sigma}} ,{{\boldsymbol{w}}}^{*}\rangle \le \\& \underset{{\boldsymbol{w}}\in W}{\mathrm{inf}}[\ell ({\boldsymbol{w}},{z},{{z}}^{\prime })-\langle {\boldsymbol{\sigma}} ,{\boldsymbol{w}}\rangle ]+\left(\alpha +\beta {\Vert {\boldsymbol{\sigma}} \Vert }_{1}\right),\end{split}
then it is called an "(\alpha ,\beta )-approximate oracle".
For convenience, an "(\alpha ,\beta )-approximate oracle" is written {{ORA}}{{{L}}_{\alpha ,\beta }} (\ell - \langle {{\boldsymbol{\sigma}} }, \cdot \rangle ). An approximate oracle is used because we cannot know whether a hypothesis is a global minimum or merely a saddle point. {{ORAL}} may output a {{\boldsymbol{w}}} without any guarantee of optimality; in most cases {\boldsymbol{w}} is in fact only approximately optimal. The difference between the offline oracle model (which returns {{\boldsymbol{w}}^*} \in \arg \; \min \ell ({\boldsymbol{w}})) and the (\alpha ,\beta )-approximate oracle is that the latter has a perturbation term in the variables \alpha ,\beta ,{{\boldsymbol{\sigma}} }; better regret bounds can be obtained by specifying the magnitudes of \alpha and \beta . Applications of approximate oracles to online pointwise learning include: the Query-GAN algorithm with an approximate oracle converges at rate {T^{ - 1/2}} for non-convex losses [21], with \alpha = \dfrac{{\hat R + {R_d}(T)}}{T} (\hat R being the cumulative regret); the FTRL and FTL (follow the leader) algorithms with an approximate oracle achieve a {T^{ - 1}} regret bound for semi-convex losses [22], with \alpha = {T^{ - 1/2}}. In these applications \beta {\text{ = }}0, and only \alpha is used as the oracle's parameter.
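A concrete stand-in for an (\alpha ,\beta )-approximate oracle is plain gradient descent on \ell ({\boldsymbol{w}}) - \langle {\boldsymbol{\sigma }},{\boldsymbol{w}}\rangle : on a non-convex \ell it may stop at a stationary point rather than the global minimum, which is exactly the gap the (\alpha ,\beta ) slack in Definition 1 allows. The sketch below assumes user-supplied `loss`/`grad` callables; `approximate_oracle` is a hypothetical name, not an implementation from the paper.

```python
import numpy as np

def approximate_oracle(loss, grad, sigma, d, steps=500, lr=0.01):
    """Gradient-descent stand-in for ORAL_{alpha,beta}: minimize
    F(w) = loss(w) - <sigma, w> and return the best iterate seen."""
    w = np.zeros(d)
    best_w, best_val = w, loss(w) - sigma @ w
    for _ in range(steps):
        w = w - lr * (grad(w) - sigma)      # descent step on F
        val = loss(w) - sigma @ w
        if val < best_val:                  # keep the best iterate, since GD
            best_w, best_val = w, val       # gives no optimality guarantee
    return best_w

# Usage: for the convex loss ||w||^2 the minimizer of F is sigma / 2.
w = approximate_oracle(lambda v: float(v @ v), lambda v: 2.0 * v,
                       sigma=np.array([1.0, 1.0]), d=2)
```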
2. Stability Analysis of Non-convex Generalized Online Pairwise Learning
2.1 The Non-convex Generalized Online Pairwise Learning Algorithm
The non-convex generalized online pairwise learning algorithm iterates for T rounds. In each round it draws a random vector {{\boldsymbol{\sigma }}} (whose coordinates follow an exponential distribution with parameter \eta ) and obtains a hypothesis {{\boldsymbol{w}}} from the (\alpha ,\beta )-approximate offline oracle; the learner then suffers the corresponding loss and adjusts its hypothesis. Algorithm 1 states the non-convex generalized online pairwise learning algorithm when the learner has access to an (\alpha ,\beta )-approximate offline oracle.
Algorithm 1. Non-convex generalized online pairwise learning.
Input: parameter \eta , approximate oracle {{ORAL}_{\alpha ,\beta }};
Output: {{{\boldsymbol{w}}}^*} \in W.
① for t = 1,2, … ,T do
② \left\{ {{\sigma _{t,j}}} \right\}_{j = 1}^d\mathop \sim \limits^{{\rm{i}}{\rm{.i}}{\rm{.d}}} \exp (\eta );/*draw the random vector {{\boldsymbol{\sigma }}_t}*/
③ the learner's hypothesis at time t:
④ {{{\boldsymbol{w}}}_t} = {ORAL}_{\alpha ,\beta }\left( {\displaystyle \sum\limits_{i = 1}^{t - 1} {{\ell _i}} - \left\langle {{{{\boldsymbol{\sigma}} }_t}, \cdot } \right\rangle } \right);
⑤ the learner suffers loss {\ell _t};
⑥ end for
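Algorithm 1 can be sketched end to end as follows. This is a runnable illustration under stated assumptions, not the paper's implementation: `loss`/`grad` are assumed per-pair callables for an illustrative quadratic pairwise loss, and an inner gradient-descent loop stands in for the (\alpha ,\beta )-approximate oracle {{ORAL}_{\alpha ,\beta }}.

```python
import numpy as np

def algorithm1(pairs, loss, grad, eta, d, oracle_steps=200, lr=0.05, seed=0):
    """Sketch of Algorithm 1: each round draws sigma_t with i.i.d. Exp(eta)
    coordinates (step 2), calls a GD stand-in for the approximate oracle on
    sum_{i<t} l_i - <sigma_t, .> (step 4), then suffers l_t (step 5)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    suffered = []
    for t, (z, zp) in enumerate(pairs):
        sigma = rng.exponential(scale=1.0 / eta, size=d)   # step 2: E[sigma_i] = 1/eta
        step = lr / (t + 1)                                # damp as history grows
        for _ in range(oracle_steps):                      # step 4: approximate oracle
            g = -sigma + sum(grad(w, zi, zpi) for zi, zpi in pairs[:t])
            w = w - step * g
        suffered.append(loss(w, z, zp))                    # step 5: suffer l_t
    return w, suffered

# Usage with an illustrative quadratic pairwise loss and pairs drawn from a
# ground-truth w*.
rng = np.random.default_rng(1)
w_star = np.array([1.0, -1.0])
def make_pair():
    x, xp = rng.normal(size=2), rng.normal(size=2)
    return (x, w_star @ x), (xp, w_star @ xp)
pairs = [make_pair() for _ in range(10)]
def pw_loss(w, z, zp):
    (x, y), (xp, yp) = z, zp
    return ((w @ x - w @ xp) - (y - yp)) ** 2
def pw_grad(w, z, zp):
    (x, y), (xp, yp) = z, zp
    return 2.0 * ((w @ x - w @ xp) - (y - yp)) * (x - xp)
w, suffered = algorithm1(pairs, pw_loss, pw_grad, eta=100.0, d=2)
```

With a large \eta the perturbation is small in expectation, so the final hypothesis ends up close to w*.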
Compared with online pairwise learning, generalized online pairwise learning contains a {\boldsymbol{\sigma }} term, making it a more robust and more general algorithm. Online learning algorithms such as online mirror descent (OMD) [23], online gradient descent (OGD) [24], FTL, and FTRL typically require convexity or even strong convexity to converge. Reference [20] guarantees vanishing regret through a random perturbation of the loss function, which plays a role similar to the explicit regularization term in FTRL and OMD, and thereby extends generalized online pointwise learning to the non-convex setting; however, the non-convex online pairwise case is missing. Addressing this gap, this paper extends pointwise learning to the pairwise setting: through stability analysis, the coupling between the points of a pair is decoupled by transforming the coupled pair into the difference between the hypotheses of two successive steps, which realizes the generalization from pointwise to pairwise learning.
2.2 Stability Analysis
Algorithmic stability is an important concept in the theoretical analysis of machine learning. Stability measures how sensitive a learning algorithm's output is to small changes in the training set. For batch learning with i.i.d. samples, stability is a key property for learnability; likewise, stability conditions are equally effective for the learnability of online learning. A particularly common stability measure is uniform stability, which has been widely applied to online pairwise learning; in addition there is the average stability given by Definition 2. In what follows, {{\boldsymbol{a}}^{\text{T}}}{\boldsymbol{b}} or \langle {\boldsymbol{a}},{\boldsymbol{b}}\rangle denotes the Euclidean inner product of {\boldsymbol{a}} and {\boldsymbol{b}}, \Vert \cdot \Vert denotes the 2-norm, and \Vert \cdot {\Vert }_{p} denotes a particular {\ell _p} norm. If \left| {f(x) - f(y)} \right| \leqslant G\left\| {x - y} \right\|,\forall x,y \in \mathcal{C} , then f:\mathcal{C} \to \mathbb{R} is said to be G-Lipschitz continuous with respect to the norm \Vert \cdot \Vert .
Definition 2 [18,25]. Suppose the sequence of losses suffered by the learner is G-Lipschitz continuous. If \exists \gamma \gt 0 such that
\frac{1}{T}{\displaystyle \sum _{t=1}^{T}{{{E}}}}\Vert {\ell }_{t}\left({{\boldsymbol{w}}}_{t},{z},{{z}}^{\prime }\right)-{\ell }_{t+1}\left({{\boldsymbol{w}}}_{t+1},{z},{{z}}^{\prime }\right)\Vert \le G\gamma \text{,} (1)
then algorithm \mathcal{A} is said to be \gamma -average stable.
Clearly, average stability is weaker than uniform stability (||{\ell _t}\left( {{{{\boldsymbol{w}}}_t},z,{z^\prime }} \right) -{\ell _{t + 1}}\left( {{{{\boldsymbol{w}}}_{t + 1}},{z},{z^\prime }} \right)|| \leqslant G\gamma), since the former only requires that inequality (1) hold for {{{\boldsymbol{w}}}_t} in expectation and on average. This paper studies average stability, and all theorems use it, so as to relax the assumptions in the theoretical analysis.
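The stability quantity used in the analysis below is, empirically, just the average distance between consecutive hypotheses. A small helper (a hypothetical name, for illustration only) makes this concrete for a recorded trajectory of iterates:

```python
import numpy as np

def average_stability(iterates):
    """Empirical counterpart of the stability term in the analysis:
    (1/T) * sum_t ||w_t - w_{t+1}||_1 over a recorded list of hypotheses."""
    ws = np.asarray(iterates, dtype=float)
    diffs = np.abs(np.diff(ws, axis=0)).sum(axis=1)  # l1 gaps between neighbors
    return float(np.mean(diffs))

# Usage: three 2-d hypotheses with consecutive l1 gaps 1 and 1.
s = average_stability([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
```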
The stability analysis first connects the expected regret of generalized online pairwise learning with average stability; Theorem 1 establishes this correlation.
Theorem 1. Suppose D bounds the decision set W \subseteq {\mathbb{R}^d} and the losses suffered by the learner are G-Lipschitz continuous with respect to the {\ell _1} norm. Then the regret of generalized online pairwise learning is bounded above as
\begin{split} & \frac{1}{T}{E}\left[{\displaystyle \sum _{t=1}^{T}{\ell }_{t}}\left({{\boldsymbol{w}}}_{t},z,{z}^{\prime }\right)-\underset{{\boldsymbol{w}}\in W}{\mathrm{inf}}{\displaystyle \sum _{t=1}^{T}{\ell }_{t}}({\boldsymbol{w}},z,{z}^{\prime })\right]\le \\ & \dfrac{G}{T}{\displaystyle \sum _{t=1}^{T}\underset{\text{Stability }}{\underbrace{{E}\left[{\Vert {{\boldsymbol{w}}}_{t}-{{\boldsymbol{w}}}_{t+1}\Vert }_{1}\right]}}}+\dfrac{d(\beta T+D)}{\eta T}+\alpha . \end{split} (2)
Proof. In the "oblivious adversary" model, the adversary's decisions {\text{\{ }}{\ell _t}{\text{\} }}_{t = 1}^T are assumed independent of the hypotheses \{ {{\boldsymbol{w}}_t}\} _{t = 1}^T of the generalized online pairwise learning algorithm, and the sequence of loss functions {\text{\{ }}{\ell _t}{\text{\} }}_{t = 1}^T is fixed in advance. In the "non-oblivious adversary" model, the adversary's decisions depend on the algorithm's past hypotheses: each {\ell _t} is given by {\ell _t}: = {L_t}[{{\boldsymbol{w}}_{ \lt t}}] for some function {L_t}:{W^{t - 1}} \to F , where F is the set of all possible adversary decisions and {{\boldsymbol{w}}_{ \lt t}} abbreviates {{\boldsymbol{w}}_1}, {{\boldsymbol{w}}_2}, … ,{{\boldsymbol{w}}_{t - 1}}; in the oblivious case each {L_t} is a constant function, and the functions {L_1},{L_2}, … ,{L_T} determine a non-oblivious adversary. Let {P_t} be the conditional distribution of the hypothesis {{\boldsymbol{w}}_t} given the previous hypotheses {{\boldsymbol{w}}_{ \lt t}}; under an oblivious adversary {P_t} is independent of {{\boldsymbol{w}}_{ \lt t}}, but in either model {P_t} is fully determined by the adversary's previous decisions {\ell _{ \lt t}}. Any algorithm whose regret is bounded against an oblivious adversary is also bounded against a non-oblivious one [26]: letting B be a positive constant, if generalized online pairwise learning satisfies the oblivious-adversary regret bound {E}\left[ {\displaystyle \sum\limits_{t = 1}^T {{\ell _t}({{\boldsymbol{w}}_t})} - \mathop {\inf }\limits_{{{{\boldsymbol{w}}}} \in W} \displaystyle \sum\limits_{t = 1}^T {{\ell _t}({\boldsymbol{w}})} } \right] \leqslant B, \forall {\ell _1}, {\ell _2}, … ,{\ell _T} \in F, then it also satisfies the non-oblivious-adversary regret bound \displaystyle \sum\limits_{t = 1}^T {{\ell _t}({P_t})} - \mathop {\inf }\limits_{{\boldsymbol{w}} \in W} \displaystyle \sum\limits_{t = 1}^T {{\ell _t}({\boldsymbol{w}})} \leqslant B.
This paper studies the oblivious-adversary setting, so a single random vector {\boldsymbol{\sigma }} suffices instead of generating a fresh random vector in every iteration. Let {{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) denote the hypothesis of non-convex generalized online pairwise learning at iteration t under the random perturbation {\boldsymbol{\sigma }}.
For any {{\boldsymbol{w}}^*} \in W, we have
\begin{array}{l} \displaystyle \sum\limits_{t = 1}^T {\left[ {{\ell _t}\left( {{{\boldsymbol{w}}_t},z,{z^\prime }} \right) - {\ell _t}\left( {{{\boldsymbol{w}}^*},z,{z^\prime }} \right)} \right]} = \\ \displaystyle \sum\limits_{t = 1}^T {\left[ {{\ell _t}\left( {{{\boldsymbol{w}}_t},z,{z^\prime }} \right) - {\ell _t}\left( {{{\boldsymbol{w}}_{t + 1}},z,{z^\prime }} \right)} \right]} + \\ \displaystyle \sum\limits_{t = 1}^T {{\ell _t}\left( {{{\boldsymbol{w}}_{t + 1}},z,{z^\prime }} \right) - {\ell _t}\left( {{{\boldsymbol{w}}^*},z,{z^\prime }} \right)}\leqslant \\ \displaystyle \sum\limits_{t = 1}^T G \left\| {{{\boldsymbol{w}}_t} - {{\boldsymbol{w}}_{t + 1}}} \right\|_{1} + \\ \displaystyle \sum\limits_{t = 1}^T {\left[ {{\ell _t}\left( {{{\boldsymbol{w}}_{t + 1}},z,{z^\prime }} \right) - {\ell _t}\left( {{{\boldsymbol{w}}^*},z,{z^\prime }} \right)} \right]}\;\; . \end{array}
Let \gamma ({\boldsymbol{\sigma }}) = \alpha + \beta {\left\| {\boldsymbol{\sigma }} \right\|_1}. We prove by induction that
\begin{split} & {\displaystyle \sum _{t=1}^{T}\left[{\ell }_{t}\left({{\boldsymbol{w}}}_{t+1},z,{z}^{\prime }\right)-{\ell }_{t}\left({{\boldsymbol{w}}}^{*},z,{z}^{\prime }\right)\right]}\le \\ & \gamma ({\boldsymbol{\sigma}} )T+\langle {\boldsymbol{\sigma }},{{\boldsymbol{w}}}_{2}-{{\boldsymbol{w}}}^{*}\rangle . \end{split}
Base case T = 1: since {{\boldsymbol{w}}_2} is an approximate minimizer of {\ell _1}({\boldsymbol{w}},z,{z^\prime }) - \langle {\boldsymbol{\sigma }},{\boldsymbol{w}}\rangle, we have
\begin{split} & {\ell }_{1}\left({\boldsymbol{w}}_{2},z,{z}^{\prime }\right)-\langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{2}\rangle \le \\ & \underset{{\boldsymbol{w}}\in W}{\mathrm{min}}{\ell }_{1}(\boldsymbol{w},z,{z}^{\prime })-\langle \boldsymbol{\sigma} ,\boldsymbol{w}\rangle +\gamma (\boldsymbol{\sigma} )\le \\ & {\ell }_{1}\left({\boldsymbol{w}}^{*},z,{z}^{\prime }\right)-\langle \boldsymbol{\sigma} ,{\boldsymbol{w}}^{*}\rangle +\gamma (\boldsymbol{\sigma} ). \end{split} (3)
The last inequality in Eq. (3) holds for any {{\boldsymbol{w}}^*} \in W, i.e.
{\ell _1}\left( {{{\boldsymbol{w}}_2},z,{z^\prime }} \right) - {\ell _1}\left( {{{\boldsymbol{w}}^*},z,{z^\prime }} \right) \leqslant \gamma ({\boldsymbol{\sigma }}) + \left\langle {{\boldsymbol{\sigma }},{{\boldsymbol{w}}_2} - {{\boldsymbol{w}}^*}} \right\rangle .
Induction step: suppose the claim holds for all T \leqslant {T_0} - 1; we show it also holds for {T_0}.
\begin{split} & {\displaystyle \sum _{t=1}^{{T}_{0}}{\ell }_{t}\left({\boldsymbol{w}}_{t+1},z,{z}^{\prime }\right)}\stackrel{\textcircled{\scriptsize{1}}}{\le }\\ & \left[\begin{array}{c}{\displaystyle \sum _{t=1}^{{T}_{0}-1}{\ell }_{t}}\left({\boldsymbol{w}}_{{T}_{0}+1},z,{z}^{\prime }\right)+\\ \langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{2}-{\boldsymbol{w}}_{{T}_{0}+1}\rangle +\gamma (\boldsymbol{\sigma} )\left({T}_{0}-1\right)\end{array}\right]+{\ell }_{{T}_{0}}\left({\boldsymbol{w}}_{{T}_{0}+1},z,{z}^{\prime }\right)=\\ & \left[\begin{array}{l}{\displaystyle \sum _{t=1}^{{T}_{0}}{\ell }_{t}}\left({\boldsymbol{w}}_{{T}_{0}+1},z,{z}^{\prime }\right)-\\ \langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{{T}_{0}+1}\rangle \end{array}\right]+\langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{2}\rangle +\gamma (\boldsymbol{\sigma} )\left({T}_{0}-1\right)\stackrel{\textcircled{\scriptsize{2}}}{\le }\\ & {\displaystyle \sum _{t=1}^{{T}_{0}}{\ell }_{t}}\left({\boldsymbol{w}}^{*},z,{z}^{\prime }\right)+\langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{2}-{\boldsymbol{w}}^{*}\rangle +\gamma (\boldsymbol{\sigma} ){T}_{0},\forall {\boldsymbol{w}}^{*}\in {W},\end{split}
where ① holds because the induction hypothesis applies for all T \leqslant {T_0} - 1, and ② follows from the approximate optimality of {{\boldsymbol{w}}_{{T_0} + 1}}.
From the above, the expected regret of non-convex generalized online pairwise learning is bounded by:
\begin{split} {E} & \left[ {\sum\limits_{t = 1}^T {{\ell _t}} \left( {{{\boldsymbol{w}}_t},z,{z^\prime }} \right) - \mathop {\inf }\limits_{{\boldsymbol{w}} \in W} \sum\limits_{t = 1}^T {{\ell _t}} ({\boldsymbol{w}},z,{z^\prime })} \right] \leqslant \\ & G\sum\limits_{t = 1}^T {E} \left[ {{{\left\| {{{\boldsymbol{w}}_t} - {{\boldsymbol{w}}_{t + 1}}} \right\|}_1}} \right] + {E}\left[ {\gamma ({\boldsymbol{\sigma }})T + \left\langle {{\boldsymbol{\sigma }},{{\boldsymbol{w}}_2} - {{\boldsymbol{w}}^*}} \right\rangle } \right] \leqslant \\ & G\sum\limits_{t = 1}^T {E} \left[ {{{\left\| {{{\boldsymbol{w}}_t} - {{\boldsymbol{w}}_{t + 1}}} \right\|}_1}} \right] + (\beta T + D)\left( {\sum\limits_{i = 1}^d {E} \left[ {{{\sigma} _i}} \right]} \right) + \alpha T, \end{split}
By the property {E}\left[ {{\sigma _i}} \right] = \dfrac{1}{{{\eta _i}}} of the exponential distribution, the claim follows. This completes the proof.
Theorem 1 shows that the expected regret is tied to average stability; the stability term in Eq. (2) is exactly the average stability of Definition 2. The proof of Theorem 1 is inspired by Reference [26] and shows that whenever the average stability can be bounded, the regret also converges. Theorem 2 therefore focuses on bounding the stability term {E}\left[ {{{\left\| {{{\boldsymbol{w}}_t} - {{\boldsymbol{w}}_{t + 1}}} \right\|}_1}} \right].
Theorem 2. Let {{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) be the hypothesis of generalized online pairwise learning at iteration t, where {\boldsymbol{\sigma }} is the random perturbation; {{\boldsymbol{e}}_i} denotes the i-th standard basis vector and {{\boldsymbol{w}}_{t,i}} the i-th coordinate of {{\boldsymbol{w}}_t}. For any c \gt 0, the following monotonicity holds:
{{\boldsymbol{w}}}_{t,i}\left(\boldsymbol{\sigma} +c{{\boldsymbol{e}}}_{i}\right)\ge {{\boldsymbol{w}}}_{t,i}(\boldsymbol{\sigma} )-\frac{2\left(\alpha +\beta {\Vert \boldsymbol{\sigma} \Vert }_{1}\right)}{c}-\beta .
Proof. Let {\ell _{1:t}}({\boldsymbol{w}},z,{z^\prime }) = \displaystyle \sum\limits_{i = 1}^t {{\ell _i}} ({\boldsymbol{w}},z,{z^\prime }), {{\boldsymbol{\sigma }}^\prime } = {\boldsymbol{\sigma }} + c{{\boldsymbol{e}}_i}, and let \gamma ({\boldsymbol{\sigma }}) = \alpha + \beta {\left\| {\boldsymbol{\sigma }} \right\|_1} be the offline oracle's approximation error. By the approximate optimality of {{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}), we obtain
\begin{split}& {\ell _{1:t - 1}}\left( {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}),z,{z^\prime }} \right) - \left\langle {{\boldsymbol{\sigma }},{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }})} \right\rangle \le \\ & {\ell _{1:t - 1}}\left( {{{\boldsymbol{w}}_t}\left( {{{\boldsymbol{\sigma }}^\prime }} \right),z,{z^\prime }} \right) - \left\langle {{\boldsymbol{\sigma }},{{\boldsymbol{w}}_t}\left( {{{\boldsymbol{\sigma }}^\prime }} \right)} \right\rangle + \gamma ({\boldsymbol{\sigma }}) \stackrel{\textcircled{\scriptsize{1}}} = \\& {\ell _{1:t - 1}}\left( {{{\boldsymbol{w}}_t}\left( {{{\boldsymbol{\sigma }}^\prime }} \right),z,{z^\prime }} \right) - \left\langle {{{\boldsymbol{\sigma }}^\prime },{{\boldsymbol{w}}_t}\left( {{{\boldsymbol{\sigma }}^\prime }} \right)} \right\rangle + \\& c{{\boldsymbol{w}}_{t,i}}\left( {{{\boldsymbol{\sigma }}^\prime }} \right) + \gamma ({\boldsymbol{\sigma }}) \le \\& {\ell _{1:t - 1}}\left( {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}),z,{z^\prime }} \right) - \left\langle {{{\boldsymbol{\sigma }}^\prime },{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }})} \right\rangle + \\& c{{\boldsymbol{w}}_{t,i}}\left( {{{\boldsymbol{\sigma }}^\prime }} \right) + \gamma ({\boldsymbol{\sigma }}) + \gamma \left( {{{\boldsymbol{\sigma }}^\prime }} \right) = \\& {\ell _{1:t - 1}}\left( {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}),z,{z^\prime }} \right) - \left\langle {{\boldsymbol{\sigma }},{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }})} \right\rangle + \\& c\left( {{{\boldsymbol{w}}_{t,i}}\left( {{{\boldsymbol{\sigma }}^\prime }} \right) - {{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }})} \right) + \gamma ({\boldsymbol{\sigma }}) + \gamma \left( {{{\boldsymbol{\sigma }}^\prime }} \right). \end{split} (4)
where ① follows from the approximate optimality of {{\boldsymbol{w}}_t}\left( {{{\boldsymbol{\sigma }}^\prime }} \right). Combining the first and last terms of Eq. (4) gives
{{\boldsymbol{w}}}_{t,i}\left({\boldsymbol{\sigma}}^{\prime }\right)\ge {\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma})-\dfrac{2\gamma (\boldsymbol{\sigma})}{c}-\beta .
This completes the proof.
Theorem 2 establishes the stability of generalized online pairwise learning with the random perturbation term {\boldsymbol{\sigma }}. By observing how changes in the perturbation vector affect the algorithm's output, it shows monotonicity in the one-dimensional case; since stability in online learning refers to the distance between the hypotheses of two adjacent iterations, Theorem 2 yields stability in the one-dimensional case.
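Theorem 2's monotonicity can be checked numerically for an exact oracle (\alpha = \beta = 0), where the bound reduces to plain coordinate-wise monotonicity in {\boldsymbol{\sigma }}. The sketch below uses an illustrative quadratic loss of our choosing (not the paper's), for which the perturbed minimizer of \ell ({\boldsymbol{w}}) - \langle {\boldsymbol{\sigma }},{\boldsymbol{w}}\rangle has a closed form; `perturbed_minimizer` is a hypothetical name.

```python
import numpy as np

def perturbed_minimizer(A, b, sigma, ridge=1e-9):
    """Exact minimizer of F(w) = sum_i (w . a_i - b_i)^2 - <sigma, w>,
    i.e. a (0, 0)-approximate oracle for this quadratic loss.
    Stationarity: (2 A^T A) w = 2 A^T b + sigma."""
    d = A.shape[1]
    H = 2.0 * A.T @ A + ridge * np.eye(d)   # tiny ridge keeps H invertible
    return np.linalg.solve(H, 2.0 * A.T @ b + sigma)

# Empirical check: increasing the i-th coordinate of sigma cannot decrease
# the i-th coordinate of the returned hypothesis (Theorem 2 with alpha = beta = 0).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(5, 3)), rng.normal(size=5)
sigma = rng.exponential(scale=1.0, size=3)
for i in range(3):
    w = perturbed_minimizer(A, b, sigma)
    w_shift = perturbed_minimizer(A, b, sigma + 1.0 * np.eye(3)[i])
    assert w_shift[i] >= w[i] - 1e-9
```

For this quadratic case the shift is c\,[H^{-1}]_{ii} \ge 0, so the check holds exactly; the theorem's \alpha ,\beta slack covers oracles that are only approximately optimal.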
Theorem 3. Let {{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) be the hypothesis of generalized online pairwise learning at iteration t, where {\boldsymbol{\sigma }} is the random perturbation; {{\boldsymbol{e}}_i} denotes the i-th standard basis vector and {{\boldsymbol{w}}_{t,i}} the i-th coordinate of {{\boldsymbol{w}}_t}. Suppose {\left\| {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }})} \right\|_1} \leqslant 10d\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right|. Then, for {{\boldsymbol{\sigma }}^\prime } = {\boldsymbol{\sigma }} + 100Gd{{\boldsymbol{e}}_i}, the following monotonicity holds:
\begin{split}& \mathrm{min}\left({\boldsymbol{w}}_{t,i}\left({\boldsymbol{\sigma} }^{\prime }\right),{\boldsymbol{w}}_{t+1,i}\left({\boldsymbol{\sigma} }^{\prime }\right)\right)\ge \mathrm{max}\left({\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} ),{\boldsymbol{w}}_{t+1,i}(\boldsymbol{\sigma} )\right)-\\& \dfrac{1}{10}\left|{\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} )-{\boldsymbol{w}}_{t+1,i}(\boldsymbol{\sigma} )\right|-\dfrac{3\left(\alpha +\beta {\Vert \boldsymbol{\sigma} \Vert }_{1}\right)}{100Gd}-\beta . \end{split}
Proof. Let {\ell _{1:t}}({\boldsymbol{w}},z,{z^\prime }) =\displaystyle \sum\limits_{i = 1}^t {{\ell _i}} ({\boldsymbol{w}},z,{z^\prime }), {{\boldsymbol{\sigma }}^\prime } = {\boldsymbol{\sigma }} + c{{\boldsymbol{e}}_i}, and let \gamma ({\boldsymbol{\sigma }}) = \alpha + \beta {\left\| {\boldsymbol{\sigma }} \right\|_1} be the offline oracle's approximation error. By the approximate optimality of {{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}), we obtain
\begin{split}&{\ell }_{1:t-1}\left({\boldsymbol{w}}_{t}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)-\langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{t}(\boldsymbol{\sigma} )\rangle +{\ell }_{t}\left({\boldsymbol{w}}_{t}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)\le \\& {\ell }_{1:t-1}\left({\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)-\langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} )\rangle +\\& {\ell }_{t}\left({\boldsymbol{w}}_{t}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)+\gamma (\boldsymbol{\sigma} )\stackrel{\textcircled{\scriptsize{1}}}{\le }\\& {\ell }_{1:t-1}\left({\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)-\langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} )\rangle +\\& {\ell }_{t}\left({\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)+G{\Vert {\boldsymbol{w}}_{t}(\boldsymbol{\sigma} )-{\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} )\Vert }_{1}+\gamma (\boldsymbol{\sigma} )\stackrel{\textcircled{\scriptsize{2}}}{\le }\\& {\ell }_{1:t-1}\left({\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)-\langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} )\rangle +\\& {\ell }_{t}\left({\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)+ 10Gd\left|{\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} )-{\boldsymbol{w}}_{t+1,i}(\boldsymbol{\sigma} )\right|+\gamma (\boldsymbol{\sigma} ),\end{split} (5)
where ① follows from the Lipschitz continuity of {\ell _t}( \cdot ) and ② from the assumption on {\left\| {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }})} \right\|_1}. By the approximate optimality of {{\boldsymbol{w}}_{t + 1}}\left( {{{\boldsymbol{\sigma }}^\prime }} \right), we obtain
\begin{split}&{\ell }_{1:t-1}\left({\boldsymbol{w}}_{t}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)-\langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{t}(\boldsymbol{\sigma} )\rangle +{\ell }_{t}\left({\boldsymbol{w}}_{t}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)=\\& {\ell }_{1:t-1}\left({\boldsymbol{w}}_{t}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)-\langle {\boldsymbol{\sigma} }^{\prime },{\boldsymbol{w}}_{t}(\boldsymbol{\sigma} )\rangle +{\ell }_{t}\left({\boldsymbol{w}}_{t}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)+\\& \langle 100Gd{{\boldsymbol{e}}}_{i},{\boldsymbol{w}}_{t}(\boldsymbol{\sigma} )\rangle \ge \\& {\ell }_{1:t-1}\left({\boldsymbol{w}}_{t+1}\left({\boldsymbol{\sigma} }^{\prime }\right),z,{z}^{\prime }\right)-\langle {\boldsymbol{\sigma} }^{\prime },{\boldsymbol{w}}_{t+1}\left({\boldsymbol{\sigma} }^{\prime }\right)\rangle +\\& {\ell }_{t}\left({\boldsymbol{w}}_{t+1}\left({\boldsymbol{\sigma} }^{\prime }\right),z,{z}^{\prime }\right)+100Gd{\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} )-\gamma \left({\boldsymbol{\sigma} }^{\prime }\right)=\\& {\ell }_{1:t-1}\left({\boldsymbol{w}}_{t+1}\left({\boldsymbol{\sigma} }^{\prime }\right),z,{z}^{\prime }\right)-\langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{t+1}\left({\boldsymbol{\sigma} }^{\prime }\right)\rangle -\\& \gamma \left({\boldsymbol{\sigma} }^{\prime }\right)+{\ell }_{t}\left({\boldsymbol{w}}_{t+1}\left({\boldsymbol{\sigma} }^{\prime }\right),z,{z}^{\prime }\right)+\\& 100Gd\left({\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} )-{\boldsymbol{w}}_{t+1,i}\left({\boldsymbol{\sigma} }^{\prime }\right)\right)\stackrel{\textcircled{\scriptsize{1}}}{\ge }{\ell }_{1:t-1}\left({\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)-\\& \langle \boldsymbol{\sigma} ,{\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} )\rangle +{\ell }_{t}\left({\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma} ),z,{z}^{\prime }\right)+\\& 100Gd\left({\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} )-{\boldsymbol{w}}_{t+1,i}\left({\boldsymbol{\sigma} }^{\prime }\right)\right)-\gamma \left({\boldsymbol{\sigma} }^{\prime }\right)-\gamma (\boldsymbol{\sigma} ),\end{split} (6)
where ① follows from the approximate optimality of {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }}). Combining Eqs. (5) and (6) gives
\begin{split}&{\boldsymbol{w}}_{t+1,i}\left({\boldsymbol{\sigma} }^{\prime }\right)-{\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} )\ge \\& -\dfrac{1}{10}\left|{\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} )-{\boldsymbol{w}}_{t+1,i}(\boldsymbol{\sigma} )\right|-\dfrac{3\gamma (\boldsymbol{\sigma} )}{100Gd}-\beta . \end{split} (7)
Similarly,
\begin{split}&{\boldsymbol{w}}_{t,i}\left({\boldsymbol{\sigma} }^{\prime }\right)-{\boldsymbol{w}}_{t+1,i}(\boldsymbol{\sigma} )\ge \\& -\dfrac{1}{10}\left|{\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} )-{\boldsymbol{w}}_{t+1,i}(\boldsymbol{\sigma} )\right|-\dfrac{3\gamma (\boldsymbol{\sigma} )}{100Gd}-\beta . \end{split} (8)
By the monotonicity in Theorem 2,
{{\boldsymbol{w}}_{t + 1,i}}\left( {{{\boldsymbol{\sigma }}^\prime }} \right) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }}) \geqslant - \frac{{3\gamma ({\boldsymbol{\sigma }})}}{{100Gd}} - \beta ,\quad (9)
{\boldsymbol{w}}_{t,i}\left({\boldsymbol{\sigma} }^{\prime }\right)-{\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} )\ge -\frac{3\gamma (\boldsymbol{\sigma} )}{100Gd}-\beta . (10)
Combining inequalities (7)–(10) completes the proof.
Theorem 3 establishes the stability of generalized online pairwise learning in the d-dimensional case. Although proving monotonicity in d dimensions is more challenging than in one dimension, the analysis can be effectively reduced to one dimension by changing each coordinate of the perturbation term separately. In the same way, the monotonicity of Theorem 2 extends the stability of online pairwise learning to the d-dimensional case.
3. Regret Bounds for Non-convex Generalized Online Pairwise Learning
Using the stability analysis provided by Theorem 1, we now study the regret bound of non-convex generalized online pairwise learning. Since Theorems 2 and 3 bound the stability term {E}\left[ {{{\left\| {{{\boldsymbol{w}}_t} - {{\boldsymbol{w}}_{t + 1}}} \right\|}_1}} \right] in the one-dimensional and high-dimensional cases respectively, combining them leads to Theorem 4, which is discussed below.
Theorem 4. Suppose D bounds the decision set W \subseteq {\mathbb{R}^d}, the losses suffered by the learner are G-Lipschitz continuous with respect to the {\ell _1} norm, and the learner has access to an (\alpha, \beta) -approximate oracle. For any \eta , the hypotheses of generalized online pairwise learning satisfy the regret bound:
\begin{split}&\left[\dfrac{1}{T}{\displaystyle \sum _{t=1}^{T}{\ell }_{t}}\left({{\boldsymbol{w}}}_{t},z,{z}^{\prime }\right)-\dfrac{1}{T}\underset{{\boldsymbol{w}}\in W}{\mathrm{inf}}{\displaystyle \sum _{t=1}^{T}{\ell }_{t}}({\boldsymbol{w}},z,{z}^{\prime })\right]\le \\ & O\left(\eta {d}^{2}D{G}^{2}+\dfrac{d(\beta T+D)}{\eta T}+\alpha +\beta dG\right). \end{split}
Proof. Using the same notation as in Theorems 2 and 3, {E}\left[ {{{\left\| {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }})} \right\|}_1}} \right] can also be written as
\begin{split} & {E}\left[ {{{\left\| {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }})} \right\|}_1}} \right] =\displaystyle \sum\limits_{i = 1}^d {E} \left[ {\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right|} \right]. \\ \end{split} (11)
Hence a bound on {E}\left[ {\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right|} \right],\forall i \in [d], yields a bound on {E}\left[ {{{\left\| {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }})} \right\|}_1}} \right]. For any i \in [d], define {{E}_{ - i}}\left[ {\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right|} \right] as
\begin{split}&{{E}}_{-i}\left[\left|{\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} )-{\boldsymbol{w}}_{t+1,i}(\boldsymbol{\sigma} )\right|\right] := {E} \left[\left| {\boldsymbol{w}}_{t,i}(\boldsymbol{\sigma} )-{\boldsymbol{w}}_{t+1,i}(\boldsymbol{\sigma} ) \right| \bigg| {\left\{{\sigma }_{j}\right\}}_{j\ne i}\right],\end{split}
where {\sigma _j} is the j-th coordinate of {\boldsymbol{\sigma }}.
Let {{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }}) = \max \left( {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}),{{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right) and, similarly, {{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }}) = \min \left( {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}),{{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right). Then
\begin{split} &{{E}_{ - i}}\left[ {\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right|} \right] = {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }})} \right] - {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }})} \right]. \\ \end{split}
Define
\varepsilon = \left\{ \begin{gathered} {\boldsymbol{\sigma }}:{\left\| {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }})} \right\|_1} \leqslant \\ 10d\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right| \\ \end{gathered} \right\}.
We lower-bound the terms {T_1} and {T_2} in Eq. (12):
\begin{split} &{{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }})} \right] = \\ & {P}\left( {{\sigma _i} < 100Gd} \right)\underbrace {{{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{{\text{min}},i}}({\boldsymbol{\sigma }})|{\sigma _i} < 100Gd} \right]}_{{T_1}} + \\ & \underbrace {{P}\left( {{\sigma _i} \geqslant 100Gd} \right){{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }})|{\sigma _i} \geqslant 100Gd} \right]}_{{T_2}}. \\ \end{split} (12)
Since the domain of the i-th coordinate lies in an interval of length D, and {T_1} and {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }})} \right] are points of this interval, their difference is bounded by D; hence {T_1} is lower-bounded by {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }})} \right] - D. Rewrite {T_2} as:
\begin{split} &{T_2} = {P}\left( {{\sigma _i} \geqslant 100Gd} \right){{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }})|{\sigma _i} \geqslant 100Gd} \right] = \\ & \displaystyle \int_{{\sigma _i} = 100Gd}^\infty {{{\boldsymbol{w}}_{\min ,i}}} ({\boldsymbol{\sigma }})P\left( {{\sigma _i}} \right){\text{d}}{\sigma _i} = \\ & \displaystyle \int_{{\sigma _i} = 100Gd}^\infty {{{\boldsymbol{w}}_{\min ,i}}} ({\boldsymbol{\sigma }})\eta \exp ( - \eta {\sigma _i}{\text{)d}}{\sigma _i}. \\ \end{split}
Changing a variable of integration, let {\sigma _i} = \sigma _i^\prime + 100Gd, where {{\boldsymbol{\sigma }}^\prime } = \left( {{\sigma _1},{\sigma _2}, … ,{\sigma _{i - 1}},\sigma _i^\prime ,{\sigma _{i + 1}}, … } \right) is the vector obtained by replacing the i-th coordinate of {\boldsymbol{\sigma }} with \sigma _i^\prime ; we get
\begin{split} & \int_{{\sigma _i} = 100Gd}^\infty {{{\boldsymbol{w}}_{\min ,i}}} ({\boldsymbol{\sigma }})\eta \exp \left( { - \eta {\sigma _i}} \right){\text{d}}{\sigma _i} = \\ & \int_{\sigma _i^\prime = 0}^\infty {{{\boldsymbol{w}}_{\min ,i}}} \left( {{{\boldsymbol{\sigma }}^\prime } + 100Gd{{\boldsymbol{e}}_i}} \right)\eta \exp \left( { - \eta \left( {\sigma _i^\prime + 100Gd} \right)} \right){\text{d}}\sigma _i^\prime = \\ & \exp \left( { - 100\eta Gd} \right) \times\\ & \int_{\sigma _i^\prime = 0}^\infty {{{\boldsymbol{w}}_{\min ,i}}} \left( {{{\boldsymbol{\sigma }}^\prime } + 100Gd{{\boldsymbol{e}}_i}} \right)\eta \exp \left( { - \eta \sigma _i^\prime } \right){\text{d}}\sigma _i^\prime = \\ & \exp \left( { - 100\eta Gd} \right){{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}\left( {{{\boldsymbol{\sigma }}^\prime } + 100Gd{{\boldsymbol{e}}_i}} \right)} \right]. \\ \end{split}
Then {T_2} = \exp \left( { - 100\eta Gd} \right){{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}\left( {{\boldsymbol{\sigma }} + 100Gd{{\boldsymbol{e}}_i}} \right)} \right]. Substituting the lower bounds on {T_1} and {T_2} into Eq. (12) gives
\begin{split} & {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }})} \right] \geqslant \\ & (1 - \exp ( - 100\eta Gd))\left( {{{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }})} \right] - D} \right) + \\ & \exp ( - 100\eta Gd){{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}\left( {{\boldsymbol{\sigma }} + 100Gd{{\boldsymbol{e}}_i}} \right)} \right]. \\ \end{split} Splitting on the event {\varepsilon}, {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }})} \right] is lower-bounded as:
\begin{split}& {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }})} \right] \geqslant \\ & (1 - \exp ( - 100\eta Gd))\left( {{{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }})} \right] - D} \right) + \\ & \exp ( - 100\eta Gd) \times\\ \end{split} \begin{split}& {{P}_{ - i}}({\varepsilon}){{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}\left( {{\boldsymbol{\sigma }} + 100Gd{{\boldsymbol{e}}_i}} \right)|{\varepsilon}} \right] + \\ & \exp ( - 100\eta Gd)\times \\ & {{P}_{ - i}}\left( {{{\varepsilon}^c}} \right){{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}\left( {{\boldsymbol{\sigma }} + 100Gd{{\boldsymbol{e}}_i}} \right)|{{\varepsilon}^c}} \right], \qquad\;\;\\ \end{split} where {{P}_{ - i}}({\varepsilon}): = {P}\left( {{\varepsilon}|{{\left\{ {{\sigma _j}} \right\}}_{j \ne i}}} \right). Using the monotonicity established in Theorems 2 and 3, we lower-bound {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }})} \right]. Let \gamma ({\boldsymbol{\sigma }}) = \alpha + \beta {\left\| {\boldsymbol{\sigma }} \right\|_1}; then
\begin{split}& {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }})} \right] \geqslant \\ & (1 - \exp ( - 100\eta Gd))\left( {{{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }})} \right] - D} \right) + \\ & \exp ( - 100\eta Gd){{P}_{ - i}}({\varepsilon})\times \\ & {{E}_{ - i}}\left. {\left[ \begin{gathered} {{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }}) - \frac{1}{{10}}\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right| - \\ \frac{{3\gamma ({\boldsymbol{\sigma }})}}{{100Gd}} - \beta |{\varepsilon} \\ \end{gathered} \right.} \right] + \\ & \exp ( - 100\eta Gd){{P}_{ - i}}\left( {{{\varepsilon}^c}} \right){{E}_{ - i}}\left[ \begin{gathered} {{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }}) - \frac{{2\gamma ({\boldsymbol{\sigma }})}}{{100Gd}} - \\ \beta \mid {{\varepsilon}^c} \\ \end{gathered} \right] \geqslant \\& (1 - \exp ( - 100\eta Gd))\left( {{{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }})} \right] - D} \right) + \\& \exp ( - 100\eta Gd){{P}_{ - i}}({\varepsilon}){{E}_{ - i}}\left. {\left[ \begin{gathered} {{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }}) - \\ \frac{1}{{10}}\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right| - \\ \frac{{3\gamma ({\boldsymbol{\sigma }})}}{{100Gd}} - \beta |{\varepsilon} \\ \end{gathered} \right.} \right] + \\ & \exp ( - 100\eta Gd)\times \\& {{P}_{ - i}}\left( {{{\varepsilon}^c}} \right){{E}_{ - i}}\left. {\left[ \begin{gathered} {{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }}) - \frac{1}{{10d}}{\left\| {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }})} \right\|_1} - \\ \frac{{2\gamma ({\boldsymbol{\sigma }})}}{{100Gd}} - \beta |{{\varepsilon}^c} \\ \end{gathered} \right.} \right], \\ \end{split} (13) where the first inequality in Eq. (13) follows from Theorems 2 and 3, and the second from the definition of {\varepsilon ^c}.
Since {{P}_{ - i}}(\varepsilon) \leqslant 1, we obtain
$$ \begin{split} & {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\min ,i}}({\boldsymbol{\sigma }})} \right] \geqslant \\ & (1 - \exp ( - 100\eta Gd))\left( {{{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }})} \right] - D} \right) + \\ & \exp ( - 100\eta Gd){{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }}) - \frac{{3\gamma ({\boldsymbol{\sigma }})}}{{100Gd}} - \beta } \right] - \\ & \exp ( - 100\eta Gd){{E}_{ - i}}\left[ {\frac{1}{{10}}\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right| + \frac{1}{{10d}}{{\left\| {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }})} \right\|}_1}} \right] \mathop \geqslant \limits^{①} \\ & {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }})} \right] - 100\eta GdD - \frac{{3\gamma ({\boldsymbol{\sigma }})}}{{100Gd}} - \beta - \\ & {{E}_{ - i}}\left[ {\frac{1}{{10}}\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right| + \frac{1}{{10d}}{{\left\| {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }})} \right\|}_1}} \right], \end{split} $$ where step ① uses \exp (w) \geqslant 1 + w. It follows that
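Step ① can be spelled out as follows (our restatement of the source's argument; write x = 100\eta Gd, A = {{E}_{ - i}}\left[ {{{\boldsymbol{w}}_{\max ,i}}({\boldsymbol{\sigma }})} \right], and let c \geqslant 0 collect the remaining correction terms, which are nonnegative here for \alpha ,\beta \geqslant 0). From \exp (w) \geqslant 1 + w with w = -x we get 1 - \mathrm{e}^{-x} \leqslant x, and trivially \mathrm{e}^{-x} \leqslant 1, so

```latex
(1 - \mathrm{e}^{-x})(A - D) + \mathrm{e}^{-x}(A - c)
  = A - (1 - \mathrm{e}^{-x})\,D - \mathrm{e}^{-x}\,c
  \geqslant A - xD - c ,
```

which, with x = 100\eta Gd, is exactly the bound at ①.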
\begin{split} &{{E}_{ - i}}\left[ {\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right|} \right] \leqslant \\ & \frac{1}{{9d}}{{E}_{ - i}}\left[ {{{\left\| {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }})} \right\|}_1}} \right] + \\ & \frac{{1000}}{9}\eta GdD + \frac{{{{E}_{ - i}}[\gamma ({\boldsymbol{\sigma }})]}}{{30Gd}} + \frac{{10}}{9}\beta . \end{split} (14) Since Eq. (14) holds for arbitrary {\left\{ {{\sigma _j}} \right\}_{j \ne i}}, the same bound holds for the unconditional expectation:
\begin{split} &{E}\left[ {\left| {{{\boldsymbol{w}}_{t,i}}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1,i}}({\boldsymbol{\sigma }})} \right|} \right] \leqslant \\ & \frac{1}{{9d}}{E}\left[ {{{\left\| {{{\boldsymbol{w}}_t}({\boldsymbol{\sigma }}) - {{\boldsymbol{w}}_{t + 1}}({\boldsymbol{\sigma }})} \right\|}_1}} \right] + \\ & \frac{{1000}}{9}\eta GdD + \dfrac{{{E}[\gamma ({\boldsymbol{\sigma }})]}}{{30Gd}} + \frac{{10}}{9}\beta . \end{split} (15) Substituting Eq. (15) into Eq. (11) yields the stability bound:
\begin{split}&{E}\left[{\Vert {\boldsymbol{w}}_{t}(\boldsymbol{\sigma})-{\boldsymbol{w}}_{t+1}(\boldsymbol{\sigma})\Vert }_{1}\right]\le \\& 125\eta G{d}^{2}D+\dfrac{\beta d}{20\eta G}+2\beta d+\dfrac{\alpha }{20G}\text{,}\end{split} (16) Substituting Eq. (16) into Eq. (2) gives the claim. This completes the proof.
By Theorem 4, taking \eta = \dfrac{d}{{{d^{3/2}}{T^{1/2}} - d}} yields the regret bound O\left( {{d^{3/2}}{T^{ - 1/2}} + \alpha + \beta {d^{3/2}}{T^{1/2}}} \right). Moreover, taking \alpha = O\left( {{T^{ - 1/2}}} \right) and \beta = O\left( {{T^{ - 1}}} \right) yields the regret bound O\left( {{T^{ - 1/2}}} \right). The theoretical results for online pairwise learning are summarized in Table 1.
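As a quick numeric sanity check of this rate (a sketch under illustrative assumptions, not part of the paper: G = D = 1, \alpha = T^{-1/2}, \beta = T^{-1}, and the function names below are ours), one can evaluate the right-hand side of Eq. (16) with the stated step size and confirm that it decays like T^{-1/2}:

```python
import math

def eta(d, T):
    # step size from Theorem 4: eta = d / (d^{3/2} T^{1/2} - d)
    return d / (d ** 1.5 * math.sqrt(T) - d)

def stability_bound(d, T, G=1.0, D=1.0):
    # right-hand side of Eq. (16), with alpha = T^{-1/2}, beta = T^{-1}
    alpha, beta = T ** -0.5, 1.0 / T
    e = eta(d, T)
    return (125 * e * G * d ** 2 * D
            + beta * d / (20 * e * G)
            + 2 * beta * d
            + alpha / (20 * G))

# the ratio of the bound to T^{-1/2} stays roughly constant as T grows,
# i.e. the bound is O(T^{-1/2})
for T in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    print(T, stability_bound(4, T) / T ** -0.5)
```

For d = 4 the ratio settles near 125·d^{3/2} = 1000, matching the O(d^{3/2}T^{-1/2}) leading term.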
Table 1. Comparison of Online Pairwise Learning Regret Bounds
As Table 1 shows, the O({T^{ - 1/2}}) regret bound obtained in this paper for generalized online pairwise learning with non-convex loss functions improves on the existing bounds.
4. Conclusion
This paper proposes a generalized online pairwise learning framework by introducing random noise. Based on the approximately optimal hypothesis of the approximate oracle proposed in Ref. [26], a non-convex generalized online pairwise learning algorithm is presented. Further, we study the generalization of generalized online pairwise learning and, through stability analysis, obtain a regret bound of O({T^{ - 1/2}}).
Based on the online gradient descent algorithm, the regret bound of online pairwise learning with non-convex loss functions can be studied further. In addition, the regret bound and average-stability results in this paper hold in expectation; obtaining high-probability bounds is left for future work.
Author contributions: Lang Xuancong designed the research scheme, derived the proofs, and wrote and revised the paper; Li Chunsheng supervised the writing and revision; Liu Yong proposed the research idea and designed the research scheme; Wang Mei supervised the structural design of the paper.
An aspect refers to an entity or an attribute of an entity.
Table 1 Statistics of the Twitter2015 Dataset
| Dataset | POS | NEG | Neutral | #Sentences | #Aspects |
|---|---|---|---|---|---|
| Train | 928 | 368 | 1883 | 2101 | 3179 |
| Dev | 303 | 149 | 670 | 727 | 1122 |
| Test | 317 | 113 | 607 | 674 | 1037 |

Table 2 Statistics of the Restaurant2014 Dataset
| Dataset | Level 1 | Level 2 | Level 3 | Level 4 | #Sentences | #Aspects |
|---|---|---|---|---|---|---|
| Train | 1747 | 645 | 520 | 73 | 2436 | 2985 |
| Dev | 417 | 162 | 117 | 18 | 608 | 714 |
| Test | 728 | 196 | 196 | 14 | 800 | 1134 |

Note: "Level" denotes the rating grade used to partition the dataset.

Table 3 Performance Comparison of UMAS-Text and Existing Methods on the Restaurant2014 Dataset
(values in %)

| Model | AE-F1 | SC-F1 | AESC-F1 |
|---|---|---|---|
| CMLA+TCap | 81.91 | 71.32 | 65.68 |
| DECNN+TCap | 82.79 | 71.77 | 66.84 |
| MNN | 83.05 | 68.45 | 63.87 |
| E2E-AESC | 83.92 | 68.38 | 66.60 |
| DOER | 84.63 | 64.50 | 68.55 |
| RACL | 85.37 | 74.46 | 70.67 |
| UMAS-Text | **85.58** | **76.36** | **70.70** |

Note: bold indicates the best result.

Table 4 Performance Comparison of AE on the Twitter2015 Dataset
(values in %)

| Model | AE-P | AE-R | AE-F1 |
|---|---|---|---|
| VAM | 58.10 | 56.70 | 57.39 |
| ACN | 79.10 | 71.17 | 74.92 |
| UMT | 78.50 | **79.56** | 79.02 |
| UMAS (ours) | **81.09** | 77.34 | **79.17** |

Note: bold indicates the best result.

Table 5 Performance Comparison of SC on the Twitter2015 Dataset
(values in %)

| Model | SC-ACC | SC-F1 |
|---|---|---|
| Res-RAM | 71.55 | 64.68 |
| Res-RAM-TFN | 69.91 | 61.49 |
| Res-MGAN | 71.65 | 63.88 |
| Res-MGAN-TFN | 70.30 | 64.14 |
| MIMN | 71.84 | 65.69 |
| EASFN | 73.38 | 67.37 |
| UMAS (ours) | **73.48** | **73.34** |

Note: bold indicates the best result.

Table 6 Performance Comparison of AESC on the Twitter2015 Dataset
| Model | AESC-F1/% | Runtime/s |
|---|---|---|
| ACN-ESAFN | 55.56 | 163 |
| UMT-ESAFN | 56.89 | 160 |
| UMAS (ours) | **58.05** | **10** |

Note: bold indicates the best result.

Table 7 Comparison of the Unified Model and Single-Task Models
(values in %)

| Model | AE-P | AE-R | AE-F1 | SC-ACC | SC-F1 | AESC-F1 |
|---|---|---|---|---|---|---|
| UMAS-AE | 78.30 | **80.04** | 79.16 | − | − | − |
| UMAS-SC | − | − | − | 71.26 | 70.79 | − |
| UMAS-Pipeline | 78.30 | **80.04** | 79.16 | 71.26 | 70.79 | 56.76 |
| UMAS (ours) | **81.09** | 77.34 | **79.17** | **73.48** | **73.34** | **58.05** |

Note: bold indicates the best result.

Table 8 Results of the Ablation Experiment
(values in %)

| Model | AE-P | AE-R | AE-F1 | SC-ACC | SC-F1 | AESC-F1 |
|---|---|---|---|---|---|---|
| UMAS-no_visual | 77.67 | 75.80 | 76.72 | 71.26 | 71.27 | 54.76 |
| UMAS-no_POS_features | 76.59 | 77.63 | 77.11 | 71.26 | 70.73 | 54.69 |
| UMAS-no_opinion | 75.16 | 79.36 | 77.20 | 73.00 | 72.28 | 55.44 |
| UMAS-no_self_attention | 75.87 | **79.46** | 77.63 | 71.36 | 70.51 | 55.77 |
| UMAS-no_gate_fusion | 75.30 | 78.78 | 77.02 | 71.26 | 69.75 | 55.70 |
| UMAS-special | 76.46 | 77.05 | 76.75 | 71.36 | 71.36 | 54.76 |
| UMAS-share | 75.44 | 78.78 | 77.08 | 68.27 | 67.91 | 52.55 |
| UMAS (ours) | **81.09** | 77.34 | **79.17** | **73.48** | **73.34** | **58.05** |

Note: bold indicates the best result.

Table 9 Description of the Statistics
| Statistic | Description |
|---|---|
| combine_true_special_wrong | Number of samples predicted correctly by UMAS-combine but incorrectly by UMAS-special; reflects the ability of UMAS-combine to correct UMAS-special. |
| combine_wrong_special_true | Number of samples predicted incorrectly by UMAS-combine but correctly by UMAS-special. |
| combine_true_share_wrong | Number of samples predicted correctly by UMAS-combine but incorrectly by UMAS-share. |
| combine_wrong_share_true | Number of samples predicted incorrectly by UMAS-combine but correctly by UMAS-share. |
| special_contribution | Number of samples predicted incorrectly by UMAS-share but correctly by UMAS-special; reflects the specific contribution of UMAS-special. |
| share_contribution | Number of samples predicted incorrectly by UMAS-special but correctly by UMAS-share. |
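The counts in Table 9 are straightforward to compute from model predictions; a minimal sketch (the helper function and the toy labels below are ours for illustration, not from the paper):

```python
def pairwise_error_stats(gold, pred_a, pred_b):
    """Return (#samples model a gets right and model b gets wrong,
               #samples model a gets wrong and model b gets right)."""
    a_true_b_wrong = sum(g == a and g != b for g, a, b in zip(gold, pred_a, pred_b))
    a_wrong_b_true = sum(g != a and g == b for g, a, b in zip(gold, pred_a, pred_b))
    return a_true_b_wrong, a_wrong_b_true

# toy sentiment labels
gold    = ["POS", "NEG", "NEU", "POS"]
combine = ["POS", "NEG", "POS", "POS"]  # UMAS-combine predictions (illustrative)
special = ["POS", "NEU", "NEU", "POS"]  # UMAS-special predictions (illustrative)

# combine_true_special_wrong, combine_wrong_special_true
print(pairwise_error_stats(gold, combine, special))  # (1, 1)
```

The remaining four statistics follow by swapping in the UMAS-share predictions for either argument.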
[1] Ju Xincheng, Zhang Dong, Xiao Rong, et al. Joint multi-modal aspect-sentiment analysis with auxiliary cross-modal relation detection[C] // Proc of the 26th Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 4395−4405
[2] Li Z Y, Cheng Wei, Kshetramade R, et al. Recommend for a reason: Unlocking the power of unsupervised aspect-sentiment co-extraction[C]// Proc of the 26th Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 763−778
[3] Gong Chenggong, Yu Jianfei, Xia Rui. Unified feature and instance based domain adaptation for aspect-based sentiment analysis[C] // Proc of the 25th Conf on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA: ACL, 2020: 7035−7045
[4] Cai Guoyong, Xia Binbin. Convolutional Neural Networks for Multimedia Sentiment Analysis[M] //Natural Language Processing and Chinese Computing. Cham: Springer, 2015: 159−167
[5] Zadeh A, Chen Minghai, Poria S, et al. Tensor fusion network for multimodal sentiment analysis[C] //Proc of the 22nd Conf on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA: ACL, 2017: 1103−1114
[6] Mai Sijie, Hu Haifeng, Xing Songlong. Divide, conquer and combine: hierarchical feature fusion network with local and global perspectives for multimodal affective computing[C] //Proc of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2019: 481−492
[7] Zhang Qi, Fu Jinlan, Liu Xiaoyu, et al. Adaptive co-attention network for named entity recognition in tweets[C] //Proc of the 32nd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2018: 5674−5681.
[8] Yu Jianfei, Jiang Jing, Xia Rui. Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 429−439 doi: 10.1109/TASLP.2019.2957872
[9] Lu Di, Neves L, Carvalho V, et al. Visual attention model for name tagging in multimodal social media[C] //Proc of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2018: 1990−1999
[10] Yu Jianfei, Jiang Jing, Yang Li, et al. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer[C] //Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 3342−3352
[11] Honnibal M, Montani I, Van Landeghem S, et al. spaCy: Industrial-strength natural language processing in Python[EB/OL]. [2022-05-17]. https://spacy.io
[12] Liu Lulu, Yang Yan, Wang Jie. ABAFN: Aspect-based sentiment analysis model for multimodal[J]. Computer Engineering and Applications, 2022, 58(10): 193−199 (in Chinese) doi: 10.3778/j.issn.1002-8331.2108-0056
[13] Li Ruifan, Chen Hao, Feng Fangxiang, et al. Dual graph convolutional networks for aspect-based sentiment analysis[C] //Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing. Stroudsburg, PA: ACL, 2021: 6319−6329
[14] Xiao Zeguan, Wu Jiarun, Chen Qingliang, et al. BERT4GCN: Using BERT intermediate layers to augment GCN for aspect-based sentiment classification[J]. arXiv preprint, arXiv: 2110.00171, 2021
[15] Qi Songzhe, Huang Xianying, Sun Haidong, et al. Aspect based sentiment analysis with progressive enhancement and graph convolution[J]. Application Research of Computers, 2022, 39(7): 2037−2042 (in Chinese)
[16] Han Hu, Hao Jun, Zhang Qiankun, et al. Knowledge-enhanced interactive attention model for aspect-based sentiment analysis[J/OL]. Journal of Frontiers of Computer Science and Technology, 2022 [2021-12-31]. http://fcst.ceaj.org/CN/10.3778/j.issn.1673−9418.2108082 (in Chinese)
[17] Mao Tengyue, Zheng Zhipeng, Zheng Lu. Aspect-level sentiment analysis based on improved self-attention mechanism[J]. Journal of South Central University for Nationalities: Natural Science Edition, 2022, 41(1): 94−100 (in Chinese)
[18] Sun Xiaowan, Wang Ying, Wang Xin, et al. Aspect-based sentiment analysis model based on dual-attention networks[J]. Journal of Computer Research and Development, 2019, 56(11): 2384−2395 (in Chinese) doi: 10.7544/issn1000-1239.2019.20180823
[19] Ying Chengcan, Wu Zhen, Dai Xinyu, et al. Opinion transmission network for jointly improving aspect-oriented opinion words extraction and sentiment classification[C] // Proc of the 9th CCF Int Conf on Natural Language Processing and Chinese Computing. Cham: Springer, 2020: 629−640
[20] Oh S, Lee D, Whang T, et al. Deep context- and relation-aware learning for aspect-based sentiment analysis[C] //Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing. Stroudsburg, PA: ACL, 2021: 495−503
[21] Chen Zhuang, Qian Tieyun. Relation-aware collaborative learning for unified aspect-based sentiment analysis[C] //Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 3685−3694
[22] Xu Lu, Li Hao, Lu Wei, et al. Position-aware tagging for aspect sentiment triplet extraction[C]//Proc of the 25th Conf on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA: ACL, 2020: 2339−2349
[23] Phan M H, Ogunbona P O. Modelling context and syntactical features for aspect-based sentiment analysis[C] //Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 3211−3220
[24] Xue Fang, Guo Yi, Li Zhiqiang, et al. Aspect-level sentiment analysis based on double-layer part-of-speech-aware and multi-head interactive attention mechanism[J]. Application Research of Computers, 2022, 39(3): 704−710 (in Chinese)
[25] He Ruidan, Lee W S, Ng H T, et al. An interactive multi-task learning network for end-to-end aspect-based sentiment analysis[C] //Proc of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2019: 504−515
[26] Wang Feixiang, Lan Man, Wang Wenting. Towards a one-stop solution to both aspect extraction and sentiment analysis tasks with neural multi-task learning[C/OL] //Proc of the 2018 Int Joint Conf on Neural Networks (IJCNN). Piscataway, NJ: IEEE, 2018 [2022-03-01]. https://ieeexplore.ieee.org/abstract/document/8489042
[27] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint, arXiv: 1409.1556, 2014
[28] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C] // Proc of the 31st Conf on Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 5998−6008
[29] Jeffrey P, Richard S, Christopher M. GloVe: Global vectors for word representation[C] //Proc of the 19th Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2014: 1532−1543
[30] Wu Yuanbin, Zhang Qi, Huang Xuanjing, et al. Phrase dependency parsing for opinion mining[C] //Proc of the 14th Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2009: 1533−1541
[31] Xu Hu, Liu Bing, Shu Lei, et al. Double embeddings and CNN-based sequence labeling for aspect extraction[C] //Proc of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2018: 592−598
[32] Chen Zhuang, Qian Tieyun. Transfer capsule network for aspect level sentiment classification[C] //Proc of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2019: 547−556
[33] Li Xin, Bing Lidong, Li Piji, et al. A unified model for opinion target extraction and target sentiment prediction[C] //Proc of the 33rd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2019: 6714−6721
[34] Luo Huaishao, Li Tianrui, Liu Bing, et al. DOER: Dual cross-shared RNN for aspect term-polarity co-extraction[C] //Proc of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2019: 591−601
[35] Hazarika D, Poria S, Zadeh A, et al. Conversational memory network for emotion recognition in dyadic dialogue videos[C] //Proc of the 16th Conf on Association for Computational Linguistics, North American Chapter. Stroudsburg, PA: ACL, 2018: 2122−2132
[36] Chen Peng, Sun Zhongqian, Bing Lidong, et al. Recurrent attention network on memory for aspect sentiment analysis[C]// Proc of the 22nd Conf on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA: ACL, 2017: 452−461
[37] Fan Feifan, Feng Yansong, Zhao Dongyan. Multi-grained attention network for aspect-level sentiment classification[C] //Proc of the 23rd Conf on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA: ACL, 2018: 3433−3442
[38] Xu Nan, Mao Wenji, Chen Guandan. Multi-interactive memory network for aspect based multimodal sentiment analysis[C] //Proc of the 33rd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2019: 371−378