A Survey on Multimodal Deepfake and Detection Techniques

Li Zeyu; Zhang Xuhong; Pu Yuwen; Wu Yiming; Ji Shouling

doi:10.7544/issn1000-1239.202111119

1.
College of Computer Science and Technology, Zhejiang University, Hangzhou 310007
2.
College of Control Science and Engineering, Zhejiang University, Hangzhou 310007

Author Bio:
Li Zeyu: born in 1999. Master candidate. His main research interests include AI security and computer vision

Zhang Xuhong: born in 1988. PhD, associate professor. Member of CCF. His main research interests include AI security, data-driven software and system security, big data systems and analytics

Pu Yuwen: born in 1993. PhD. His main research interests include privacy computing and AI security

Wu Yiming: born in 1996. PhD candidate. Her main research interests include data-driven security, black industry mining, and cybercrime research

Ji Shouling: born in 1986. PhD, professor, PhD supervisor. Senior member of CCF. His main research interests include data-driven security and privacy, AI security, big data mining and analytics

Corresponding author:
Ji Shouling (sji@zju.edu.cn)

CLC number: TP391
Publication history
- Received: 2021-11-11
- Revised: 2022-08-28
- Published online: 2023-03-19
- Issue date: 2023-05-31

Abstract:
With the application of deep generative models across many fields, the authenticity of generated multimedia files has become increasingly difficult to determine, and deepfake technology has emerged and developed as a result. Using deep learning techniques, deepfake technology can tamper with facial identity, expressions, and body movements in videos or images, and can generate fake speech of a specific person. Since Deepfakes sparked a wave of face swapping on social networks in 2018, a large number of deepfake methods have been proposed, demonstrating potential applications in education, entertainment, and other fields. At the same time, the negative impact of deepfakes on public opinion, judicial proceedings, and criminal investigation cannot be ignored. Consequently, a growing number of countermeasures, such as deepfake detection and watermarking, have been proposed to prevent deepfakes from being exploited by criminals. This survey first reviews and summarizes deepfake techniques for different modalities and the corresponding detection techniques, and analyzes and classifies existing research by research purpose and method. It then summarizes the video and audio datasets widely used in recent studies. Finally, it discusses the opportunities and challenges for future development in this field.

Keywords: deepfake; deepfake detection; deep learning; face replacement; generative adversarial network


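Named entity recognition (NER) aims to locate named entities in text and classify them into predefined entity types such as person, organization, and location. NER is a fundamental task in natural language processing (NLP) and supports a range of downstream applications, such as relation extraction^[1], question answering^[2], and knowledge base construction^[3-6].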

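Conventional supervised NER methods such as BERT-CRF^[7] and pointer networks^[8] rely heavily on large amounts of annotated data, and annotation is both time-consuming and labor-intensive. Distant supervision was therefore proposed to generate NER annotations automatically. Its core idea is to identify entity mentions in the text that also appear in a knowledge base (for example, the open knowledge base Wikidata) and to assign them the corresponding types. However, distant supervision introduces two kinds of noise: false negatives (FNs) and false positives (FPs)^[9]. First, because the coverage of a knowledge base is limited, not every correct entity in the text is annotated, which produces FNs. Second, because entity mentions are identified by simple string matching, the ambiguity of entities in the knowledge base can produce FPs. Figure 1 shows an example of distantly supervised annotation, in which "PRO" denotes the product type and "PER" denotes the person type. The first row is the original text, the second row is the distantly supervised annotation, and the third row is the correct annotation. In the example, the product entity "拖把" (mop) is not matched because the knowledge base is limited in size, which is an FN. In addition, "包" in the example is a measure word (pack) rather than a product, but it is matched incorrectly because of the ambiguity of the knowledge base, which is an FP.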

Figure 1. An example of distantly supervised annotation
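The string-matching annotation described above can be sketched as follows. This is a minimal illustration, not the annotation tool used in the paper: the function name `distant_label`, the greedy longest-match rule, the toy lexicon, and the example sentence are all assumptions made for this sketch.

```python
def distant_label(tokens, lexicon):
    """Assign BIO tags by greedy longest string match against a lexicon.

    tokens  : list of tokens (single characters for Chinese text)
    lexicon : dict mapping an entity surface string to its type (hypothetical)
    """
    tags = ["O"] * len(tokens)
    max_len = max((len(key) for key in lexicon), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            surface = "".join(tokens[i:i + span])
            if surface in lexicon:
                etype = lexicon[surface]
                tags[i] = "B-" + etype
                for k in range(i + 1, i + span):
                    tags[k] = "I-" + etype
                i += span
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical sentence and lexicon: "拖把" is missing from the lexicon (FN),
# while the measure word "包" is matched as a product (FP).
toy_lexicon = {"小明": "PER", "包": "PRO", "钉子": "PRO"}
toy_tags = distant_label(list("小明买了一包钉子和拖把"), toy_lexicon)
```

In this toy run both noise types appear at once: the lexicon gap leaves "拖把" tagged as `O`, and the ambiguous entry "包" yields a spurious `B-PRO`.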

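To address this noise problem in distantly supervised NER, researchers have proposed a series of noise handling methods, which fall into two categories. The first category designs sample denoising strategies during training to reduce the negative impact of noise on the model; common strategies include data clustering^[10] and negative sampling^[11-12]. However, these methods handle only FN noise and cannot resolve the FP noise produced by distant supervision. The second category designs noise filtering before training to delete noisy samples from the training set. This approach can handle both FNs and FPs, but it places high demands on filtering accuracy. Moreover, because noise filtering is characterized by trial-and-error search and delayed feedback, many researchers treat it as a decision problem and solve it with reinforcement learning. A typical approach defines different rewards and policies and trains a noise recognizer within a reinforcement learning framework^[13-14]. However, these methods detect noise at the sentence level, so they may discard the correct entity annotations in a sentence and leave the model with insufficient training data. For example, in Figure 1 the model may delete the whole sentence because of the two noisy entities "包" and "拖把", which also deletes the correct entity annotations "小明" and "钉子".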

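To this end, this paper proposes a novel reinforcement learning based method for distantly supervised NER, called RLTL-DSNER (reinforcement learning and token level based distantly supervised named entity recognition). The method accurately identifies correct instances in the noisy text produced by distant supervision and reduces the negative impact of noisy instances on distantly supervised NER. Specifically, a tag confidence function is introduced into the policy network of the reinforcement learning framework to provide a tag confidence score for every token in a sentence. In addition, this paper proposes a pre-training strategy for the NER model: pre-training stops as soon as the F1 score in the pre-training stage reaches 85% to 95%. This strategy provides accurate state representations and effective rewards for the initial stage of reinforcement learning, which helps the policy network update its parameters in the right direction early in training.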

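In summary, the main contributions of this paper are threefold:

1) A new reinforcement learning based method, RLTL-DSNER, is proposed for the distantly supervised NER task. It uses a policy network together with a tag confidence function to identify correct instances at the token level in noisy distantly supervised data, retaining as much of the correct information in each sample as possible.

2) A pre-training strategy for the NER model is proposed, which helps RLTL-DSNER update its learnable parameters in the right direction from the beginning of training and stabilizes the training process.

3) Experimental results show that RLTL-DSNER significantly outperforms state-of-the-art distantly supervised NER models on three Chinese datasets and one English medical dataset. On the NEWS dataset, it improves the F1 score by 4.28 percentage points over the best existing method.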

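1. Related Work

Traditional NER methods rely on hand-crafted features; commonly used methods include maximum entropy^[15], hidden Markov models^[16], support vector machines^[17], and conditional random fields^[18]. In recent years, deep neural networks have become the mainstream approach. They extract hidden features automatically, so researchers no longer need to focus on feature engineering.

Since the pre-trained language model BERT^[19] was proposed, it has attracted wide attention for its strong contextual word representations and its generality, and many methods use it as the encoder. Souza et al.^[7] built the BERT-CRF model, which adds a CRF layer on top of BERT to learn sentence-level constraints and improve the overall labeling of a sentence. Hao et al.^[8] used a pointer-network-based architecture, which makes the model more sensitive to entity boundaries and handles the overlapping entities that are common in practice. Beyond architecture design, many studies focus on exploring additional features. Luo et al.^[20] introduced stroke information that captures the internal structure of Chinese characters, and Xu et al.^[21] fused radical, character, and word information in Chinese text. These additional features further improve model performance.

Although the methods in [7-8, 20-21] all achieve good results on NER, they depend on large amounts of manually annotated data. To mitigate the lack of manual annotations, many researchers have proposed distantly supervised annotation methods. Shang et al.^[22] proposed the AutoNER model, which replaces the conventional BIO or BIOES scheme with a "Tie or Break" scheme. They also introduced dictionary tailoring and high-quality phrases to implement distantly supervised NER, and achieved state-of-the-art F1 scores on three benchmark datasets. Following Shang et al.^[22], Wang et al.^[23] performed string matching with the help of an incomplete dictionary to detect possible entities. In addition, they used the contextual similarity between matched entities and unmatched candidates to detect more entities. Compared with conventional distant supervision that generates annotations by exact string matching only, these methods improve data quality through dictionary expansion and modified matching strategies. However, their effectiveness depends closely on the quality of the dictionary. When the dictionary is of poor quality, the FN and FP noise produced by automatic annotation still cannot be avoided.

There are two main categories of methods for the noisy annotation problem:

1) Designing sample denoising strategies during training to reduce the negative impact of noise on the model. Gao et al.^[24] used structured knowledge from an external knowledge graph and semantic knowledge from the text corpus to design an entity-knowledge-aware word embedding that enriches sentence-level features. Lange et al.^[10] proposed clustering the input instances by their features and then computing a separate confusion matrix for each cluster. Peng et al.^[25] formulated distantly supervised NER as a positive-unlabeled learning problem, in which the positive samples consist of matched entities and the non-entity tokens form the unlabeled data. To expand the dictionary, they used a modified AdaSampling algorithm to detect possible entities iteratively. Liang et al.^[26] proposed a two-stage framework that exploits pre-trained models for distantly supervised NER. They introduced a self-training strategy that uses a fine-tuned BERT as both teacher and student, and trained the student on pseudo labels generated by the teacher. Li et al.^[11] introduced negative sampling to alleviate the effect of noisy unlabeled entities. However, these methods handle only FN noise and cannot resolve FP noise.

2) Designing noise filtering before training to delete noisy samples from the training set. Because noise filtering is characterized by trial-and-error search and delayed feedback, many researchers implement such methods with reinforcement learning. These methods exploit the decision-making ability of reinforcement learning to identify the noisy samples produced by distant supervision, and address false negative and false positive entities together. Qin et al.^[27] used the F1 score of a relation extractor as the reward of the policy network. Feng et al.^[28] computed the reward from the predicted probabilities of a relation extractor. Inspired by this work, some researchers^[13-14] combined reinforcement learning with Partial CRF, an extension of the CRF layer, for distantly supervised NER. However, the policy networks in these methods are simple and are modeled with an MLP only, so their recognition ability is weak. Moreover, they all make decisions on complete sentences, so part of the correct information in a sentence is discarded.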

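2. Method Overview

This section first gives a formal definition of the problem and then outlines the proposed reinforcement learning based method for distantly supervised NER.

2.1 Problem Definition

NER is usually modeled as a sequence labeling task, and samples are annotated with the BIO scheme. Given a text $S = \left[ {{s_1},{s_2}, \dots ,{s_n}} \right]$ , where $n$ is the number of tokens in $S$ , the goal of NER is to assign a tag sequence $T = \left[ {{t_1},{t_2}, \dots ,{t_n}} \right]$ to $S$ , where ${t_i} \in \left\{ {{{\mathrm{B}}_X},{{\mathrm{I}}_X},{\mathrm{O}}} \right\}$ . B and I denote the beginning and the continuation of an entity, respectively; X denotes the type of the entity mention; O indicates that the token does not belong to any entity. Note that the types are usually predefined. As in many studies^[13-14,29-30], the dataset for the NER task in this paper consists of a small manually annotated set $H$ and a large set $D$ obtained by distant supervision. The data sizes are given in Table 1.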

Table 1. Statistics of Datasets

Dataset	Training set (manually annotated instances)	Training set (distantly supervised instances)	Validation instances	Test instances
EC	1200	2500	400	800
NEWS	3000	3722	3328	3186
CCKS-DS	1723	5869	1024	2238
BC5CDR	4560	15000	4581	4797

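2.2 Framework

As shown in Figure 2, the proposed RLTL-DSNER model consists of two stages: a model pre-training stage and an iterative training stage.

Figure 2. The main framework of RLTL-DSNER

1) In the model pre-training stage, the NER model is pre-trained on the small manually annotated set until its F1 score on the training set reaches a threshold $\alpha$ ( $\alpha$ is typically 85% to 95%). The purpose is to help the NER model generate high-quality states and rewards for the policy network at the beginning of the iterative training stage.

2) In the iterative training stage, a token-level noise detection model is built within a deep reinforcement learning framework. Specifically, the pre-trained NER model first generates vector representations and tag probability distributions for the text, and both are fed to the policy network as the state. The policy network uses a convolutional neural network (CNN), a tag confidence function, and a multilayer perceptron (MLP) to detect noise at the token level and decide whether each token in the text is retained. For example, in Figure 2 the noisy entities "鸽子蛋" and "机械" are removed, because the product is "鸽子蛋" (pigeon egg) as a whole rather than "鸽子" (pigeon), and "机械" (mechanical) describes a specification of the product "键盘" (keyboard); the correct entities "陈明亮", "键盘", and "北京" are retained. The retained data are then merged with the manually annotated data to train the NER model jointly. At the same time, the NER model scores the retained data, and the score is used as the reward to update the parameters of the policy network. This process is iterated until a predefined number of rounds is reached.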

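3. Pre-training the NER Model

In RLTL-DSNER, the NER model is mainly used to generate states and rewards, so its performance directly affects the noise detection results. Without pre-training, the NER model usually cannot generate high-quality states and rewards for distantly supervised sentences early in iterative training, which may mislead the policy network into updating in the wrong direction.

To study the learning behavior of deep neural networks, noise is manually added to the manually annotated set of the EC dataset. Specifically, the annotated entities in a given proportion of the data are randomly replaced with other entities and treated as noisy data, and the remaining data are treated as clean data. Figure 3 shows how training proceeds with different proportions of added noise.

Figure 3. The training situation after artificially adding different proportions of noise to the dataset

Figure 3 shows that during training the F1 score on the clean data rises sharply first, and the F1 score on the noisy data rises gradually only after the F1 score on the clean data is already high. This indicates that a deep neural network usually learns simple and general data patterns first and then gradually forces itself to fit the noisy data. In other words, when the training F1 score reaches a certain threshold, the F1 score on the clean data is high while the F1 score on the noisy data is still low, and the model performs best at this point. This observation is therefore used to pre-train the NER model. Because the data used in this stage are manually annotated and contain little noise, the threshold $\alpha$ is typically set to 85% to 95%.

Given the manually annotated set $H$ , let $\left\{ {\left( {S_m^H,T_m^H} \right)} \right\}_{m = 1}^{{M_H}}$ denote the instances in $H$ , where ${M_{{H}}}$ is the size of the set, that is, the number of samples it contains, and $S_m^H$ and $T_m^H$ are the text and the tag sequence of the $m$ -th sample in $H$ . Let the NER model be denoted by $f\left( {\boldsymbol{\theta}} \right)$ , where ${\boldsymbol{\theta }}$ denotes the model parameters. Pre-training stops when the F1 score of $f\left( {\boldsymbol{\theta}} \right)$ on the instances in $H$ reaches the threshold.

This pre-training scheme resembles early stopping. The difference is that early stopping halts training when the loss on the validation set increases or when the F1 score on the training set reaches 99.9%, whereas the scheme used here is closer to "very early stopping". Compared with early stopping, it has two advantages:

1) Even a manually annotated dataset inevitably contains noisy data. When the training F1 score reaches 85% to 95%, the model has already learned most of the data patterns; further training only forces the model to memorize noisy data and harms its performance.

2) Only a small number of samples are available during pre-training. Training until the F1 score reaches 99% easily leads to overfitting, which reduces the generalization and noise detection ability of the model.

The experiments in Section 5.3 show that an NER model pre-trained in this way can separate correct samples from noisy samples, which helps the policy network update correctly at the beginning of iterative training.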
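The stopping rule of the pre-training stage can be sketched as a loop that checks the training-set F1 score after every epoch. This is only an illustration under stated assumptions: `pretrain_until_f1`, `ToyModel`, the F1 curve, and the `max_epochs` safeguard are hypothetical names and values, not part of the paper.

```python
def pretrain_until_f1(train_one_epoch, train_f1, alpha=0.90, max_epochs=100):
    """Train until the training-set F1 reaches alpha; return the epochs used.

    train_one_epoch : callable that runs one epoch on the annotated set H
    train_f1        : callable that returns the current F1 on H (0..1)
    alpha           : stopping threshold (the paper suggests 85% to 95%)
    max_epochs      : safeguard against a model that never reaches alpha
    """
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if train_f1() >= alpha:
            return epoch
    return max_epochs


class ToyModel:
    """Stand-in for an NER model whose training F1 follows a fixed curve."""

    def __init__(self, curve):
        self.curve = curve
        self.epoch = 0

    def train_one_epoch(self):
        self.epoch += 1

    def train_f1(self):
        return self.curve[min(self.epoch, len(self.curve)) - 1]


toy = ToyModel([0.50, 0.70, 0.86, 0.91, 0.97, 0.999])
epochs_used = pretrain_until_f1(toy.train_one_epoch, toy.train_f1, alpha=0.90)
```

With this toy curve the loop stops after the fourth epoch, well before the point at which the model would start to memorize the remaining samples.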

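4. Reinforcement Learning in RLTL-DSNER

This section introduces the three components of RLTL-DSNER: state, action, and reward. Unlike conventional reinforcement learning based noise filtering, RLTL-DSNER introduces a tag confidence function into the policy network, which works together with a noise discriminator to identify correct instances. Note that instances are identified at the token level rather than at the conventional sample level.

4.1 State

Because the input sentences in the training data are independent of each other, using only the sentence information as the current state can hardly satisfy a Markov decision process (MDP). RLTL-DSNER concatenates the current sentence representation obtained from the NER model with the tag probabilities, and uses the result as the state of the reinforcement learning agent. Note that the parameters of the NER model are updated with the sentences selected so far. In other words, the state at step $i$ incorporates the states and actions of the previous $i - 1$ steps. The formulation therefore satisfies the Markov property: the conditional distribution of future states depends only on the current state and not on past states, because the information in past states is already implicitly contained in the current state.

In RLTL-DSNER, the state consists of two parts: the representation of the current text and the probability that each token carries its distantly supervised tag. Specifically, given a text $S = [ {{s_1},{s_2}, \dots ,{s_n}} ]$ , $S$ is first concatenated with the special tokens $[ {{{\mathrm{cls}}} } ]$ and $[ {{{\mathrm{sep}}} } ]$ , that is, $[ {{{\mathrm{cls}}} } ];S;[ {{{\mathrm{sep}}} } ]$ , and fed into a large pre-trained language model (such as BERT). The last hidden layer of the language model, $\mathcal{{\boldsymbol{S}}} = ( {{{\boldsymbol{s}}_1},{{\boldsymbol{s}}_2}, \dots ,{{\boldsymbol{s}}_n}})$ , is then taken as the semantic representation of $S$ , where ${{{\boldsymbol{s}}}}_{i} \left(i=1,2,\dots ,n\right)$ is the hidden state of token ${s_i}$ . For the tag probability of each token ${s_i}$ , the text representation is first fed into a fully connected layer to obtain the probabilities of all tags for the token, ${\mathcal{{\boldsymbol{P}}}_{{s_i}}} = ( {{{p}_{{t_1}}},{{p}_{{t_2}}}, \dots ,{{p}_{{t_L}}}} )$ , where $L$ is the number of tag types and ${{p}_{{t_j}}}$ is the probability that ${t_j}$ is the tag of token ${s_i}$ . Then, from this distribution, the probability of the tag assigned by distant supervision is extracted for each token. This yields the tag probabilities of all tokens in the text, ${\boldsymbol{P}} = ( {{p_{{s_1}}},{p_{{s_2}}}, \dots ,{p_{{s_n}}}} )$ , where ${p_{{s_i}}}$ is the tag probability of token ${s_i}$ .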
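The second half of the state, the probability of each token's distantly supervised tag, amounts to a lookup into the per-token tag distribution. The sketch below assumes the hidden states and tag distributions have already been produced by the NER model; the function name `build_state`, the tag inventory, and the numbers are hypothetical.

```python
def build_state(hidden_states, label_probs, distant_tags, tag_to_id):
    """Return (S, P): token representations and distant-tag probabilities.

    hidden_states : one vector per token (last hidden layer of the encoder)
    label_probs   : one distribution over all L tags per token
    distant_tags  : the tag assigned to each token by distant supervision
    tag_to_id     : mapping from tag string to index in the distribution
    """
    if not (len(hidden_states) == len(label_probs) == len(distant_tags)):
        raise ValueError("all inputs must have one entry per token")
    # p_{s_i}: probability the NER model gives to the distant tag of token i.
    p = [probs[tag_to_id[tag]] for probs, tag in zip(label_probs, distant_tags)]
    return hidden_states, p


toy_tag_to_id = {"O": 0, "B-PER": 1, "I-PER": 2}
toy_hidden = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
toy_probs = [[0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.9, 0.05, 0.05]]
S_repr, P_vec = build_state(toy_hidden, toy_probs, ["B-PER", "I-PER", "O"], toy_tag_to_id)
```

A low entry in `P_vec` signals that the NER model disagrees with the distant tag of that token, which is the cue the policy network uses downstream.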

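4.2 Action

Previous reinforcement learning based noise detection usually defines keeping or discarding a whole sample as the action^[8,10,27-28], which discards a large amount of correct entity information. In RLTL-DSNER, an action ${a_i} \in \left\{ {0,1} \right\}, \left( {i = 1,2, \dots ,n} \right)$ is therefore defined for every token in the text, where ${a_i} = 0$ discards the current token and ${a_i} = 1$ retains it. To this end, the policy network is designed with two components: a noisy entity discriminator and a tag confidence (TC) function.

The noisy entity discriminator consists of a CNN and an MLP. Its input is the sentence representation $\mathcal{{\boldsymbol{S}}}$ and the tag probabilities ${\boldsymbol{P}}$ of all tokens, and its output is the probability of retaining each token. Formally,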

$\begin{split} & {\boldsymbol{\pi}} \left( {a|\mathcal{{\boldsymbol{S}}};{\boldsymbol{P}};{{\theta}} } \right) = prob\left( {a|\mathcal{{\boldsymbol{S}}};{\boldsymbol{P}};{{\theta }}} \right) =\\ &a\sigma \left( {\left( {\left( {{{\boldsymbol{W}}_{\rm{c}}} \otimes \mathcal{{\boldsymbol{S}}}} \right) \oplus {\boldsymbol{P}}} \right){{\boldsymbol{W}}_{\rm{m}}} + {\boldsymbol{b}}} \right) + \\ &\left( {1 - a} \right)\left( {1 - \sigma \left( {\left( {\left( {{{\boldsymbol{W}}_{\rm{c}}} \otimes \mathcal{{\boldsymbol{S}}}} \right) \oplus {\boldsymbol{P}}} \right){{\boldsymbol{W}}_{\rm{m}}} + {\boldsymbol{b}}} \right)} \right), \end{split}$

(1)

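where ${{\boldsymbol{W}}_{\rm{c}}}$ denotes the learnable parameters of the convolution kernel, ${\mathrm{c}}$ denotes the CNN, ${{\boldsymbol{W}}_{\rm{m}}}$ and ${\boldsymbol{b}}$ are the parameters of the linear layer, ${\mathrm{m}}$ denotes the MLP, $\sigma \left( \cdot \right)$ is the $sigmoid$ function, ${{\theta}} =\left\{{{\boldsymbol{W}}}_{{\mathrm{c}}},{{\boldsymbol{W}}}_{\rm{m}},{\boldsymbol{b}}\right\}$ is the set of policy network parameters, $a\in\left\{0,1\right\}$ is the action, $\otimes$ denotes convolution, and $\oplus$ denotes matrix concatenation. The overall computation is as follows. The sentence representation $\mathcal{{\boldsymbol{S}}}$ and the tag probabilities ${\boldsymbol{P}}$ of all tokens are the input of the noisy entity discriminator. The CNN first convolves $\mathcal{{\boldsymbol{S}}}$ to obtain an overall representation of the sentence. The result $\left( {{{\boldsymbol{W}}_{\rm{c}}} \otimes \mathcal{{\boldsymbol{S}}}} \right)$ is then concatenated with ${\boldsymbol{P}}$ and passed through the linear layer to obtain $\left( {\left( {{{\boldsymbol{W}}_{\rm{c}}} \otimes \mathcal{{\boldsymbol{S}}}} \right) \oplus {\boldsymbol{P}}} \right){{\boldsymbol{W}}_{\rm{m}}} + {\boldsymbol{b}}$ . Finally, the result is fed into the $sigmoid$ function to obtain the retention probability of each token, that is, the probabilities of actions 1 and 0.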

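In general, the noisy entity discriminator alone is not sufficient. When training samples are few and the data are imbalanced, the NER model tends to assign higher probabilities to tags that occur frequently in the samples and lower probabilities to tags that occur rarely. In other words, when the predicted probability of a rare tag rises substantially, the noisy entity discriminator may still favor another, more frequent tag (with a higher predicted probability) and ignore the relative increase.

A straightforward remedy is to normalize by the length of the text so that the relative increase in tag probability stands out. However, texts differ in length, so a single threshold cannot be defined for token selection. The TC function is therefore used to normalize the token tags. Specifically, given a batch of sentences $\left\{ {{S_1},{S_2}, \dots ,{S_m}} \right\}$ , where the $i$ -th text is ${S_i} = [ {{s_1},{s_2}, \dots ,{s_n}} ]$ , let ${p_{i,j,l}}$ be the probability that the tag of token ${s_j}( j = 1,2, \dots , n )$ is predicted as $l$ , and let ${q_l}$ be the sum of the squared probabilities that the tokens of all texts are predicted as $l$ , that is,

${q}_{l}=\displaystyle\sum _{i=1}^{m}\displaystyle\sum_{j=1}^{n}{p}_{i,j,l}^{2} ,\quad l=1,2,\dots ,L ,$

(2)

where $L$ is the number of tag types.

Then the predicted tag probabilities of every token in the batch are normalized by ${q_l}$ , and the maximum over all tags is taken as the tag confidence score of the $j$ -th token ${s_j}$ in text ${S_i}$ , defined as

$\begin{array}{*{20}{c}} {con{f_{{S_{i,j}}}} = \max \left( {\left[ {\dfrac{{p_{i,j,l}^2/{q_l}}}{{\displaystyle\sum\limits_{k = 1}^L {\left( {p_{i,j,k}^2/{q_k}} \right)} }}} \right]_{l = 1}^L} \right)} \end{array} .$

(3)

Essentially, the tag confidence score can be viewed as the maximum predicted tag probability after normalization. The normalization compensates for the insufficiency of using the noisy entity discriminator alone and highlights relative increases in tag probability.

Note that the predicted tag probability ${p_{i,j,l}}$ of token ${s_j}$ is squared both in the definition of ${q_l}$ and in the normalization. Because probabilities lie in $\left[ {0,1} \right]$ and the derivative of the square function is monotonically increasing on this interval, squaring helps select high-confidence tokens and improves the quality of the selection.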
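Equations (2) and (3) can be evaluated directly on a batch of per-token tag distributions. The sketch below is a plain-Python transcription under the assumption that sentences in a batch may differ in length; the function name `tag_confidence`, the zero-division guards, and the toy numbers are illustrative choices rather than details from the paper.

```python
def tag_confidence(batch_probs):
    """Compute the tag confidence score of every token in a batch.

    batch_probs[i][j][l] is p_{i,j,l}: the probability that token j of
    sentence i has tag l. Returns conf with conf[i][j] as in Eq. (3).
    """
    num_labels = len(batch_probs[0][0])
    # Eq. (2): q_l is the batch-wide sum of squared probabilities of tag l.
    q = [0.0] * num_labels
    for sentence in batch_probs:
        for token in sentence:
            for l, p in enumerate(token):
                q[l] += p * p
    conf = []
    for sentence in batch_probs:
        row = []
        for token in sentence:
            # Scale each squared probability by q_l, then renormalize over tags.
            scaled = [(p * p) / q[l] if q[l] > 0 else 0.0 for l, p in enumerate(token)]
            total = sum(scaled)
            row.append(max(scaled) / total if total > 0 else 0.0)
        conf.append(row)
    return conf


# One sentence, two tokens, two tags; tag 0 is frequent, tag 1 is rare.
toy_conf = tag_confidence([[[0.9, 0.1], [0.8, 0.2]]])
```

In the toy batch the second token has raw probabilities 0.8 and 0.2, yet after dividing by $q_l$ the rare tag dominates its score, which illustrates how the normalization highlights a relative increase for infrequent tags.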

For each text, the noisy entity discriminator and the TC function together determine whether each token is retained:

$a_{i,j}=\left\{\begin{aligned} & 1,\quad conf_{S_{i,j}} > \varphi\; \mathrm{and}\; \pi_{i,j}\left(1|\mathcal{\boldsymbol{S}};\boldsymbol{P};\boldsymbol{\theta}\right) > 0.5, \\ & 0,\quad\mathrm{otherwise},\end{aligned}\right.$

(4)

where $\varphi$ is a preset TC threshold.
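The decision rule in Eq. (4) is a conjunction of two threshold tests per token. A minimal sketch, assuming the confidence scores and retention probabilities have already been computed; `select_actions` and the toy values are hypothetical.

```python
def select_actions(conf, keep_prob, phi=0.9):
    """Apply Eq. (4): keep a token only if both tests pass.

    conf      : tag confidence score of each token (TC function output)
    keep_prob : pi(1 | S; P; theta) of each token (discriminator output)
    phi       : TC threshold (0.9 in the paper's experiments)
    """
    if len(conf) != len(keep_prob):
        raise ValueError("conf and keep_prob must have the same length")
    return [1 if c > phi and p > 0.5 else 0 for c, p in zip(conf, keep_prob)]


# Token 2 fails the discriminator test, token 3 fails the TC test.
toy_actions = select_actions([0.95, 0.95, 0.50], [0.80, 0.30, 0.90])
```

Because the two tests are combined with a logical AND, a token is discarded whenever either component flags it, so the TC function can catch noise that the discriminator alone would keep.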

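Figure 4 shows the action selection for a given text, where the final action "0" discards the token and "1" retains it. Distant supervision automatically annotates the original text and produces the person entity "小明" and the product entities "包" and "钉子". After the sentence representation and the tag probabilities are obtained, the policy network produces the outputs of the noisy entity discriminator and the TC function, and the results are obtained by thresholding. The output of the noisy entity discriminator is filtered with the threshold $\phi=$ 0.5, and the threshold of the TC function is user-defined (in Figure 4, $\varphi$ =0.9). According to the discriminator output ${\boldsymbol{\pi}}$ , the token "包" is discarded; according to the TC output ${\boldsymbol{conf}}$ , the tokens "拖" and "把" are discarded. Combining the two outputs, the final action discards the tokens "包", "拖", and "把". Figure 4 shows that the TC function helps identify noisy entities that the noisy entity discriminator cannot filter out, which improves the noise recognition ability of the policy network compared with using the discriminator alone.

Figure 4. An example of action selection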

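4.3 Reward

In each iteration of the policy network, after all actions for a batch of sentences have been executed, the policy network receives a reward for the whole batch. The reward $r$ depends on the performance of the NER model:

$\begin{array}{c}r=\dfrac{1}{\left|{\cal{B}}\right|}\displaystyle\sum_{S\in {\cal{B}}}\dfrac{1}{{\displaystyle \sum _{i=1}^{N}{a}_{i}}}\displaystyle\sum _{i=1}^{ N}\left({a}_{i}\;\mathrm{ln}\;{p}_{i}\left(T|S\right)\right)\end{array} ,$

(5)

where $\mathcal{B}$ is a batch of texts, that is, all texts selected at one time, $S$ is any text in the batch with length $N$ , $i$ is the index of a token in the text, and $T$ is the annotated tag sequence. The text $S$ is first fed into the NER model to obtain the probability that the predicted tag sequence equals the annotated sequence $T$ ; the action ${a_i} \in \left\{ {0,1} \right\}$ taken for the $i$ -th token then decides whether its value ${p_i}\left( {T|S} \right)$ enters the computation. The term $\displaystyle \sum\limits _{i=1}^{ N}{a}_{i}$ averages over the number of selected tokens at the sentence level. Finally, the feedback of all texts is averaged over the batch size $\left| \mathcal{B} \right|$ to obtain the final reward. Under Eq. (5), the higher the probability that the model predicts for the annotated tags of the retained tokens, the larger the reward, which measures how correct the selected actions are. The policy network is updated with the REINFORCE algorithm^[31]:

$\begin{array}{*{20}{c}} {{\boldsymbol{\theta}} \leftarrow {\boldsymbol{\theta}} + \eta r\dfrac{\partial }{{\partial {\boldsymbol{\theta}} }}\ln {\boldsymbol{\pi}} \left( {a|\mathcal{{\boldsymbol{S}}};{\boldsymbol{P}};{\boldsymbol{\theta}} } \right)} \end{array} ,$

(6)

where ${\boldsymbol{\theta}}$ denotes the learnable parameters of the policy network, $\eta$ is the learning rate (a hyperparameter), $\dfrac{\partial }{{\partial {\boldsymbol{\theta}} }}$ denotes the gradient with respect to ${\boldsymbol{ \theta}}$ , and ${\boldsymbol{\pi}} \left( {a|\mathcal{{\boldsymbol{S}}};{\boldsymbol{P}};{\boldsymbol{\theta}}} \right)$ is the output of the policy network for the sentence representation $\mathcal{{\boldsymbol{S}}}$ and the tag probabilities ${\boldsymbol{P}}$ of all tokens in the sentence.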
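Equations (5) and (6) can be illustrated with a short numeric sketch. Two simplifications are assumed here and are not taken from the paper: a sentence in which no token is kept contributes zero to the reward, and the update is applied directly to per-token logits $z$ of a sigmoid policy (for which $\partial \ln \pi(a) / \partial z = a - \sigma(z)$) instead of to the CNN and MLP weights. The names `batch_reward` and `reinforce_step` are hypothetical.

```python
import math


def batch_reward(batch):
    """Eq. (5): average log-probability of the kept tokens' annotated tags.

    batch : list of (actions, probs) pairs, one per sentence, where
            actions[i] is 0 or 1 and probs[i] is the NER probability of
            the annotated tag of token i.
    """
    total = 0.0
    for actions, probs in batch:
        kept = sum(actions)
        if kept == 0:
            continue  # assumption: an empty selection contributes 0
        total += sum(math.log(p) for a, p in zip(actions, probs) if a == 1) / kept
    return total / len(batch)


def reinforce_step(logits, actions, reward, lr):
    """Eq. (6) on per-token logits: z <- z + lr * r * (a - sigmoid(z))."""
    updated = []
    for z, a in zip(logits, actions):
        keep_prob = 1.0 / (1.0 + math.exp(-z))
        updated.append(z + lr * reward * (a - keep_prob))
    return updated


toy_reward = batch_reward([([1, 0], [0.5, 0.1]), ([1, 1], [1.0, 1.0])])
toy_logits = reinforce_step([0.0], [1], reward=-1.0, lr=0.1)
```

Since the reward is a mean log-probability it is never positive; selections whose kept tokens are well supported by the NER model receive a reward closer to zero and are therefore penalized less.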

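5. Experiments

This section first introduces the datasets, baseline models, evaluation metrics, and parameter settings. It then compares the results of different models on Chinese and English datasets in detail. Finally, it analyzes the model through ablation experiments and a study of NER model pre-training, and presents a case study.

5.1 Experimental Setup

1) Datasets. Three Chinese datasets, EC^[13], NEWS^[13], and CCKS-DS, and one English NER dataset, BC5CDR^[32], are used. The four datasets are described below.

① EC is a Chinese benchmark dataset with five tag types: brand (pp), product (cp), model (xh), material (yl), and specification (gg).

② NEWS is a Chinese benchmark dataset. It is generated from MSRA^[33] and has only one entity type: person (PER).

③ CCKS-DS is built from an open-source Chinese clinical dataset named CCKS2017. It contains five types of medical entities: examination and test, disease and diagnosis, symptom and sign, treatment, and body part.

About 1700 instances are extracted from the CCKS2017 dataset as the manually annotated training set. The remaining 5800 or so raw sentences are collected as the distantly supervised set and annotated by distant supervision. The knowledge base used for distant supervision consists of all distinct entities in the manually annotated training set.

④ BC5CDR is an English benchmark dataset in the biomedical domain. It contains two types of entities: disease and chemical. A total of 15000 texts are selected from the raw text corpus provided by Shang et al.^[22], and the dictionary they provide is used to annotate these texts automatically by distant supervision.

The statistics of the four datasets are shown in Table 1. Each dataset contains a small manually annotated set and a set generated by distant supervision.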

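2) Baseline models. The following methods are compared: DSNER^[13], NER+PA+RL^[14], LexiconNER^[25], Span-based+SL^[34], NegSampling-NER^[11], NegSampling-variant^[12], MTM-CW^[35], BioFLAIR^[36], and Spark-Biomedical^[37].

① DSNER and NER+PA+RL both use partial annotation learning to handle incomplete annotations, and design a reinforcement learning based instance selector that filters noise at the sentence level.

② LexiconNER formulates distantly supervised NER as a positive-unlabeled learning problem and uses a modified AdaSampling algorithm to detect possible entities iteratively, which lowers the requirement on dictionary quality.

③ NegSampling-NER adopts a negative sampling strategy during training to reduce the effect of unlabeled entities.

④ NegSampling-variant builds on negative sampling and uses an adaptively weighted sampling distribution to handle missampling and uncertainty.

⑤ Span-based+SL uses span-level features to update the dictionary used for distant supervision.

⑥ MTM-CW models character-level features with a reusable BiLSTM layer and exploits multi-task learning to address the lack of supervised data.

⑦ BioFLAIR is a pooled contextual embedding model pre-trained on additional biomedical text.

⑧ Spark-Biomedical uses a hybrid architecture of bidirectional LSTM and CNN to detect word-level and character-level features automatically.

⑨ RLTL-DSNER (sentence level) is a variant of the proposed RLTL-DSNER. It is based on the same model architecture but identifies correct instances at the sentence level. The TC function is changed to Eq. (7), which takes the minimum tag confidence score over the tokens of a sentence as the overall tag prediction score of the sentence.

$\begin{array}{*{20}{c}} {con{f_S} = \mathop {\min }\limits_{{s_i}} \left( {con{f_{{s_i}}}} \right)} \end{array} .$

(7)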

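3) Evaluation metrics. Three metrics are reported: precision (P), recall (R), and F1 score (F1). Note that a predicted entity is counted as correct only when it exactly matches an annotated entity. During training, the parameters with the highest F1 score on the validation set are saved, and their metrics on the test set are reported.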
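Exact-match evaluation can be sketched by extracting entity spans from BIO tag sequences and comparing the gold and predicted span sets. The helper names `extract_entities` and `prf` are hypothetical, and the sketch scores a single sentence; a corpus-level score would accumulate the counts over all sentences before computing the ratios.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO sequence (end exclusive)."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                entities.add((start, i, etype))
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
            else:
                start, etype = None, None  # "O" or a stray "I-" tag
    return entities


def prf(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 with exact span and type match."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# The predicted product span is one token too short, so it does not count.
toy_gold = ["B-PER", "I-PER", "O", "B-PRO", "I-PRO"]
toy_pred = ["B-PER", "I-PER", "O", "B-PRO", "O"]
toy_p, toy_r, toy_f1 = prf(toy_gold, toy_pred)
```

The toy case shows why exact matching is strict: a prediction that finds the right entity but misses one boundary token is counted as both a false positive and a false negative.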

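4) Parameter settings. The same parameter settings are used for every dataset. In the first stage, the training F1 score is capped at 90%. In the second stage, the optimizer is stochastic gradient descent; the learning rate of both the policy network and the NER model is $1 \times {10^{ - 5}}$ ; the dropout of every network layer is set to 0.3 and the number of iterations is set to 80; the confidence threshold $\varphi$ in Eq. (4) is set to 0.9. BIO tagging is used as the annotation scheme.

For the BC5CDR dataset, "allenai/sciBERT-scivocab-uncased^[38]" is used as the pre-trained language model (PLM). For the other datasets, the PLM is "BERT-base-chinese". The reported results are averaged over five runs to reduce randomness.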

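5.2 Model Comparison

To verify the effectiveness of the model, experiments are conducted on two general-domain datasets, EC and NEWS. The results are shown in Table 2 and Table 3, from which three conclusions can be drawn: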

Table 2. Main Results on EC Dataset %

Model	F1	P	R
DSNER	61.45	61.57	61.33
NER+PA+RL	63.56	61.86	65.35
LexiconNER	61.22	-	-
Span-based+SL	65.70	67.55	63.94
NegSampling-NER	66.17	-	-
NegSampling-variant	67.03	-	-
RLTL-DSNER (ours, sentence level)	68.47	67.75	69.21
RLTL-DSNER (ours)	69.34	68.36	70.35

Table 3. Main Results on NEWS Dataset %

Model	F1	P	R
DSNER	79.22	76.95	81.63
NER+PA+RL	80.04	79.88	80.20
LexiconNER	77.98	-	-
Span-based+SL	85.23	85.63	84.84
NegSampling-NER	85.39	-	-
NegSampling-variant	86.15	-	-
RLTL-DSNER (ours, sentence level)	87.95	87.98	87.92
RLTL-DSNER (ours)	90.43	90.01	90.87

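1) The proposed RLTL-DSNER achieves the best performance. In particular, it improves the F1 score by 2.31 percentage points on the EC dataset and by 4.28 percentage points on the NEWS dataset over the strongest baseline (NegSampling-variant).

2) Compared with sentence-level noise filtering methods (such as DSNER and NER+PA+RL), the proposed noise filtering method performs better even under the sentence-level selection strategy, which demonstrates the effectiveness of the TC function introduced into the policy network.

3) RLTL-DSNER performs better than RLTL-DSNER (sentence level), which shows that identifying correct instances at the token level retains as much of the correct information in each sample as possible and improves model performance.

In addition, to further verify the generality of the model, experiments are conducted on two medical-domain datasets, CCKS-DS (Chinese) and BC5CDR (English). The results are shown in Table 4 and Table 5, from which two conclusions can be drawn: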

Table 4. Main Results on CCKS-DS Dataset %

Model	F1	P	R
NER+PA+RL	78.38	79.56	77.23
NegSampling-NER	82.72	83.21	82.24
RLTL-DSNER (sentence level)	83.97	79.76	88.66
RLTL-DSNER	84.97	81.47	88.77

Table 5. Main Results on BC5CDR Dataset %

Model	F1	P	R
MTM-CW	88.78	89.10	88.47
NER+PA+RL	88.01	87.00	89.04
BioFLAIR	89.42	-	-
Spark-Biomedical	89.73	-	-
RLTL-DSNER (sentence level)	88.92	88.72	89.13
RLTL-DSNER	90.21	89.64	90.78

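1) On both the Chinese and the English dataset, RLTL-DSNER reaches a new state-of-the-art F1 score, which shows that the model adapts well across languages.

2) Compared with models designed for the medical domain, such as BioFLAIR and Spark-Biomedical, RLTL-DSNER still achieves a small improvement in F1 score, which shows that the model adapts well across domains.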

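5.3 Analysis

This section verifies the effectiveness of each module of the model through ablation experiments, and further verifies the effectiveness of the pre-training scheme.

1) Ablation study. Ablation experiments are conducted on the four datasets under the following settings:

① The RL framework is not used, and only the manually annotated data are used as the training set to train the NER model, denoted "baseline: H";

② Both the manually annotated and the distantly supervised data are used as the training set without the RL framework, denoted "baseline: H+D";

③ The pre-training strategy is not used, that is, the second-stage iterative training starts only after the F1 score on the manually annotated data reaches nearly 100%, denoted "w/o HT".

The results are shown in Table 6, from which two conclusions can be drawn: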

Table 6. Ablation Study %

Model	Dataset	F1	P	R
baseline:H	EC	68.03	67.11	68.97
baseline:H+D		63.15	66.95	59.76
w/o HT		68.81	68.32	69.30
RLTL-DSNER (ours)		69.34	68.36	70.35
baseline:H	NEWS	87.34	87.09	87.58
baseline:H+D		81.86	84.28	79.58
w/o HT		88.73	88.43	89.04
RLTL-DSNER (ours)		90.43	90.01	90.87
baseline:H	CCKS-DS	80.25	75.63	85.47
baseline:H+D		70.85	63.33	80.39
w/o HT		83.95	80.75	87.42
RLTL-DSNER (ours)		84.97	81.47	88.77
baseline:H	BC5CDR	86.47	84.40	88.65
baseline:H+D		87.79	88.03	87.55
w/o HT		89.77	88.58	91.00
RLTL-DSNER (ours)		90.21	89.64	90.78

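① On all four datasets, RLTL-DSNER achieves the best F1 score and precision, and the best recall on all datasets except BC5CDR, where w/o HT is slightly higher (91.00 versus 90.78). This shows that every module of the model (NER model pre-training, distantly supervised data, and token-level noise detection) is important.

② Among the three baselines, baseline:H+D performs worst on the three Chinese datasets, which indicates that the data generated automatically by distant supervision contain a large number of noisy instances. In particular, compared with baseline:H, its F1 score drops by 9.40 percentage points on the CCKS-DS dataset. On the BC5CDR dataset, by contrast, its F1 score improves by 1.32 percentage points. This is because the dictionary provided by Shang et al.^[22] is used for automatic annotation; the dictionary is of high quality and introduces little noise, so model performance is not greatly affected.

2) Effectiveness of pre-training the NER model. To demonstrate the effectiveness of the pre-training scheme, training the NER model to an F1 score of 90% is compared with training it to an F1 score of nearly 100%. For both settings, the F1 scores are measured on the test set during the first 20 iterations of iterative training. The results are shown in Figure 5, from which two conclusions can be drawn: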

Figure 5. Initial training performance of the same model under different strategies

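① With the proposed pre-training scheme, training of the RL model is relatively stable, and only a small performance drop occurs on the NEWS dataset. This shows that the pre-training scheme avoids overfitting and provides high-quality text representations and rewards for the RL model early in training.

② When the NER model is trained to nearly 100%, RL training is very unstable. A severe performance drop occurs on all four datasets. On the EC, NEWS, and BC5CDR datasets, training stabilizes after 5 iterations, whereas on the CCKS-DS dataset it stabilizes only after 10 iterations. This is because the NER model overfits the small manually annotated set and memorizes many training samples. In addition, the model also learns some of the annotation noise that inevitably exists in the manually annotated data. As a result, the sentence representations and rewards it generates are of low quality.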

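5.4 Case Study

This section uses concrete data instances and model predictions to further illustrate the effectiveness of the proposed RLTL-DSNER.

Figure 6 shows seven examples of noise detection in the distantly supervised data. The action column gives the output of the model under the sentence-level action selection strategy: action "0" discards the sentence and action "1" retains it.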

Figure 6. Instances selection examples for the distantly supervised data

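Figure 6 shows that the proposed model accurately identifies FNs such as "梁连起" (person), "等大等圆" (symptom and sign), "全脂" (product), "农夫山泉" (brand), and "天然" (product), as well as FPs such as "金灿灿" (no type) and "面色" (no type). These examples show that the proposed method detects noise precisely at the token level, selects correct entities, discards noisy entities, and retains as much of the correct information in each sample as possible.

In addition, the predictions of the sentence-level selection strategy on the same examples show that this strategy discards a lot of correct information, such as "纯牛奶" (product) in the fifth sentence and "矿泉水" (product) in the seventh sentence. It also causes the model to learn a lot of noisy information, such as "梁连起" in the first sentence and "面色" (body part) in the fourth sentence, which degrades model performance.

Figure 7 shows some manually annotated instances from the three Chinese datasets. The entities "厨房纸" (product), "王太守则" (person), "肠管" (body part), and "干湿性啰音" (examination and test) are not annotated. This shows that manual annotation is time-consuming and labor-intensive, yet a small number of noisy entities caused by human oversight or by differences in judgment among annotators are still unavoidable, which again supports the effectiveness of the proposed NER model pre-training strategy.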

Figure 7. Instances of manual annotation data

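6. Conclusion

This paper proposes RLTL-DSNER, a new method for the noisy annotation problem in distantly supervised NER. It introduces a TC function into the policy network of the reinforcement learning framework to provide a tag confidence score for every token in a sentence, and uses a token-level instance selection strategy to retain as much of the correct information in each sample as possible and to reduce the negative impact of noisy instances on distantly supervised NER. In addition, this paper proposes a pre-training strategy for the NER model, which provides accurate state representations and effective rewards for the initial stage of reinforcement learning and helps the policy network update its parameters in the right direction early in training. Extensive experiments on three Chinese datasets and one English medical dataset verify the superiority of RLTL-DSNER. On the NEWS dataset, it improves the F1 score by 4.28 percentage points over the best existing method.

Author contributions: Wang Jiacheng and Wang Kai designed the algorithm and the experimental plan, carried out the experiments, and wrote the paper; Wang Haofen provided guidance on writing and technical support; Du Wen and He Zhidong surveyed the related literature, organized the experimental data, and discussed the plan; Ruan Tong designed the framework of the paper and planned its overall content; Liu Jingping provided guidance on writing and refined the experimental plan.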

Figure 1. Illustration of principle of pix2pix and CycleGAN

Figure 2. Schematic diagram of the network structure of Faceshifter^[21]

Figure 3. Systematic structure of Tacotron2^[79]

Figure 4. Structure diagram of multi-attention deepfake detection method^[140]

Figure 5. Network structure of deepfake adversarial examples generation method^[158]

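Table 1 Summary of Image and Video Fake Detection Methods

Detection method	Characteristics	Applicable scenarios	Experimental datasets	Detection performance	Backbone network
Exploiting Visual Artifact^[92]	Detects forgeries from features such as teeth, eyes, and facial contours	Deepfake videos generated by the Deepfakes and face2face methods	FaceForensics	0.866 (AUC)	Logistic regression, multilayer perceptron
FDFL^[95]	Uses frequency-domain features; easy to optimize	Detecting forged images and videos such as face swapping and face reenactment	FaceForensics++^[12]	0.994 (ACC), 0.997 (AUC)	CNN
Generalizing Face Forgery Detection^[96]	Exploits high-frequency image noise; strong generalization	Detecting images generated by unknown forgery methods; scenarios that require highly generalizable detection	FaceForensics++^[12]	0.994 (AUC)	CNN, attention mechanism
Face x-ray^[99]	High generalization	Scenarios that require highly generalizable detection	FaceForensics++^[12], DFDC^[148], celebDF^[149]	0.985 (AUC, FF++), 0.806 (generalization AUC, celebDF), 0.809 (generalization AUC, DFDC)	CNN
LRNet^[106]	Identifies forged videos from inter-frame temporal features; strong robustness	Detecting deepfake videos that are compressed or corrupted	FaceForensics++^[12], celebDF^[149]	0.957 (AUC, FF++, c40 compression), 0.554 (AUC, celebDF, c40 compression)	CNN+RNN
Exposing Inconsistent Head Poses^[110]	Judges whether a video is forged by detecting head poses	Deepfake video detection	Self-built dataset	0.974 (AUC)	SVM
F₃-Net^[116]	Deepfake detection based on frequency-domain features	Detecting compressed forged videos	FaceForensics++^[12]	0.958 (AUC)	CNN
Two-branch Recurrent Network^[117]	Fuses RGB-domain information with high-frequency information from the frequency domain	Deepfake video detection	FaceForensics++^[12], DFDC^[148], celebDF^[149]	0.987 (AUC, frame level), 0.991 (AUC, video level)	CNN+LSTM
Id-reveal^[122]	Detects forgery by comparing the facial identity in the test video with that in reference videos	Deepfake video detection when reference videos of the target person are available	DFD^[150]	0.86 (AUC)	CNN
Emotions Don't Lie^[127]	Detects forgery from discrepancies between multimodal emotion cues	Detecting deepfake videos with audio	DF-TIMIT^[151], DFDC^[148]	0.844 (AUC, DFDC)	CNN
Fakespotter^[131]	Detects forged videos with neural network interpretability methods	Deepfake detection for GANs and other generative models	Celeb-DF v2^[152]	0.668 (AUC)	Deep face recognition model
On the Detection of Digital Face Manipulation^[132]	Attention-based deepfake detection	Scenarios that require visualization of forged regions	Self-built dataset	0.997 (AUC)	CNN, attention mechanism
FReTal^[137]	Uses knowledge distillation and transfer learning to detect newly emerging forgery methods	Detecting relatively new forgery methods	FaceForensics++^[12]	0.925 (generalization AUC)	CNN
Multi-attentional deepfake detection^[140]	Aggregates high-level semantic information and low-level texture information	Image and video deepfake detection	FaceForensics++^[12], DFDC^[148], celebDF^[149]	0.993 (AUC, FF++)	CNN, attention mechanism
CviT^[145]	Introduces a vision transformer for deepfake detection	Image and video deepfake detection	FaceForensics++^[12], DFDC^[148]	0.915 (ACC, DFDC)	CNN + vision transformer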

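Table 2 Deepfake Video and Image Datasets

Dataset	Release year	Forgery methods	Description	Size	Real-to-fake sample ratio
DFD^[150]	2019	Deepfakes	All manipulated videos are provided at three compression levels: C₀, C₂₃, and C₄₀	363 original videos, 3068 manipulated videos, 28 actors, and 16 scenes	1∶8.45
Deepfake-TIMIT^[151]	2018	FaceSwap-GAN	Built by swapping similar faces selected from the VidTIMIT database	320 videos, each in a high-quality (128×128) and a low-quality (64×64) version	1∶1
DFDC (deepfake detection challenge) Preview^[148]	2019	Unknown	Dataset used in the DFDC preliminary round	5214 videos	1∶3.57
DFDC^[170]	2020	8 forgery methods	Dataset used in the DFDC competition	119154 videos	1∶5.26
FaceForensics++ (FF++)^[12]	2019	Deepfakes, FaceSwap, Face2face, Neuraltexture, faceshifter	Another dataset released by Google; its predecessor is FaceForensics, and it is still being updated	6000 videos	1∶5
Celeb-DF^[149]	2020	Deepfakes	Relatively few videos; followed by Celeb-DF v2^[152] and DFGC (deepfake game competition)^[171]	590 real videos, 5639 fake videos	1∶9.56
Wild Deepfake^[172]	2020	Collected from the Internet	Forged dataset collected from the Internet, with good visual quality	707 fake videos, 100 actors
DeeperForensics 1.0^[173]	2020	deepfake-VAE	Large deepfake dataset covering many lighting conditions and face angles; uses an improved generation method and is more realistic than earlier datasets	60000 videos, 17.6 million frames	1∶5
Video Forensics HQ^[174]	2020	Neural Textures	High-definition forged video dataset
FFIW-10K^[175]	2021	3 synthesis methods	Multiple possibly manipulated faces appear in the same video clip, with 3.15 faces per frame on average	10000 real videos and 10000 manipulated videos	1∶1
ForgeryNet^[176]	2021	15 synthesis methods (7 image-level, 8 video-level)	Very large dataset supporting multiple tasks (6.3 million classification labels, 2.9 million manipulated-region annotations, and 221247 spatio-temporal forgery segment labels)	2.9 million images, 221247 videos	Videos 1∶1.22, images 1∶1.01
FakeAVCeleb^[177]	2021	5 forgery methods	Multimodal dataset; fake videos include audio	25500 videos	1∶51.02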

Table 3 Spatial Forgery Localization Detection Task in ForgeryNet^[176]

Forgery method	Image sample	Detection result annotation
Face swapping
Face reenactment
Facial attribute editing
Real face

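Table 4 Deepfake Audio Datasets

Dataset	Release year	Description	Size
ASVspoof 2015^[178]	2015	Speech synthesis and conversion	16651 original audio clips, 246500 synthesized or converted audio clips
ASVspoof 2017^[179]	2017	Recording replay	3566 non-replayed audio clips, 14466 replayed audio clips
ASVspoof 2019^[180]	2019	Recording replay, speech synthesis and conversion	15928 original audio clips, 117996 synthesized or converted audio clips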

References (181)

[1]	Mirsky Y, Lee W. The creation and detection of deepfakes: A survey[J]. ACM Computing Surveys, 2021, 54(1): 264−263
[2]	Kingma D P, Welling M. Auto-encoding variational Bayes[J]. arXiv preprint, arXiv: 1312.6114, 2013
[3]	Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[C] //Proc of the 27th Int Conf on Neural Information Processing Systems. La Jolla, CA : NIPS, 2014: 2672−2680
[4]	Isola P, Zhu Junyan, Zhou Tinghui, et al. Image-to-image translation with conditional adversarial networks[C] //Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 1125−1134
[5]	Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation[C] //Proc of the 18th Int Conf on Medical Image Computing and Computer-assisted Intervention. Berlin: Springer, 2015: 234−241
[6]	Wang Tingchun, Liu Mingyu, Zhu Yanjun, et al. High-resolution image synthesis and semantic manipulation with conditional GANs[C] //Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 8798−8807
[7]	Wang Tingchun, Liu Mingyu, Zhu Yanjun, et al. Video-to-video synthesis[J]. arXiv preprint, arXiv: 1808.06601, 2018
[8]	Zhu Junyan, Park T, Isola P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks[C] //Proc of the 30th IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 2223−2232
[9]	Huang Gao, Liu Zhuang, Van Der Maaten L, et al. Densely connected convolutional networks[C] //Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 4700−4708
[10]	He Kaiming, Zhang Xiangyu, Ren Shaoqing, et al. Deep residual learning for image recognition[C] //Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 770−778
[11]	Chollet F. Xception: Deep learning with depthwise separable convolutions [C] //Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 1251−1258
[12]	Rossler A, Cozzolino D, Verdoliva L, et al. FaceForensics++: Learning to detect manipulated facial images [C] //Proc of the 17th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 1−11
[13]	Dale K, Sunkavalli K, Johnson M K, et al. Video face replacement [J]. ACM Transactions on Graphics, 2011, 30(6): 8: 1−8: 10
[14]	torzdf. Deepfakes [CP/OL]. 2017 [2021-10-15]. https://github.com/deepfakes/faceswap
[15]	Korshunova I, Shi Wenzhe, Dambre J, et al. Fast face-swap using convolutional neural networks[C] //Proc of the 16th IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 3677−3685
[16]	Ulyanov D, Lebedev V, Vedaldi A, et al. Texture networks: Feed-forward synthesis of textures and stylized images[C] //Proc of the 33rd Int Conf on Machine Learning. New York: PMLR, 2016: 1349−1357
[17]	Shaoanlu. Faceswap-GAN [CP/OL]. 2017 [2021-10-15]. https://github.com/shaoanlu/faceswap-GAN
[18]	Natsume R, Yatagawa T, Morishima S. FsNet: An identity-aware generative model for image-based face swapping[C] //Proc of the 14th Asian Conf on Computer Vision. Berlin: Springer, 2018: 117−132
[19]	Natsume R, Yatagawa T, Morishima S. RSGAN: Face swapping and editing using face and hair representation in latent spaces[J]. arXiv preprint, arXiv: 1804.03447, 2018
[20]	Nirkin Y, Keller Y, Hassner T. FSGAN: Subject agnostic face swapping and reenactment[C] //Proc of the 17th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 7184−7193
[21]	Li Lingzhi, Bao Jianmin, Yang Hao, et al. Faceshifter: Towards high fidelity and occlusion aware face swapping[J]. arXiv preprint, arXiv: 1912.13457, 2019
[22]	Chen Renwang, Chen Xuanhong, Ni Bingbing, et al. Simswap: An efficient framework for high fidelity face swapping[C] //Proc of the 28th ACM Int Conf on Multimedia. New York: ACM, 2020: 2003−2011
[23]	Zhu Yuhao, Li Qi, Wang Jian, et al. One shot face swapping on megapixels [C] //Proc of the 18th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 4834−4844
[24]	Lin Yuan, Lin Qian, Tang Feng, et al. Face replacement with large-pose differences[C] //Proc of the 20th ACM Int Conf on Multimedia. New York: ACM, 2012: 1249−1250
[25]	Min Feng, Sang Nong, Wang Zhefu. Automatic face replacement in video based on 2D morphable model[C] //Proc of the 20th Int Conf on Pattern Recognition. Piscataway, NJ: IEEE, 2010: 2250−2253
[26]	Moniz J R A, Beckham C, Rajotte S, et al. Unsupervised depth estimation, 3D face rotation and replacement[J]. arXiv preprint, arXiv: 1803.09202, 2018
[27]	Thies J, Zollhofer M, Niessner M, et al. Real-time expression transfer for facial reenactment[J]. ACM Transactions on Graphics, 2015, 34(6): 183: 1−183: 4
[28]	Thies J, Zollhofer M, Stamminger M, et al. Face2Face: Real-time face capture and reenactment of rgb videos[C] //Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 2387−2395
[29]	Thies J, Zollhofer M, Theobalt C, et al. Headon: Real-time reenactment of human portrait videos[J]. ACM Transactions on Graphics, 2018, 37(4): 164: 1−164: 13
[30]	Kim H, Garrido P, Tewari A, et al. Deep video portraits[J]. ACM Transactions on Graphics, 2018, 37(4): 163: 1−163: 14
[31]	Nagano K, Seo J, Xing Jun, et al. PaGAN: Real-time avatars using dynamic textures[J]. ACM Transactions on Graphics (TOG), 2018, 37(6): 258: 1−258: 12
[32]	Geng Jiahao, Shao Tianjia, Zheng Youyi, et al. Warp-guided GANs for single-photo facial animation[J]. ACM Transactions on Graphics, 2018, 37(6): 231: 1−231: 12
[33]	Wang Yaohui, Bilinski P, Bremond F, et al. Imaginator: Conditional spatio-temporal GAN for video generation[C] //Proc of the 20th IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2020: 1160−1169
[34]	Siarohin A, Lathuiliere S, Tulyakov S, et al. Animating arbitrary objects via deep motion transfer[C] //Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 2377−2386
[35]	Siarohin A, Lathuiliere S, Tulyakov S, et al. First order motion model for image animation[C] //Proc of the 32nd Int Conf on Neural Information Processing Systems. La Jolla, CA: NIPS, 2019: 7137−7147
[36]	Qian Shengju, Lin K Y, Wu W, et al. Make a face: Towards arbitrary high fidelity face manipulation[C] //Proc of the 32nd IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 10033−10042
[37]	Song Linsen, Wu W, Fu Chaoyou, et al. Pareidolia face reenactment[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 2236−2245
[38]	Pumarola A, Agudo A, Martinez A M, et al. GANimation: Anatomically-aware facial animation from a single image[C] //Proc of the 15th European Conf on Computer Vision (ECCV). Berlin: Springer, 2018: 818−833
[39]	Tripathy S, Kannala J, Rahtu E. FACEGAN: Facial attribute controllable reenactment GAN[C] //Proc of the 21st IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2021: 1329−1338
[40]	Gu Kuangxiao, Zhou Yuqian, Huang T. FLNet: Landmark driven fetching and learning network for faithful talking facial animation synthesis[C] //Proc of the 34th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2020: 10861−10868
[41]	Xu Runze, Zhou Zhiming, Zhang Weinan, et al. Face transfer with generative adversarial network[J]. arXiv preprint, arXiv: 1710.06090, 2017
[42]	Bansal A, Ma Shugao, Ramanan D, et al. Recycle-GAN: Unsupervised video retargeting[C] //Proc of the 15th European Conf on Computer Vision (ECCV). Berlin: Springer, 2018: 119−135
[43]	Wu W, Zhang Yunxuan, Li Cheng, et al. ReenactGAN: Learning to reenact faces via boundary transfer[C] //Proc of the 15th European Conf on Computer Vision (ECCV). Berlin: Springer, 2018: 603−619
[44]	Zhang Jiangning, Zeng Xianfang, Wang Mengmeng, et al. FReeNet: Multi-identity face reenactment[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 5326−5335
[45]	Zhang Jiangning, Zeng Xianfang, Pan Yusu, et al. FaceSwapNet: Landmark guided many-to-many face reenactment[J]. arXiv preprint, arXiv: 1905.11805, 2019
[46]	Tripathy S, Kannala J, Rahtu E. ICface: Interpretable and controllable face reenactment using GANs[C] //Proc of the 20th IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2020: 3385−3394
[47]	Wiles O, Koepke A, Zisserman A. X2Face: A network for controlling face generation using images, audio, and pose codes[C] //Proc of the 15th European Conf on Computer Vision (ECCV). Berlin: Springer, 2018: 670−686
[48]	Shen Yujun, Luo Ping, Yan Junjie, et al. FaceID-GAN: Learning a symmetry three-player GAN for identity-preserving face synthesis[C] //Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 821−830
[49]	Shen Yujun, Zhou Bolei, Luo Ping, et al. FaceFeat-GAN: A two-stage approach for identity-preserving face synthesis[J]. arXiv preprint, arXiv: 1812.01288, 2018
[50]	Wang Tingchun, Liu Mingyu, Tao A, et al. Few-shot video-to-video synthesis[J]. arXiv preprint, arXiv: 1910.12713, 2019
[51]	Zakharov E, Shysheya A, Burkov E, et al. Few-shot adversarial learning of realistic neural talking head models[C] //Proc of the 32nd IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 9459−9468
[52]	Burkov E, Pasechnik I, Grigorev A, et al. Neural head reenactment with latent pose descriptors[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 13786−13795
[53]	Ha S, Kersner M, Kim B, et al. MarioNETte: Few-shot face reenactment preserving identity of unseen targets[C] //Proc of the 34th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2020: 10893−10900
[54]	Hao Hanxiang, Baireddy S, Reibman A R, et al. Far-GAN for one-shot face reenactment[J]. arXiv preprint, arXiv: 2005.06402, 2020
[55]	Fried O, Tewari A, Zollhofer M, et al. Text-based editing of talking-head video[J]. ACM Transactions on Graphics, 2019, 38(4): 68: 1−68: 14
[56]	Kumar R, Sotelo J, Kumar K, et al. ObamaNet: Photo-realistic lip-sync from text[J]. arXiv preprint, arXiv: 1801.01442, 2017
[57]	Sotelo J, Mehri S, Kumar K, et al. Char2wav: End-to-end speech synthesis[C] //Proc of the ICLR 2017 Workshop. 2017: 24−26
[58]	Jamaludin A, Chung J S, Zisserman A. You said that?: Synthesising talking faces from audio[J]. International Journal of Computer Vision, 2019, 127(11): 1767−1779
[59]	Vougioukas K, Petridis S, Pantic M. Realistic speech-driven facial animation with GANs[J]. International Journal of Computer Vision, 2020, 128(5): 1398−1413 doi: 10.1007/s11263-019-01251-8
[60]	Suwajanakorn S, Seitz S M, Kemelmacher-Shlizerman I. Synthesizing Obama: Learning lip sync from audio[J]. ACM Transactions on Graphics, 2017, 36(4): 95: 1−95: 13
[61]	Chen Lele, Maddox R K, Duan Zhiyao, et al. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss[C] //Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 7832−7841
[62]	Zhou Hang, Liu Yu, Liu Ziwei, et al. Talking face generation by adversarially disentangled audio-visual representation[C] //Proc of the 33rd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2019: 9299−9306
[63]	Thies J, Elgharib M, Tewari A, et al. Neural voice puppetry: Audio-driven facial reenactment[C] //Proc of the 16th European Conf on Computer Vision (ECCV). Berlin: Springer, 2020: 716−731
[64]	Hannun A, Case C, Casper J, et al. DeepSpeech: Scaling up end-to-end speech recognition[J]. arXiv preprint, arXiv: 1412.5567, 2014
[65]	Karras T, Laine S, Aila T. A style-based generator architecture for generative adversarial networks[C] //Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 4401−4410
[66]	Karras T, Laine S, Aittala M, et al. Analyzing and improving the image quality of StyleGAN[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 8110−8119
[67]	Karras T, Aittala M, Laine S, et al. Alias-free generative adversarial networks[J]. arXiv preprint, arXiv: 2106.12423, 2021
[68]	Choi Y, Choi M, Kim M, et al. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation[C] //Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 8789−8797
[69]	Choi Y, Uh Y, Yoo J, et al. StarGAN v2: Diverse image synthesis for multiple domains[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 8188−8197
[70]	Sanchez E, Valstar M. Triple consistency loss for pairing distributions in GAN-based face synthesis[J]. arXiv preprint, arXiv: 1811.03492, 2018
[71]	Kim D, Khan M A, Choo J. Not just compete, but collaborate: Local image-to-image translation via cooperative mask prediction[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 6509−6518
[72]	Li Xinyang, Zhang Shengchuan, Hu Jie, et al. Image-to-image translation via hierarchical style disentanglement[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 8639−8648
[73]	Aberman K, Shi Mingyi, Liao Jing, et al. Deep video-based performance cloning[J]. Computer Graphics Forum, 2019, 38(2): 219−233 doi: 10.1111/cgf.13632
[74]	Chan C, Ginosar S, Zhou Tinghui, et al. Everybody dance now[C] //Proc of the 32nd IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 5933−5942
[75]	Liu Lingjie, Xu Weipeng, Zollhofer M, et al. Neural rendering and reenactment of human actor videos[J]. ACM Transactions on Graphics, 2019, 38(5): 139: 1−139: 14
[76]	Tokuda K, Nankaku Y, Toda T, et al. Speech synthesis based on hidden Markov models[J]. Proceedings of the IEEE, 2013, 101(5): 1234−1252 doi: 10.1109/JPROC.2013.2251852
[77]	Oord A, Dieleman S, Zen H, et al. WaveNet: A generative model for raw audio[J]. arXiv preprint, arXiv: 1609.03499, 2016
[78]	Wang Yuxuan, Skerry-Ryan R, Stanton D, et al. Tacotron: A fully end-to-end text-to-speech synthesis model[J]. arXiv preprint, arXiv: 1703.10135, 2017
[79]	Shen J, Pang Ruoming, Weiss R J, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C] //Proc of the 43rd IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2018: 4779−4783
[80]	Fu Ruibo, Tao Jianhua, Wen Zhengqi, et al. Focusing on attention: Prosody transfer and adaptative optimization strategy for multi-speaker end-to-end speech synthesis[C] //Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 6709−6713
[81]	Kumar K, Kumar R, de Boissiere T, et al. MelGAN: Generative adversarial networks for conditional waveform synthesis[J]. arXiv preprint, arXiv: 1910.06711, 2019
[82]	Yang Geng, Yang Shan, Liu Kai, et al. Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech[C] //Proc of the 8th IEEE Spoken Language Technology Workshop (SLT). Piscataway, NJ: IEEE, 2021: 492−498
[83]	Kaneko T, Kameoka H. CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks[C] //Proc of the 27th European Signal Processing Conf (EUSIPCO). Piscataway, NJ: IEEE, 2018: 2100−2104
[84]	Kaneko T, Kameoka H, Tanaka K, et al. CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion[C] //Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2019: 6820−6824
[85]	Kaneko T, Kameoka H, Tanaka K, et al. CycleGAN-VC3: Examining and improving CycleGAN-VCs for mel-spectrogram conversion[J]. arXiv preprint, arXiv: 2010.11672, 2020
[86]	Kameoka H, Kaneko T, Tanaka K, et al. StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks[C] //Proc of the 7th IEEE Spoken Language Technology Workshop (SLT). Piscataway, NJ: IEEE, 2018: 266−273
[87]	Kaneko T, Kameoka H, Tanaka K, et al. StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion[J]. arXiv preprint, arXiv: 1907.12279, 2019
[88]	Liu Ruolan, Chen Xiao, Wen Xue. Voice conversion with transformer network[C] //Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 7759−7759
[89]	Luong H T, Yamagishi J. Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech[C] //Proc of the 16th IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Piscataway, NJ: IEEE, 2019: 200−207
[90]	Zhang Mingyang, Zhou Yi, Zhao Li, et al. Transfer learning from speech synthesis to voice conversion with non-parallel training data[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29(1): 1290−1302
[91]	Huang Wenqin, Hayashi T, Wu Yiqiao, et al. Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining[J]. arXiv preprint, arXiv: 1912.06813, 2019
[92]	Matern F, Riess C, Stamminger M. Exploiting visual artifacts to expose deepfakes and face manipulations[C] //Proc of the 20th IEEE Winter Applications of Computer Vision Workshops (WACVW). Piscataway, NJ: IEEE, 2019: 83−92
[93]	Zhou Peng, Han Xintong, Morariu V I, et al. Two-stream neural networks for tampered face detection[C] //Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition Workshops (CVPRW). Piscataway, NJ: IEEE, 2017: 1831−1839
[94]	Nataraj L, Mohammed T M, Manjunath B, et al. Detecting GAN generated fake images using co-occurrence matrices[J]. Electronic Imaging, 2019: 1−7
[95]	Li Jiaming, Xie Hongtao, Li Jiahong, et al. Frequency-aware discriminative feature learning supervised by single-center loss for face forgery detection[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 6458−6467
[96]	Luo Yuchen, Zhang Yong, Yan Junchi, et al. Generalizing face forgery detection with high-frequency features[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 16317−16326
[97]	Shang Zhihua, Xie Hongtao, Zha Zhengjun, et al. PrrNet: Pixel-region relation network for face forgery detection[J/OL]. Pattern Recognition, 2021, 116 [2021-10-15]. https://doi.org/10.1016/j.patcog.2021.107950
[98]	Li Yuezun, Lyu Siwei. Exposing deepfake videos by detecting face warping artifacts[J]. arXiv preprint, arXiv: 1811.00656, 2018
[99]	Li Lingzhi, Bao Jianmin, Zhang Ting, et al. Face x-ray for more general face forgery detection[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 5001−5010
[100]	Li Xurong, Yu Kun, Ji Shouling, et al. Fighting against deepfake: Patch&pair convolutional neural networks (PPCNN)[C] //Proc of the 29th Web Conf. New York: ACM, 2020: 88−89
[101]	Nguyen H, Fang Fuming, Yamagishi J, et al. Multi-task learning for detecting and segmenting manipulated facial images and videos[J]. arXiv preprint, arXiv: 1906.06876, 2019
[102]	Nirkin Y, Wolf L, Keller Y, et al. Deepfake detection based on the discrepancy between the face and its context[J]. arXiv preprint, arXiv: 2008.12262, 2020
[103]	Amerini I, Caldelli R. Exploiting prediction error inconsistencies through LSTM-based classifiers to detect deepfake videos[C] //Proc of the 8th ACM Workshop on Information Hiding and Multimedia Security. New York: ACM, 2020: 97−102
[104]	Amerini I, Galteri L, Caldelli R, et al. Deepfake video detection through optical flow based CNN[C] //Proc of the 32nd IEEE/CVF Int Conf on Computer Vision Workshops. Piscataway, NJ: IEEE, 2019: 1205−1207
[105]	Guera D, Delp E J. Deepfake video detection using recurrent neural networks[C/OL] //Proc of the 15th IEEE Int Conf on Advanced Video and Signal Based Surveillance (AVSS). Piscataway, NJ: IEEE, 2018 [2021-10-15]. https://doi.org/10.1109/AVSS.2018.8639163
[106]	Sun Zekun, Han Yujie, Hua Zeyu, et al. Improving the efficiency and robustness of deepfakes detection through precise geometric features[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 3609−3618
[107]	Sabir E, Cheng Jiaxin, Jaiswal A, et al. Recurrent convolutional strategies for face manipulation detection in videos[C] //Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops. Piscataway, NJ: IEEE, 2019: 80−87
[108]	Agarwal S, Farid H, Gu Yuming, et al. Protecting world leaders against deep fakes[C] //Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops. Piscataway, NJ: IEEE, 2019: 38−45
[109]	Agarwal S, Farid H, Fried O, et al. Detecting deep-fake videos from phoneme-viseme mismatches[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops. Piscataway, NJ: IEEE, 2020: 660−661
[110]	Yang Xin, Li Yuezun, Lyu Siwei. Exposing deep fakes using inconsistent head poses[C] //Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2019: 8261−8265
[111]	Ciftci U A, Demir I, Yin Lijun. FakeCatcher: Detection of synthetic portrait videos using biological signals[J/OL]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020 [2021-10-15]. https://doi.org/10.1109/TPAMI.2020.3009287
[112]	Fernandes S, Raj S, Ewetz R, et al. Detecting deepfake videos using attribution-based confidence metric[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops. Piscataway, NJ: IEEE, 2020: 308−309
[113]	Jha S, Raj S, Fernandes S, et al. Attribution-based confidence metric for deep neural networks[C] //Proc of the 32nd Int Conf on Neural Information Processing Systems. La Jolla, CA: NIPS, 2019: 11826−11837
[114]	McCloskey S, Albright M. Detecting GAN-generated imagery using color cues[J]. arXiv preprint, arXiv: 1812.08247, 2018
[115]	Guarnera L, Giudice O, Battiato S. Deepfake detection by analyzing convolutional traces[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops. Piscataway, NJ: IEEE, 2020: 666−667
[116]	Qian Yuyang, Yin Guojun, Sheng Lu, et al. Thinking in frequency: Face forgery detection by mining frequency-aware clues[C] //Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 86−103
[117]	Masi I, Killekar A, Mascarenhas R M, et al. Two-branch recurrent network for isolating deepfakes in videos[C] //Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 667−684
[118]	Liu Honggu, Li Xiaodan, Zhou Wenbo, et al. Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 772−781
[119]	Agarwal S, Farid H, EL-Gaaly T, et al. Detecting deepfake videos from appearance and behavior[C/OL] //Proc of the 12th IEEE Int Workshop on Information Forensics and Security (WIFS). Piscataway, NJ: IEEE, 2020 [2021-10-15]. https://doi.org/10.1109/WIFS49906.2020.9360904
[120]	Wiles O, Koepke A, Zisserman A. Self-supervised learning of a facial attribute embedding from video[J]. arXiv preprint, arXiv: 1808.06882, 2018
[121]	Cozzolino D, Rossler A, Thies J, et al. ID-Reveal: Identity-aware deepfake video detection[J]. arXiv preprint, arXiv: 2012.02512, 2020
[122]	Dong Xiaoyi, Bao Jianmin, Chen Dongdong, et al. Identity-driven deepfake detection[J]. arXiv preprint, arXiv: 2012.03930, 2020
[123]	Jiang Jun, Wang Bo, Li Bing, et al. Practical face swapping detection based on identity spatial constraints[C] //Proc of the 7th IEEE Int Joint Conf on Biometrics (IJCB). Piscataway, NJ: IEEE, 2021: 1−8
[124]	Lewis J K, Toubal I E, Chen Helen, et al. Deepfake video detection based on spatial, spectral, and temporal inconsistencies using multi-modal deep learning[C/OL] //Proc of the 49th IEEE Applied Imagery Pattern Recognition Workshop (AIPR). Piscataway, NJ: IEEE, 2020 [2021-10-15]. https://doi.org/10.1109/AIPR50011.2020.9425167
[125]	Lomnitz M, Hampel-Arias Z, Sandesara V, et al. Multimodal approach for deepfake detection[C/OL] //Proc of the 49th IEEE Applied Imagery Pattern Recognition Workshop (AIPR). Piscataway, NJ: IEEE, 2020 [2021-10-15]. https://doi.org/10.1109/AIPR50011.2020.9425192
[126]	Ravanelli M, Bengio Y. Speaker recognition from raw waveform with SincNet[C] //Proc of the 7th IEEE Spoken Language Technology Workshop (SLT). Piscataway, NJ: IEEE, 2018: 1021−1028
[127]	Mittal T, Bhattacharya U, Chandra R, et al. Emotions don’t lie: An audio-visual deepfake detection method using affective cues[C] //Proc of the 28th ACM Int Conf on Multimedia. New York: ACM, 2020: 2823−2832
[128]	Hosler B, Salvi D, Murray A, et al. Do deepfakes feel emotions? A semantic approach to detecting deepfakes via emotional inconsistencies[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 1013−1022
[129]	Afchar D, Nozick V, Yamagishi J, et al. MesoNet: A compact facial video forgery detection network[C/OL] //Proc of the 10th IEEE Int Workshop on Information Forensics and Security (WIFS). Piscataway, NJ: IEEE, 2018 [2021-10-15]. https://doi.org/10.1109/WIFS.2018.8630761
[130]	Jain A, Singh R, Vatsa M. On detecting GANs and retouching based synthetic alterations[C/OL] //Proc of the 9th Int Conf on Biometrics Theory, Applications and Systems (BTAS). Piscataway, NJ: IEEE, 2018 [2021-10-15]. https://doi.org/10.1109/BTAS.2018.8698545
[131]	Wang Run, Xu Juefei, Ma Lei, et al. FakeSpotter: A simple yet robust baseline for spotting AI-synthesized fake faces[J]. arXiv preprint, arXiv: 1909.06122, 2019
[132]	Dang Hao, Liu Feng, Stehouwer J, et al. On the detection of digital face manipulation[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 5781−5790
[133]	Hsu C C, Zhuang Yixiu, Lee C Y. Deep fake image detection based on pairwise learning[J/OL]. Applied Sciences, 2020 [2021-10-15]. https://doi.org/10.3390/app10010370
[134]	Khalid H, Woo S S. OC-FakeDect: Classifying deepfakes using one-class variational autoencoder[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops. Piscataway, NJ: IEEE, 2020: 656−657
[135]	Rana M S, Sung A H. DeepfakeStack: A deep ensemble-based learning technique for deepfake detection[C] //Proc of the 7th IEEE Int Conf on Cyber Security and Cloud Computing (CSCloud)/IEEE Int Conf on Edge Computing and Scalable Cloud (EdgeCom). Piscataway, NJ: IEEE, 2020: 70−75
[136]	Bonettini N, Cannas E D, Mandelli S, et al. Video face manipulation detection through ensemble of CNNs[C] //Proc of the 31st Int Conf on Pattern Recognition (ICPR). Piscataway, NJ: IEEE, 2021: 5012−5019
[137]	Kim M, Tariq S, Woo S S. FReTAL: Generalizing deepfake detection using knowledge distillation and representation learning[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 1001−1012
[138]	Aneja S, Niessner M. Generalized zero and few-shot transfer for facial forgery detection[J]. arXiv preprint, arXiv: 2006.11863, 2020
[139]	Wang Chengrui, Deng Weihong. Representative forgery mining for fake face detection[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 14923−14932
[140]	Zhao Hanqing, Zhou Wenbo, Chen Dongdong, et al. Multi-attentional deepfake detection[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 2185−2194
[141]	Kumar P, Vatsa M, Singh R. Detecting Face2Face facial reenactment in videos[C] //Proc of the 20th IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2020: 2589−2597
[142]	Jeon H, Bang Y, Woo S S. FDFtNet: Facing off fake images using fake detection fine-tuning network[C] //Proc of the 35th IFIP Int Conf on ICT Systems Security and Privacy Protection. Berlin: Springer, 2020: 416−430
[143]	Wang Shengyu, Wang O, Zhang R, et al. CNN-generated images are surprisingly easy to spot for now[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 8695−8704
[144]	Liu Zhengzhe, Qi Xiaojuan, Torr P. Global texture enhancement for fake face detection in the wild[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 8060−8069
[145]	Wodajo D, Atnafu S. Deepfake video detection using convolutional vision transformer[J]. arXiv preprint, arXiv: 2102.11126, 2021
[146]	Wang Junke, Wu Zuxuan, Chen Jingjing, et al. M2TR: Multi-modal multi-scale transformers for deepfake detection[J]. arXiv preprint, arXiv: 2104.09770, 2021
[147]	Heo Y, Choi Y, Lee Y, et al. Deepfake detection scheme based on vision transformer and distillation[J]. arXiv preprint, arXiv: 2104.01353, 2021
[148]	Dolhansky B, Howes R, Pflaum B, et al. The deepfake detection challenge (DFDC) preview dataset[J]. arXiv preprint, arXiv: 1910.08854, 2019
[149]	Li Yuezun, Yang Xin, Sun Pu, et al. Celeb-DF: A large-scale challenging dataset for deepfake forensics[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 3207−3216
[150]	Ondyari. Deepfake detection (DFD) dataset [DB/OL]. 2018 [2021-10-15]. https://github.com/ondyari/FaceForensics
[151]	Korshunov P, Marcel S. Deepfakes: A new threat to face recognition? Assessment and detection[J]. arXiv preprint, arXiv: 1812.08685, 2018
[152]	Li Yuezun, Yang Xin, Sun Pu, et al. Celeb-DF (v2): A new dataset for deepfake forensics[J]. arXiv preprint, arXiv: 1909.12962, 2019
[153]	Ruiz N, Bargal S A, Sclaroff S. Disrupting deepfakes: Adversarial attacks against conditional image translation networks and facial manipulation systems[C] //Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 236−251
[154]	Huang Qidong, Zhang Jie, Zhou Wenbo, et al. Initiative defense against facial manipulation[C] //Proc of the 35th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2021: 1619−1627
[155]	Dong Junhao, Xie Xiaohua. Visually maintained image disturbance against deepfake face swapping[C/OL] //Proc of the 22nd IEEE Int Conf on Multimedia and Expo (ICME). Piscataway, NJ: IEEE, 2021 [2021-10-15]. https://doi.org/10.1109/ICME51207.2021.9428173
[156]	Neves J C, Tolosana R, Vera-Rodriguez R, et al. Real or fake? Spoofing state-of-the-art face synthesis detection systems[J]. arXiv preprint, arXiv: 1911.05351, 2019
[157]	Carlini N, Farid H. Evading deepfake-image detectors with white- and black-box attacks[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops. Piscataway, NJ: IEEE, 2020: 658−659
[158]	Hussain S, Neekhara P, Jere M, et al. Adversarial deepfakes: Evaluating vulnerability of deepfake detectors to adversarial examples[C] //Proc of the 21st IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2021: 3348−3357
[159]	Patel T B, Patil H A. Cochlear filter and instantaneous frequency based features for spoofed speech detection[J]. IEEE Journal of Selected Topics in Signal Processing, 2016, 11(4): 618−631
[160]	Tom F, Jain M, Dey P. End-to-end audio replay attack detection using deep convolutional networks with attention[C/OL] //Proc of the 20th Interspeech. 2018 [2021-10-15]. https://www.isca-speech.org/archive_v0/Interspeech_2018/abstracts/2279.html
[161]	Das R K, Yang Jichen, Li Haizhou. Assessing the scope of generalized counter-measures for anti-spoofing[C] //Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 6589−6593
[162]	Lavrentyeva G, Novoselov S, Malykh E, et al. Audio replay attack detection with deep learning frameworks[C/OL] //Proc of the 19th Interspeech. 2017 [2021-10-15]. https://www.isca-speech.org/archive_v0/Interspeech_2017/abstracts/0360.html
[163]	Wu Xiang, He Ran, Sun Zhenan, et al. A light CNN for deep face representation with noisy labels[J]. IEEE Transactions on Information Forensics and Security, 2018, 13(11): 2884−2896 doi: 10.1109/TIFS.2018.2833032
[164]	Lavrentyeva G, Novoselov S, Tseren A, et al. STC anti-spoofing systems for the ASVspoof 2019 challenge[J]. arXiv preprint, arXiv: 1904.05576, 2019
[165]	Cai Weicheng, Wu Haiwei, Cai Danwei, et al. The DKU replay detection system for the ASVspoof 2019 challenge: On data augmentation, feature representation, classification, and fusion[J]. arXiv preprint, arXiv: 1907.02663, 2019
[166]	Lai C I, Chen Nanxin, Villalba J, et al. ASSERT: Anti-spoofing with squeeze-excitation and residual networks[J]. arXiv preprint, arXiv: 1904.01120, 2019
[167]	Parasu P, Epps J, Sriskandaraja K, et al. Investigating light-resnet architecture for spoofing detection under mismatched conditions[C/OL] //Proc of the 22nd Interspeech. 2020 [2021-10-15]. https://www.isca-speech.org/archive_v0/Interspeech_2020/abstracts/2039.html
[168]	Ma Haoxin, Yi Jiangyan, Tao Jianhua, et al. Continual learning for fake audio detection[J]. arXiv preprint, arXiv: 2104.07286, 2021
[169]	Li Zhizhong, Hoiem D. Learning without forgetting[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(12): 2935−2947
[170]	Dolhansky B, Bitton J, Pflaum B, et al. The deepfake detection challenge (DFDC) dataset[J]. arXiv preprint, arXiv: 2006.07397, 2020
[171]	Peng Bo, Fan Hongxing, Wang Wei, et al. DFGC 2021: A deepfake game competition[J]. arXiv preprint, arXiv: 2106.01217, 2021
[172]	Zi Bojia, Chang Minghao, Chen Jingjing, et al. WildDeepfake: A challenging real-world dataset for deepfake detection[C] //Proc of the 28th ACM Int Conf on Multimedia. New York: ACM, 2020: 2382−2390
[173]	Jiang Liming, Li Ren, Wu W, et al. DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 2889−2898
[174]	Fox G, Liu Wentao, Kim H, et al. Video ForensicsHQ: Detecting high-quality manipulated face videos[C/OL] //Proc of the 22nd IEEE Int Conf on Multimedia and Expo (ICME). Piscataway, NJ: IEEE, 2021 [2021-10-15]. https://doi.org/10.1109/ICME51207.2021.9428101
[175]	Zhou Tianfei, Wang Wenguan, Liang Zhiyuan, et al. Face forensics in the wild[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 5778−5788
[176]	He Yinan, Gan Bei, Chen Siyu, et al. ForgeryNet: A versatile benchmark for comprehensive forgery analysis[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 4360−4369
[177]	Khalid H, Tariq S, Woo S S. FakeAVCeleb: A novel audio-video multimodal deepfake dataset[J]. arXiv preprint, arXiv: 2108.05080, 2021
[178]	University of Edinburgh, the Centre for Speech Technology Research (CSTR). ASVspoof 2015 database[DB/OL]. 2015 [2021-10-15]. https://datashare.ed.ac.uk/handle/10283/853
[179]	University of Edinburgh, the Centre for Speech Technology Research (CSTR). ASVspoof 2017 database[DB/OL]. 2017 [2021-10-15]. https://datashare.ed.ac.uk/handle/10283/3055
[180]	University of Edinburgh, the Centre for Speech Technology Research (CSTR). ASVspoof 2019 database[DB/OL]. 2019 [2021-10-15]. https://datashare.ed.ac.uk/handle/10283/3336
[181]	Krishnan P, Kovvuri R, Pang Guan, et al. Textstyle brush: Transfer of text aesthetics from a single example[J]. arXiv preprint, arXiv: 2106.08385, 2021

