ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2021, Vol. 58 ›› Issue (7): 1466-1475. doi: 10.7544/issn1000-1239.2021.20200799

Special Topic: 2021 Special Issue on Disinformation Detection

• Information Processing •

Audio Deepfake Detection Based on a Global and Temporal-Frequency Attention Network

Wang Chenglong1,2, Yi Jiangyan2, Tao Jianhua2,3, Ma Haoxin2, Tian Zhengkun2, Fu Ruibo2

  1(College of Information Science and Technology, University of Science and Technology of China, Hefei 230027); 2(National Laboratory of Pattern Recognition (Institute of Automation, Chinese Academy of Sciences), Beijing 100080); 3(School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049) (chenglong.wang@nlpr.ia.ac.cn)
  • Publication Date: 2021-07-01
  • Supported by: 
    National Key Research and Development Program of China (2017YFC0820602); National Natural Science Foundation of China (61831022, 61901473, 61771472, 61773379); Inria-CAS Joint Research Project (173211KYSB20190049)

Global and Temporal-Frequency Attention Based Network in Audio Deepfake Detection

Wang Chenglong1,2, Yi Jiangyan2, Tao Jianhua2,3, Ma Haoxin2, Tian Zhengkun2, Fu Ruibo2   

  1(College of Information Science and Technology, University of Science and Technology of China, Hefei 230027); 2(National Laboratory of Pattern Recognition (Institute of Automation, Chinese Academy of Sciences), Beijing 100080); 3(School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049)
  • Online: 2021-07-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2017YFC0820602), the National Natural Science Foundation of China (61831022, 61901473, 61771472, 61773379), and the Inria-CAS Joint Research Project (173211KYSB20190049).

Abstract: Audio deepfake detection has been a hot research topic in recent years and has received wide attention. Convolutional neural networks and their variants have achieved good progress on this task. However, two problems remain: 1) current work assumes that every dimension of the feature map fed into the convolutional neural network affects the result equally, ignoring that different positions of the feature map along each dimension emphasize different information; 2) in addition, most previous work attends only to local information in the feature map and does not exploit the relationships among feature maps from a global view. To address these challenges, a global and temporal-frequency attention framework is introduced, which applies attention transforms to the channel dimension and the temporal-frequency dimensions, respectively. Specifically, two parallel attention modules are introduced: 1) a temporal-frequency attention module and 2) a global attention module. The temporal-frequency attention module updates the features by aggregating, with a weighted sum, over all temporal-frequency feature maps. The global attention module draws on the idea of SE-Net and generates a weight for each feature channel through learned parameters, which yields the global distribution of responses over the feature channels. A series of experiments on the public ASVspoof 2019 LA dataset shows that the proposed model performs well: the best model achieves an equal error rate of 4.12%, a new best result for a single model.
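
To make the global attention branch concrete, the sketch below shows a minimal SE-Net-style channel attention module: the time-frequency plane is squeezed to a single value per channel, and a small bottleneck network then produces a weight for every feature channel. The use of PyTorch, the class name, and the reduction ratio of 16 are illustrative assumptions rather than the paper's exact implementation.

# Minimal PyTorch sketch of an SE-Net-style "global attention" (channel attention)
# module. Class name, reduction ratio, and PyTorch itself are assumptions for
# illustration, not the authors' exact implementation.
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # average over the time x frequency plane
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                             # one weight in (0, 1) per feature channel
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, frequency), e.g. a CNN feature map of an utterance
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                  # re-weight each channel by its global response

if __name__ == "__main__":
    feat = torch.randn(4, 64, 100, 60)                # hypothetical feature map shape
    print(GlobalAttention(64)(feat).shape)            # torch.Size([4, 64, 100, 60])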

Keywords: speech anti-spoofing, attention mechanism, audio deepfake detection, global attention, temporal-frequency attention

Abstract: Audio deepfake detection has become a hot research topic in recent years and has attracted wide attention. Convolutional neural networks and their variants have made good progress on this task. However, two problems remain: 1) Current work assumes that every dimension of the feature map fed into the convolutional neural network contributes equally to the result, ignoring that different positions of the feature map along each dimension emphasize different information. 2) In addition, current work focuses on local information in the feature map and does not exploit the relationships among feature maps from a global view. To address these challenges, we introduce a global and temporal-frequency attention based network that applies attention to the channel dimension and the temporal-frequency dimensions, respectively. Specifically, we introduce two parallel attention modules: a temporal-frequency attention module and a global attention module. The temporal-frequency attention module updates the features by a weighted aggregation over all temporal-frequency feature maps. The global attention module draws on the idea of SE-Net and generates a weight for each feature channel through learned parameters; in this way, we obtain the global distribution of responses over the feature channels. A series of experiments on the public ASVspoof 2019 LA dataset shows that the proposed model performs well: the best model achieves an equal error rate (EER) of 4.12%, a new best result for a single model.
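
As a companion to the global-attention sketch above, the following PyTorch sketch illustrates one plausible reading of the temporal-frequency attention module: every time-frequency position is updated by a weighted sum over all positions of the feature map, and the two branches then run in parallel on the same input. The class names, the 1x1-convolution projections, and the sum-based fusion are assumptions made for clarity, not the authors' exact design.

# Illustrative PyTorch sketch of the temporal-frequency attention module and the
# parallel combination of the two branches. Shapes, projections, and the fusion
# rule are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class TemporalFrequencyAttention(nn.Module):
    """Each time-frequency position is updated by a weighted sum over all positions."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))    # learnable scale on the attention output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, f = x.shape
        n = t * f                                            # number of time-frequency positions
        q = self.query(x).view(b, -1, n).permute(0, 2, 1)    # (B, N, C/8)
        k = self.key(x).view(b, -1, n)                       # (B, C/8, N)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)        # (B, N, N) position-to-position weights
        v = self.value(x).view(b, c, n)                      # (B, C, N)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, t, f)
        return self.gamma * out + x                          # residual connection

class ParallelAttention(nn.Module):
    """Run the temporal-frequency and global branches in parallel and fuse them."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.tf_branch = TemporalFrequencyAttention(channels)
        self.global_branch = nn.Sequential(                  # SE-style branch, as sketched earlier
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.tf_branch(x) + x * self.global_branch(x)  # element-wise sum as one fusion choice

if __name__ == "__main__":
    feat = torch.randn(4, 64, 100, 60)               # (batch, channels, time, frequency)
    print(ParallelAttention(64)(feat).shape)         # torch.Size([4, 64, 100, 60])

An element-wise sum is only one way to fuse the two branches; concatenation followed by a 1x1 convolution is another common choice when combining parallel attention outputs.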

Key words: audio deepfake detection, attention mechanism, voice forgery detection, global attention, temporal-frequency attention

CLC Number: