ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2021, Vol. 58 ›› Issue (7): 1466-1475. doi: 10.7544/issn1000-1239.2021.20200799

Special Issue: 2021 Special Issue on Disinformation Detection


Global and Temporal-Frequency Attention Based Network in Audio Deepfake Detection

Wang Chenglong1,2, Yi Jiangyan2, Tao Jianhua2,3, Ma Haoxin2, Tian Zhengkun2, Fu Ruibo2   

  1 (College of Information Science and Technology, University of Science and Technology, Hefei 230027); 2 (National Laboratory of Pattern Recognition (Institute of Automation, Chinese Academy of Sciences), Beijing 100080); 3 (School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049)
  • Online:2021-07-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2017YFC0820602), the National Natural Science Foundation of China (61831022, 61901473, 61771472, 61773379), and Inria-CAS Joint Research Project (173211KYSB20190049).

Abstract: Audio deepfake detection has become a hot topic in recent years and has attracted wide attention. Convolutional neural networks and their variants have made good progress on this task, but two problems remain: 1) Current work assumes that every position of the feature map fed into the convolutional network contributes equally to the result, ignoring that different locations in each dimension of the feature map emphasize different information. 2) Current work focuses on local information in the feature map and cannot exploit relationships within the feature map from a global view. To address these challenges, we introduce a network based on global and temporal-frequency attention, which attends to the channel dimension and the temporal-frequency dimensions, respectively. Specifically, we introduce two parallel attention modules: a temporal-frequency attention module and a global attention module. The temporal-frequency attention module updates the features by weighted aggregation over all temporal-frequency feature maps. The global attention module draws on the idea of SE-Net, generating a weight for each feature channel through learned parameters, and in this way captures the global distribution of responses across the feature channels. We carry out a series of experiments on the ASVspoof 2019 LA open dataset; the results show that the proposed model performs well, with the best model reaching an EER of 4.12%, a new best result among single models.
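The two modules described in the abstract can be sketched roughly as follows. This is an illustrative NumPy sketch under assumed shapes and parameterizations (a 1x1 channel projection for the temporal-frequency scores, and two fully connected layers with reduction ratio `r` for the SE-style global branch), not the authors' implementation; the paper's modules use learned convolutional parameters and their exact aggregation may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_channel_attention(x, w1, w2):
    """SE-Net-style global attention (assumed form): squeeze each channel's
    temporal-frequency map to a scalar by average pooling, then excite via
    two FC layers (ReLU, then sigmoid) to get one weight per channel.
    x: (C, T, F) feature map; w1: (C//r, C); w2: (C, C//r)."""
    squeeze = x.mean(axis=(1, 2))                         # (C,) global descriptor
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))  # (C,) channel weights in (0, 1)
    return x * excite[:, None, None]                      # rescale each channel

def temporal_frequency_attention(x, w):
    """Temporal-frequency attention (assumed form): a 1x1 projection across
    channels scores every (t, f) position; a softmax over all positions gives
    weights for aggregating over the temporal-frequency plane.
    x: (C, T, F); w: (C,) hypothetical projection vector."""
    scores = np.tensordot(w, x, axes=([0], [0]))   # (T, F) position scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over all T*F positions
    return x * weights[None, :, :]                 # reweight each position

# Toy shapes: 8 channels, 10 time frames, 12 frequency bins, reduction ratio 2.
rng = np.random.default_rng(0)
C, T, F, r = 8, 10, 12, 2
x = rng.standard_normal((C, T, F))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
w = rng.standard_normal(C) * 0.1

# The two branches run in parallel on the same feature map; here their
# outputs are simply summed as one plausible way to fuse them.
out = global_channel_attention(x, w1, w2) + temporal_frequency_attention(x, w)
print(out.shape)
```

Both branches preserve the input shape, so they can be dropped into a convolutional stack between layers; only the fusion rule (sum, concatenation, etc.) changes the downstream channel count.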

Key words: audio deepfake detection, attention mechanism, voice forgery detection, global attention, temporal-frequency attention

CLC Number: