Citation: Le Zheng, Hu Yongting, Xu Yong. A Survey of Audio-Driven Talking Face Video Generation and Identification[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440207
With the rapid advancement of artificial intelligence generation models and deepfakes, techniques for generating talking face videos by various methods have matured considerably. Among them, audio-driven talking face video generation methods have attracted significant attention for their remarkably realistic and natural output. Such methods use audio as the driving source, often combined with image or video material, to synthesize videos in which the target character’s mouth movements are synchronized with the audio. These technologies are now widely applied in fields such as virtual anchors, game animation, and film and television production, and show broad prospects for development. However, the potential negative impacts of the technology are also becoming apparent: improper or abusive use could lead to serious political and economic consequences. Against this backdrop, research on identifying various types of forged facial videos has emerged, which primarily assesses the authenticity of a video by examining the veracity of individual frames or the spatio-temporal consistency of frame sequences. This paper first systematically analyzes the classic algorithms and latest advances in audio-driven talking face video generation, organized chronologically and by the development of the underlying foundational models. Second, it lists the datasets and evaluation criteria commonly used for this task and compares them across multiple dimensions. It then analyzes and summarizes the forged facial video identification task, categorizing methods by whether they discriminate on individual video frames or on multi-frame sequences, and likewise summarizes the commonly used datasets and evaluation criteria. Finally, the paper outlines the challenges and future directions of this research field, aiming to provide a useful reference for subsequent related work.
[1] |
宋一飞,张炜,陈智能,等. 数字说话人视频生成综述[J]. 计算机辅助设计与图形学学报,2023,35(10):1457−1468
Song Yifei, Zhang Wei, Chen Zhineng, et al. A survey on talking head generation[J]. Journal of Computer-Aided Design & Computer Graphics, 2023, 35(10): 1457−1468 (in Chinese)
|
[2] |
Bainey K. AI-Driven Project Management: Harnessing the Power of Artificial Intelligence and ChatGPT to Achieve Peak Productivity and Success[M]. Hoboken, NJ: John Wiley & Sons, 2024
|
[3] |
张溢文,蔡满春,陈咏豪,等. 融合空间特征的多尺度深度伪造检测方法[J/OL]. 计算机工程:1−12[2024-07-06]. https://doi.org/10.19678/j.issn.1000-3428.0067789
Zhang Yiwen, Cai Manchun, Chen Yonghao, et al. Multi-scale deepfake detection method with fusion of spatial features[J/OL]. Computer Engineering: 1−12[2024-07-06]. https://doi.org/10.19678/j.issn.1000-3428.0067789 (in Chinese)
|
[4] |
盛文俊,曹林,张帆. 基于有监督注意力网络的伪造人脸视频检测[J]. 计算机工程与设计,2023,44(2):504−510
Sheng Wenjun, Cao Lin, Zhang Fan. Forged facial video detection based on supervised attention network[J]. Computer Engineering and Design, 2023, 44(2): 504−510 (in Chinese)
|
[5] |
Morishima S, Aizawa K, Harashima H. An intelligent facial image coding driven by speech and phoneme[C]//Proc of the 13th Int Conf on Acoustics, Speech, and Signal Processing. Piscataway, NJ: IEEE, 1989: 1795−1798
|
[6] |
Morishima S, Harashima H. A media conversion from speech to facial image for intelligent man-machine interface[J]. IEEE Journal on Selected Areas in Communications, 1991, 9(4): 594−600 doi: 10.1109/49.81953
|
[7] |
Yamamoto E, Nakamura S, Shikano K. Lip movement synthesis from speech based on Hidden Markov Models[J]. Speech Communication, 1998, 26(1/2): 105−115
|
[8] |
Lee S, Yook D S. Audio-to-visual conversion using hidden Markov models[C]//Proc of the 7th Pacific Rim Int Conf on Artificial Intelligence. Berlin: Springer, 2002: 563−570
|
[9] |
Aleksic P S, Katsaggelos A K. Speech-to-video synthesis using MPEG-4 compliant visual features[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2004, 14(5): 682−692 doi: 10.1109/TCSVT.2004.826760
|
[10] |
Zhang Xinjian, Wang Lijuan, Li Gang, et al. A new language independent, photo-realistic talking head driven by voice only[C]//Proc of the 14th Annual Conf of the Int Speech Communication Association. New York: ISCA, 2013: 2743−2747
|
[11] |
Taylor S, Kim T, Yue Y, et al. A deep learning approach for generalized speech animation[J]. ACM Transactions on Graphics (TOG), 2017, 36(4): 1−11
|
[12] |
Chung J S, Zisserman A. Out of time: Automated lip sync in the wild[C]//Proc of the 13th Asian Conf on Computer Vision. Berlin: Springer, 2017: 251−263
|
[13] |
Chung J S, Jamaludin A, Zisserman A. You said that?[J]. arXiv preprint, arXiv: 1705.02966, 2017
|
[14] |
Karras T, Aila T, Laine S, et al. Audio-driven facial animation by joint end-to-end learning of pose and emotion[J]. ACM Transactions on Graphics (ToG), 2017, 36(4): 1−12
|
[15] |
Cudeiro D, Bolkart T, Laidlaw C, et al. Capture, learning, and synthesis of 3D speaking styles[C]//Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 10101−10111
|
[16] |
Fan Bo, Wang Lijuan, Soong F K, et al. Photo-real talking head with deep bidirectional LSTM[C]//Proc of the 40th IEEE Int Conf on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2015: 4884−4888
|
[17] |
Fan Bo, Xie Lei, Yang Shan, et al. A deep bidirectional LSTM approach for video-realistic talking head[J]. Multimedia Tools and Applications, 2016, 75(9): 5287−5309 doi: 10.1007/s11042-015-2944-3
|
[18] |
Suwajanakorn S, Seitz S M, Kemelmacher-Shlizerman I. Synthesizing Obama: Learning lip sync from audio[J]. ACM Transactions on Graphics (ToG), 2017, 36(4): 1−13
|
[19] |
Pham H X, Cheung S, Pavlovic V. Speech-driven 3D facial animation with implicit emotional awareness: A deep learning approach[C]//Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition Workshops. Piscataway, NJ: IEEE, 2017: 80−88
|
[20] |
Eskimez S E, Maddox R K, Xu Chenliang, et al. Generating talking face landmarks from speech[C]//Proc of the 14th Int Conf on Latent Variable Analysis and Signal Separation. Berlin: Springer, 2018: 372−381
|
[21] |
Thies J, Elgharib M, Tewari A, et al. Neural voice puppetry: Audio-driven facial reenactment[C]//Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 716−731
|
[22] |
Zhou Yang, Han Xintong, Shechtman E, et al. MakeItTalk: Speaker-aware talking-head animation[J]. ACM Transactions on Graphics, 2020, 39(6): 1−15
|
[23] |
Wang Suzhen, Li Lincheng, Ding Yu, et al. Audio2Head: Audio-driven one-shot talking-head generation with natural head motion[J]. arXiv preprint, arXiv: 2107.09293, 2021
|
[24] |
Song Linsen, Wu W, Qian Chen, et al. Everybody’s talkin’: Let me talk as you want[J]. IEEE Transactions on Information Forensics and Security, 2022, 17: 585−598 doi: 10.1109/TIFS.2022.3146783
|
[25] |
Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139−144 doi: 10.1145/3422622
|
[26] |
Song Yang, Zhu Jingwen, Li Dawei, et al. Talking face generation by conditional recurrent adversarial network[J]. arXiv preprint, arXiv: 1804.04786, 2018
|
[27] |
Chen Lele, Li Zhiheng, Maddox R K, et al. Lip movements generation at a glance[C]//Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 520−535
|
[28] |
Chen Lele, Maddox R K, Duan Zhiyao, et al. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss[C]//Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 7832−7841
|
[29] |
Prajwal K R, Mukhopadhyay R, Philip J, et al. Towards automatic face-to-face translation[C]//Proc of the 27th ACM Int Conf on Multimedia. New York: ACM, 2019: 1428−1436
|
[30] |
Prajwal K R, Mukhopadhyay R, Namboodiri V P, et al. A lip sync expert is all you need for speech to lip generation in the wild[C]//Proc of the 28th ACM Int Conf on Multimedia. New York: ACM, 2020: 484−492
|
[31] |
Wang Jiadong, Qian Xinyuan, Zhang Malu, et al. Seeing what you said: Talking face generation guided by a lip reading expert[C]//Proc of the 36th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2023: 14653−14662
|
[32] |
Yin Fei, Zhang Yong, Cun Xiaodong, et al. Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan[C]//Proc of the 17th European Conf on Computer Vision. Berlin: Springer, 2022: 85−101
|
[33] |
Karras T, Laine S, Aittala M, et al. Analyzing and improving the image quality of stylegan[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 8110−8119
|
[34] |
Park S J, Kim M, Hong J, et al. Synctalkface: Talking face generation with precise lip-syncing via audio-lip memory[C]//Proc of the 36th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2022: 2062−2070
|
[35] |
Goyal S, Bhagat S, Uppal S, et al. Emotionally enhanced talking face generation[C]//Proc of the 1st Int Workshop on Multimedia Content Generation and Evaluation: New Methods and Practice. New York: ACM, 2023: 81−90
|
[36] |
Kingma D P, Welling M. Auto-encoding variational bayes[J]. arXiv preprint, arXiv: 1312.6114, 2013
|
[37] |
Mittal G, Wang Baoyuan. Animating face using disentangled audio representations[C]//Proc of the 2020 IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2020: 3290−3298
|
[38] |
Liu Jin, Wang Xi, Fu Xiaomeng, et al. Font: Flow-guided one-shot talking head generation with natural head motions[C]//Proc of the 24th IEEE Int Conf on Multimedia and Expo. Piscataway, NJ: IEEE, 2023: 2099−2104
|
[39] |
Doersch C. Tutorial on variational autoencoders[J]. arXiv preprint, arXiv: 1606.05908, 2016
|
[40] |
Zhang Wenxuan, Cun Xiaodong, Wang Xuan, et al. Sadtalker: Learning realistic 3D motion coefficients for stylized audio-driven single image talking face animation[C]//Proc of the 36th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2023: 8652−8661
|
[41] |
Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. arXiv preprint, arXiv: 1706.03762, 2017
|
[42] |
Khan S, Naseer M, Hayat M, et al. Transformers in vision: A survey[J]. ACM Computing Surveys (CSUR), 2022, 54(10): 1−41
|
[43] |
Fan Yingruo, Lin Zhaojiang, Saito J, et al. Faceformer: Speech-driven 3D facial animation with transformers[C]//Proc of the 35th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 18770−18780
|
[44] |
Wang Jiayu, Zhao Kang, Zhang Shiwei, et al. Lipformer: High-fidelity and generalizable talking face generation with a pre-learned facial codebook[C]//Proc of the 36th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2023: 13844−13853
|
[45] |
Zhong Weizhi, Fang Chaowei, Cai Yinqi, et al. Identity-preserving talking face generation with landmark and appearance priors[C]//Proc of the 36th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2023: 9729−9738
|
[46] |
Ma Haoyu, Zhang Tong, Sun Shanlin, et al. CVTHead: One-shot controllable head avatar with vertex-feature transformer[C]//Proc of the 2024 IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2024: 6131−6141
|
[47] |
Mildenhall B, Srinivasan P P, Tancik M, et al. Nerf: Representing scenes as neural radiance fields for view synthesis[J]. Communications of the ACM, 2021, 65(1): 99−106
|
[48] |
Guo Yudong, Chen Keyu, Liang Sen, et al. Ad-nerf: Audio driven neural radiance fields for talking head synthesis[C]//Proc of the 18th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 5784−5794
|
[49] |
Gafni G, Thies J, Zollhofer M, et al. Dynamic neural radiance fields for monocular 4D facial avatar reconstruction[C]//Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 8649−8658
|
[50] |
Yao Shunyu, Zhong Ruizhe, Yan Yichao, et al. DFA-NeRF: Personalized talking head generation via disentangled face attributes neural rendering[J]. arXiv preprint, arXiv: 2201.00791, 2022
|
[51] |
Tang Jiaxiang, Wang Kaisiyuan, Zhou Hang, et al. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition[J]. arXiv preprint, arXiv: 2211.12368, 2022
|
[52] |
Bi Chongke, Liu Xiaoxing, Liu Zhilei. NeRF-AD: Neural radiance field with attention-based disentanglement for talking face synthesis[J]. arXiv preprint, arXiv: 2401.12568, 2024
|
[53] |
Sohl-Dickstein J, Weiss E, Maheswaranathan N, et al. Deep unsupervised learning using nonequilibrium thermodynamics[C]//Proc of the 32nd Int Conf on Machine Learning. New York: ACM, 2015: 2256−2265
|
[54] |
Bigioi D, Basak S, Stypułkowski M, et al. Speech driven video editing via an audio-conditioned diffusion model[J]. arXiv preprint, arXiv: 2301.04474, 2023
|
[55] |
Stypułkowski M, Vougioukas K, He Sen, et al. Diffused heads: Diffusion models beat GANs on talking-face generation[C]//Proc of the 2024 IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2024: 5091−5100
|
[56] |
Shen Shuai, Zhao Wenliang, Meng Zibin, et al. DiffTalk: Crafting diffusion models for generalized audio-driven portraits animation[C]//Proc of the 36th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2023: 1982−1991
|
[57] |
Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proc of the 35th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 10684−10695
|
[58] |
Zhang Bingyuan, Zhang Xulong, Cheng Ning, et al. Emotalker: Emotionally editable talking face generation via diffusion model[C]//Proc of the 49th IEEE Int Conf on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2024: 8276−8280
|
[59] |
Cooke M, Barker J, Cunningham S, et al. An audio-visual corpus for speech perception and automatic speech recognition[J]. The Journal of the Acoustical Society of America, 2006, 120(5): 2421−2424 doi: 10.1121/1.2229005
|
[60] |
Cao Houwei, Cooper D G, Keutmann M K, et al. Crema-d: Crowd-sourced emotional multimodal actors dataset[J]. IEEE Transactions on Affective Computing, 2014, 5(4): 377−390 doi: 10.1109/TAFFC.2014.2336244
|
[61] |
Wang Kaisiyuan, Wu Qianyi, Song Linsen, et al. Mead: A large-scale audio-visual dataset for emotional talking-face generation[C]//Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 700−717
|
[62] |
Wuu C, Zheng Ningyuan, Ardisson S, et al. Multiface: A dataset for neural face rendering[J]. arXiv preprint, arXiv: 2207.11243, 2022
|
[63] |
Wu Sijing, Li Yunhao, Zhang Weitian, et al. SingingHead: A large-scale 4D dataset for singing head animation[J]. arXiv preprint, arXiv: 2312.04369, 2023
|
[64] |
Chung J S, Zisserman A. Lip reading in the wild[C]//Proc of the 13th Asian Conf on Computer Vision. Berlin: Springer, 2017: 87−103
|
[65] |
Nagrani A, Chung J S, Zisserman A. Voxceleb: A large-scale speaker identification dataset[J]. arXiv preprint, arXiv: 1706.08612, 2017
|
[66] |
Afouras T, Chung J S, Senior A, et al. Deep audio-visual speech recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 44(12): 8717−8727
|
[67] |
Afouras T, Chung J S, Zisserman A. LRS3-TED: A large-scale dataset for visual speech recognition[J]. arXiv preprint, arXiv: 1809.00496, 2018
|
[68] |
Chung J S, Nagrani A, Zisserman A. Voxceleb2: Deep speaker recognition[J]. arXiv preprint, arXiv: 1806.05622, 2018
|
[69] |
Yang Shuang, Zhang Yuanhang, Feng Dalu, et al. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild[J]. arXiv preprint, arXiv: 1810.06990, 2018
|
[70] |
Zhang Zhimeng, Li Lincheng, Ding Yu, et al. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset[C]//Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 3661−3670
|
[71] |
Zhu Hao, Wu W, Zhu Wentao, et al. CelebV-HQ: A large-scale video facial attributes dataset[C]//Proc of the 17th European Conf on Computer Vision. Berlin: Springer, 2022: 650−667
|
[72] |
Xie Liangbin, Wang Xintao, Zhang Honglun, et al. Vfhq: A high-quality dataset and benchmark for video face super-resolution[C]//Proc of the 35th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 657−666
|
[73] |
Dao T T, Vu D H, Pham C, et al. EFHQ: Multi-purpose ExtremePose-Face-HQ dataset[C]//Proc of the 37th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2024: 22605−22615
|
[74] |
Hore A, Ziou D. Image quality metrics: PSNR vs SSIM[C]//Proc of the 20th Int Conf on Pattern Recognition. Piscataway, NJ: IEEE, 2010: 2366−2369
|
[75] |
Wang Zhou, Bovik A C, Sheikh H R, et al. Image quality assessment: From error visibility to structural similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4): 600−612 doi: 10.1109/TIP.2003.819861
|
[76] |
Narvekar N D, Karam L J. A no-reference image blur metric based on the cumulative probability of blur detection (CPBD)[J]. IEEE Transactions on Image Processing, 2011, 20(9): 2678−2683 doi: 10.1109/TIP.2011.2131660
|
[77] |
Heusel M, Ramsauer H, Unterthiner T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[J]. arXiv preprint, arXiv: 1706.08500, 2017
|
[78] |
Zhang R, Isola P, Efros A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]//Proc of the 31st IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 586−595
|
[79] |
孙瑜,朱欣娟. 改进 Wav2Lip 的文本音频驱动人脸动画生成[J]. 计算机系统应用,2024,33(2):276−283
Sun Yu, Zhu Xinjuan. Text audio driven facial animation generation based on improved Wav2Lip[J]. Computer Systems & Application, 2024, 33(2): 276−283 (in Chinese)
|
[80] |
Yang Xin, Li Yuezun, Lyu S. Exposing deep fakes using inconsistent head poses[C]//Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2019: 8261−8265
|
[81] |
Matern F, Riess C, Stamminger M. Exploiting visual artifacts to expose deepfakes and face manipulations[C]//Proc of the 2019 IEEE/CVF Winter Conf on Applications of Computer Vision Workshops. Piscataway, NJ: IEEE, 2019: 83−92
|
[82] |
Li Lingzhi, Bao Jianmin, Zhang Ting, et al. Face X-ray for more general face forgery detection[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 5001−5010
|
[83] |
Tan Chuangchuang, Liu Huan, Zhao Yao, et al. Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection[J]. arXiv preprint, arXiv: 2312.10461, 2023
|
[84] |
韦争争. 基于局部纹理差异特征增强的Deepfake检测方法[J/OL]. 重庆工商大学学报:自然科学版,1−8[2024-03-01]. http://kns.cnki.net/kcms/detail/50.1155.N.20231127.1137.008.html
Wei Zhengzheng. Deepfake detection based on local texture difference feature enhancement[J/OL]. Journal of Chongqing Technology and Business University: Natural Sciences Edition, 1−8[2024-03-01]. http://kns.cnki.net/kcms/detail/50.1155.N.20231127.1137.008.html (in Chinese)
|
[85] |
Yang Jianwei, Lei Zhen, Li S Z. Learn convolutional neural network for face anti-spoofing[J]. arXiv preprint, arXiv: 1408.5601, 2014
|
[86] |
Rossler A, Cozzolino D, Verdoliva L, et al. Faceforensics++: Learning to detect manipulated facial images[C]//Proc of the 17th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 1−11
|
[87] |
Tan Mingxing, Le Q. Efficientnet: Rethinking model scaling for convolutional neural networks[C]//Proc of the 36th Int Conf on Machine Learning. New York: ACM, 2019: 6105−6114
|
[88] |
Li Yuezun, Lyu S. Exposing deepfake videos by detecting face warping artifacts[J]. arXiv preprint, arXiv: 1811.00656, 2018
|
[89] |
Zhao Hanqing, Zhou Wenbo, Chen Dongdong, et al. Multi-attentional deepfake detection[C]//Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 2185−2194
|
[90] |
Cao Junyi, Ma Chao, Yao Taiping, et al. End-to-end reconstruction-classification learning for face forgery detection[C]//Proc of the 35th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 4113−4122
|
[91] |
Wodajo D, Atnafu S, Akhtar Z. Deepfake video detection using generative convolutional vision transformer[J]. arXiv preprint, arXiv: 2307.07036, 2023
|
[92] |
Yan Zhiyuan, Zhang Yong, Fan Yanbo, et al. UCF: Uncovering common features for generalizable deepfake detection[J]. arXiv preprint, arXiv: 2304.13949, 2023
|
[93] |
Koopman M, Rodriguez A M, Geradts Z. Detection of deepfake video manipulation[C]//Proc of the 20th Irish Machine Vision and Image Processing Conf. Dublin, Ireland: IPRCS, 2018: 133−136
|
[94] |
Fernandes S, Raj S, Ortiz E, et al. Predicting heart rate variations of deepfake videos using neural ODE[C]//Proc of the 17th IEEE/CVF Int Conf on Computer Vision Workshops. Piscataway, NJ: IEEE, 2019: 1721−1729
|
[95] |
Qi Hua, Guo Qing, Juefei-Xu F, et al. Deeprhythm: Exposing deepfakes with attentional visual heartbeat rhythms[C]//Proc of the 28th ACM Int Conf on Multimedia. New York: ACM, 2020: 4318−4327
|
[96] |
Amerini I, Galteri L, Caldelli R, et al. Deepfake video detection through optical flow based CNN[C]//Proc of the 17th IEEE/CVF Int Conf on Computer Vision Workshops. Piscataway, NJ: IEEE, 2019: 1205−1207
|
[97] |
Knafo G, Fried O. FakeOut: Leveraging out-of-domain self-supervision for multi-modal video deepfake detection[J]. arXiv preprint, arXiv: 2212.00773, 2022
|
[98] |
Wang Tianyi, Chow K P. Noise based deepfake detection via multi-head relative-interaction[C]//Proc of the 37th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2023: 14548−14556
|
[99] |
Li Yuezun, Chang M C, Lyu S. In ictu oculi: Exposing AI-created fake videos by detecting eye blinking[C/OL]//Proc of the 10th IEEE Int Workshop on Information Forensics and Security. Piscataway, NJ: IEEE, 2018[2024-03-15]. https://ieeexplore.ieee.org/document/8630787
|
[100] |
Liu Weifeng, She Tianyi, Liu Jiawei, et al. Lips are lying: Spotting the temporal inconsistency between audio and visual in lip-syncing deepfakes[J]. arXiv preprint, arXiv: 2401.15668, 2024
|
[101] |
Korshunov P, Marcel S. Deepfakes: A new threat to face recognition? Assessment and detection[J]. arXiv preprint, arXiv: 1812.08685, 2018
|
[102] |
Sanderson C. The VidTIMIT Database[DB/OL].[2024-07-08]. http://conradsanderson.id.au/vidtimit/
|
[103] |
Dolhansky B, Bitton J, Pflaum B, et al. The deepfake detection challenge (dfdc) dataset[J]. arXiv preprint, arXiv: 2006.07397, 2020
|
[104] |
Jiang Liming, Li Ren, Wu W, et al. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 2889−2898
|
[105] |
Zhou Tianfei, Wang Wenguan, Liang Zhiyuan, et al. Face forensics in the wild[C]//Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 5778−5788
|
[106] |
He Yinan, Gan Bei, Chen Siyu, et al. Forgerynet: A versatile benchmark for comprehensive forgery analysis[C]//Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 4360−4369
|
[107] |
Korshunov P, Marcel S. Improving generalization of deepfake detection with data farming and few-shot learning[J]. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2022, 4(3): 386−397 doi: 10.1109/TBIOM.2022.3143404
|
[108] |
McCool C, Marcel S, Hadid A, et al. Bi-modal person recognition on a mobile phone: using mobile phone data[C]//Proc of the 2012 IEEE Int Conf on Multimedia and Expo Workshops. Piscataway, NJ: IEEE, 2012: 635−640
|
[109] |
Li Gen, Zhao Xianfeng, Cao Yun, et al. Fmfcc-v: An asian large-scale challenging dataset for deepfake detection[C]//Proc of the 10th ACM Workshop on Information Hiding and Multimedia Security. New York: ACM, 2022: 7−18
|
[110] |
Dagar D, Vishwakarma D K. Div-Df: A diverse manipulation deepfake video dataset[C/OL]//Proc of the 2023 Global Conf on Information Technologies and Communications. Piscataway, NJ: IEEE, 2023[2024-03-15]. https://ieeexplore.ieee.org/document/10426446
|
[111] |
Narayan K, Agarwal H, Thakral K, et al. Df-platter: Multi-face heterogeneous deepfake dataset[C]//Proc of the 36th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2023: 9739−9748
|
[112] |
Cai Zhixi, Ghosh S, Adatia A P, et al. AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset[J]. arXiv preprint, arXiv: 2311.15308, 2023
|
[113] |
董琳,黄丽清,叶锋,等. 人脸伪造检测泛化性方法综述[J]. 计算机科学,2022,49(2):12−30
Dong Lin, Huang Liqing, Ye Feng, et al. Survey on generalization methods of face forgery detection[J]. Computer Science, 2022, 49(2): 12−30 (in Chinese)
|
[114] |
Carlini N, Farid H. Evading deepfake-image detectors with white- and black-box attacks[C]//Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops. Piscataway, NJ: IEEE, 2020: 658−659
|