Abstract:
With the rapid advancement of generative artificial intelligence models and deepfake technology, techniques for synthesizing talking face videos have become increasingly mature. Among them, audio-driven talking face video generation has attracted significant attention for its remarkably realistic and natural output. These methods use audio as the driving signal, often in combination with reference images or videos, to synthesize videos in which the target character's mouth movements are synchronized with the audio. Such technologies are now widely applied in fields such as virtual anchoring, game animation, and film and television production, and show broad prospects for further development. However, the potential negative impacts of this technology are also becoming apparent: improper or malicious use could lead to serious political and economic consequences. Against this background, research on detecting various types of forged facial videos has emerged; such research assesses the authenticity of a video mainly by examining either the veracity of individual frames or the spatio-temporal consistency of frame sequences. Firstly, this paper systematically reviews the classic algorithms and latest advances in audio-driven talking face video generation, organized chronologically and by the evolution of their underlying foundation models. Secondly, it comprehensively catalogs the datasets and evaluation metrics commonly used for this task and compares them across multiple dimensions. Subsequently, the paper analyzes and summarizes the forged facial video detection task, categorizing existing methods according to whether they operate on individual video frames or on multiple frames, and likewise summarizes the commonly used datasets and evaluation metrics. Finally, the paper outlines the open challenges and future directions in this research field, aiming to provide a valuable reference and support for subsequent related research.