Abstract:
With the rapid advancement of generative artificial intelligence models and deepfake technology, techniques for synthesizing talking face videos have become increasingly mature. Among them, audio-driven talking face video generation has attracted significant attention for its remarkably realistic and natural output. These methods use audio as the driving signal, often in combination with reference images or videos, to synthesize videos in which the target character's mouth movements are synchronized with the audio. Such technologies are now widely applied in fields such as virtual anchoring, game animation, and film and television production, and show broad prospects for further development. However, the potential negative impacts of this technology are also becoming apparent: improper or malicious use could lead to serious political and economic consequences. Against this background, research on detecting various types of forged facial videos has emerged; such research assesses the authenticity of a video mainly by examining either the veracity of individual frames or the spatio-temporal consistency of frame sequences. Firstly, this paper systematically reviews the classic algorithms and latest advances in audio-driven talking face video generation, organized chronologically and by the evolution of their underlying foundation models. Secondly, it comprehensively catalogs the datasets and evaluation metrics commonly used for this task and compares them across multiple dimensions. Subsequently, the paper analyzes and summarizes the forged facial video detection task, categorizing existing methods according to whether they operate on individual video frames or on multiple frames, and likewise summarizes the commonly used datasets and evaluation metrics. Finally, the paper outlines the open challenges and future directions in this research field, aiming to provide a valuable reference and support for subsequent related research.