A Review of Text-to-Visual Speech Synthesis

Wang Zhiming (王志明); Tao Jianhua (陶建华)

Abstract

Visual information plays an important role in understanding speech. Not only hearing-impaired people but also people with normal hearing lip-read to some degree during conversation, especially in noisy environments where the acoustic speech is degraded. Just as text-to-speech (TTS) synthesis lets a computer speak like a human, text-to-visual speech (TTVS) synthesis through computer facial animation lets a computer reproduce the bimodal nature of human speech, making the human-computer interface friendlier. This paper reviews the development and state of the art of text-to-visual speech synthesis. Text-driven visual speech synthesis methods fall into two classes: parameter control approaches and data-driven approaches. For the parameter control approach, three key problems are discussed: face model construction, definition of the animation control parameters, and the dynamic properties of the control parameters. For the data-driven approach, three main methods are introduced: video slice concatenation, key frame morphing, and face component combination. Finally, the advantages and disadvantages of the two approaches, and the settings each is suited to, are compared.
