Abstract:
This paper studies the use of dynamic Bayesian networks (DBNs) for text-prompted audio-visual bimodal speaker identification. The task is to determine a speaker's identity from a temporal sequence of audio and visual observations, obtained from the acoustic speech and the shape of the mouth, respectively. Following the hierarchical structure of audio-visual bimodal modeling, a new DBN is constructed to describe the natural asynchrony between the audio and visual states as well as their conditional dependency over time. Experimental results show that the dynamic Bayesian network is a powerful and flexible methodology for representing and modeling audio-visual correlations, and that the proposed DBN improves the accuracy of audio-only speaker identification at all acoustic signal-to-noise ratios (SNRs) from 0 to 30 dB.