Joint Acoustic Modeling of Multi-Features Based on Deep Neural Networks
Abstract
To exploit the complementary information among different acoustic features and the relatedness of their acoustic model training, a joint acoustic modeling method for multiple features based on deep neural networks (DNNs) is proposed. In this method, similar to DNN multimodal and multitask learning, part of the DNN hidden layers is shared to associate the acoustic models built from the different features. Training the acoustic models jointly exposes the common hidden explanatory factors underlying the different learning tasks, which allows knowledge to be transferred across them. Moreover, the number of model parameters is reduced by low-rank matrix factorization, which shortens the training time. Finally, the recognition results obtained with the different acoustic features are combined using the Recognizer Output Voting Error Reduction (ROVER) algorithm to further improve performance. Experimental results for continuous speech recognition on the TIMIT corpus show that the joint acoustic modeling method outperforms modeling each feature independently. In terms of phone error rate (PER), the ROVER-combined result based on the joint acoustic models yields a relative gain of 4.6% over that based on the independent acoustic models.
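As a rough illustration of the architecture the abstract describes, the following minimal sketch (not the authors' implementation) shows two feature-specific DNN acoustic models that share a block of hidden layers and use a low-rank factorization of their output layers. The feature names (mfcc, fbank), layer sizes, rank, number of senone states, and the choice of which layers are shared are all assumptions made only for illustration.

```python
# Minimal sketch of joint multi-feature acoustic modeling with shared hidden
# layers and low-rank output factorization. All dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

D_IN, D_HID, RANK, N_STATES = 440, 1024, 256, 1947  # assumed sizes

# Feature-specific input projections (one per acoustic feature type).
W_in = {f: rng.standard_normal((D_IN, D_HID)) * 0.01 for f in ("mfcc", "fbank")}

# Hidden layers shared by both acoustic models (the "joint" part).
W_shared = [rng.standard_normal((D_HID, D_HID)) * 0.01 for _ in range(4)]

# Low-rank factorization of each feature-specific output layer:
# a (D_HID x N_STATES) matrix is replaced by (D_HID x RANK)(RANK x N_STATES),
# which cuts the parameter count when RANK << min(D_HID, N_STATES).
W_out = {f: (rng.standard_normal((D_HID, RANK)) * 0.01,
             rng.standard_normal((RANK, N_STATES)) * 0.01)
         for f in ("mfcc", "fbank")}

def forward(x, feature):
    """State posteriors for a batch of frames of the given feature type."""
    h = relu(x @ W_in[feature])
    for W in W_shared:          # common hidden explanatory factors
        h = relu(h @ W)
    U, V = W_out[feature]
    return softmax(h @ U @ V)   # factorized output layer

post_mfcc = forward(rng.standard_normal((8, D_IN)), "mfcc")
post_fbank = forward(rng.standard_normal((8, D_IN)), "fbank")
print(post_mfcc.shape, post_fbank.shape)  # (8, 1947) each
```

During joint training, gradients from both feature streams update the shared block, while the input projections and factorized output layers remain feature-specific; at test time the two systems decode separately and their hypotheses are combined with ROVER.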