Abstract:
Speech enhancement in noisy environments is an important research direction in speech signal processing: it improves the quality of voice and video calls and the performance of human-computer interaction and speech recognition. We propose a network based on dilated convolution and dense connections that improves the feature representation ability of the network by learning context information along both the frequency and time directions of the speech spectrogram. Specifically, the proposed structure integrates dilated convolution into the basic time-processing and frequency-processing units, which ensures a sufficiently large receptive field in both the frequency and time directions for extracting deep speech features. In addition, dense connections are applied to the cascade of these two basic units, which avoids the information loss caused by cascading multiple processing modules and thereby improves the efficiency of feature utilization. Experimental results show that the proposed speech enhancement network achieves high scores in PESQ, STOI, and a series of subjective mean opinion scores, showing overall superiority over existing speech enhancement networks. The generalization ability of the network to a variety of noisy conditions is also evaluated in these experiments.
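
The two design ideas named in the abstract can be illustrated with a short calculation. The sketch below is not the authors' implementation. It assumes stride-1 convolutions, and the kernel size, dilation rates, and channel counts are hypothetical values chosen only for illustration. The helper names `receptive_field` and `dense_input_channels` are likewise invented here. The first function shows how dilation enlarges the receptive field along one axis (time frames or frequency bins) without adding layers; the second shows how dense connections pass the block input and all earlier outputs to each unit, so that no intermediate feature is discarded by the cascade.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions along one axis.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


def dense_input_channels(in_channels, growth, num_units):
    """Input channel count of each unit in a densely connected cascade.

    Unit i receives the block input concatenated with the outputs of all
    i earlier units, each assumed to emit `growth` channels.
    """
    return [in_channels + i * growth for i in range(num_units)]


# Hypothetical settings, not taken from the paper.
plain = receptive_field(3, [1, 1, 1, 1])    # four layers, no dilation
dilated = receptive_field(3, [1, 2, 4, 8])  # four layers, doubling dilation
channels = dense_input_channels(16, 16, 4)

print(plain, dilated, channels)  # 9 31 [16, 32, 48, 64]
```

Under these assumed settings, four dilated layers cover 31 positions where four ordinary layers cover 9, at the same parameter count per layer. The growing channel list reflects the cost of dense connections: later units see every earlier feature map, which preserves information at the price of wider inputs.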