Abstract:
Monaural speech enhancement aims to recover clean speech from complex noisy environments, thereby improving the quality of noise-corrupted voice signals. This problem has been studied for decades. In recent years, convolutional encoder-decoder neural networks have been widely used for speech enhancement. Convolutional models capture the strong temporal correlations of speech and can extract important voiceprint features. However, two challenges remain. First, the skip connections widely used in recent state-of-the-art methods introduce noise components when transmitting feature information, which inevitably degrades denoising performance. Second, the standard fixed-shape convolution kernels in common use are inefficient at handling diverse voiceprints because of their limited receptive field. To address these concerns, we propose CADNet, a novel end-to-end encoder-decoder network that incorporates a cross-dimensional collaborative attention mechanism and deformable convolution modules. Specifically, we insert cross-dimensional collaborative attention blocks into the skip connections to strengthen control over the voice information they transmit. In addition, we introduce a deformable convolution layer after each standard convolution layer to better match the natural characteristics of voiceprints. Experiments conducted on the open TIMIT corpus verify the effectiveness of the proposed architecture in terms of objective intelligibility and quality metrics.