Citation: Zhang Chenhui, Yuan Zhi'an, Qian Yuhua. Dual-branch speech enhancement neural network with convolutional enhancement window attention[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202330751
Speech enhancement in complex environments and under sudden background noise remains highly challenging because existing methods capture spectrogram features only partially, particularly the local structure of the spectrogram. Prior Transformer-based work has focused mainly on the global information in the audio while neglecting the importance of local information, and many models use only the magnitude spectrum obtained from the short-time Fourier transform (STFT) while discarding the phase, which limits feature capture and yields unsatisfactory enhancement results. To address these issues, we propose a dual-branch speech enhancement neural network with convolutional enhancement window attention. The model adopts a U-Net architecture and models the magnitude and phase of the audio simultaneously through its two branches, with a complex computation module introduced for information interaction between them. A convolutional enhancement window attention module is employed in the skip connections between encoder and decoder layers; it performs self-attention within non-overlapping windows, capturing local contextual information while significantly reducing the computational complexity of the speech enhancement model. The proposed model is evaluated on the publicly available VoiceBank-DEMAND dataset. Compared with the baseline models DCUNET-16 and DCUNET-20, it improves the PESQ (perceptual evaluation of speech quality) metric by 0.51 and 0.47, respectively, and other evaluation metrics also show clear gains. Compared with a range of existing speech enhancement models, the proposed model performs better across metrics, with particularly notable improvements in PESQ scores.
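The central mechanism the abstract describes, self-attention restricted to non-overlapping windows over the spectrogram feature map, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name WindowAttention, the use of PyTorch's nn.MultiheadAttention, and the (batch, channel, time, frequency) tensor layout are choices made for the example, and the convolutional enhancement part of the paper's module is omitted here.

```python
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Hypothetical sketch: multi-head self-attention applied independently
    inside non-overlapping (w x w) windows of a spectrogram feature map,
    so cost grows with the window size rather than with the full T*F grid."""

    def __init__(self, dim: int, window_size: int, num_heads: int):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, F); T and F assumed divisible by window_size
        b, c, t, f = x.shape
        w = self.window_size
        # Partition the (T, F) plane into non-overlapping w x w windows
        x = x.reshape(b, c, t // w, w, f // w, w)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, c)  # (B*nW, w*w, C)
        # Self-attention sees only the tokens inside its own window
        out, _ = self.attn(x, x, x)
        # Reverse the window partition back to (B, C, T, F)
        out = out.reshape(b, t // w, f // w, w, w, c)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, t, f)
        return out


if __name__ == "__main__":
    layer = WindowAttention(dim=32, window_size=4, num_heads=4)
    feats = torch.randn(2, 32, 16, 16)  # (batch, channels, time, freq)
    print(layer(feats).shape)           # torch.Size([2, 32, 16, 16])
```

With window size w, attention is computed over T·F/w² windows of w² tokens each, so the cost scales roughly as O(T·F·w²) instead of the O((T·F)²) of full self-attention over the spectrogram, which is the complexity reduction the abstract refers to.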