Citation: Zhang Chenhui, Yuan Zhi'an, Qian Yuhua. Dual-branch speech enhancement neural network with convolutional enhancement window attention[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202330751
Speech enhancement in complex environments and under sudden background noise remains highly challenging because existing methods capture spectrogram features only partially, particularly the local structure of the spectrogram. Prior Transformer-based work has focused mainly on the global information in the audio while neglecting the importance of local information, and many models use only the magnitude spectrum obtained from the short-time Fourier transform (STFT) while discarding the phase, which limits feature capture and yields unsatisfactory enhancement results. To address these issues, we propose a dual-branch speech enhancement neural network with convolutional enhancement window attention. The model adopts a U-Net architecture and models the magnitude and phase of the audio simultaneously through its two branches, with a complex computation module introduced for information interaction between them. A convolutional enhancement window attention module is employed in the skip connections between encoder and decoder layers; it performs self-attention within non-overlapping windows, capturing local contextual information while significantly reducing the computational complexity of the speech enhancement model. The proposed model is evaluated on the publicly available VoiceBank-DEMAND dataset. Compared with the baseline models DCUNET-16 and DCUNET-20, it improves the PESQ (perceptual evaluation of speech quality) metric by 0.51 and 0.47, respectively, and other evaluation metrics also show clear gains. Compared with a range of existing speech enhancement models, the proposed model performs better across metrics, with particularly notable improvements in PESQ scores.
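The central mechanism the abstract describes, self-attention restricted to non-overlapping windows over the spectrogram feature map, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name WindowAttention, the use of PyTorch's nn.MultiheadAttention, and the (batch, channel, time, frequency) tensor layout are choices made for the example, and the convolutional enhancement part of the paper's module is omitted here.

```python
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Hypothetical sketch: multi-head self-attention applied independently
    inside non-overlapping (w x w) windows of a spectrogram feature map,
    so cost grows with the window size rather than with the full T*F grid."""

    def __init__(self, dim: int, window_size: int, num_heads: int):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, F); T and F assumed divisible by window_size
        b, c, t, f = x.shape
        w = self.window_size
        # Partition the (T, F) plane into non-overlapping w x w windows
        x = x.reshape(b, c, t // w, w, f // w, w)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, c)  # (B*nW, w*w, C)
        # Self-attention sees only the tokens inside its own window
        out, _ = self.attn(x, x, x)
        # Reverse the window partition back to (B, C, T, F)
        out = out.reshape(b, t // w, f // w, w, w, c)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, t, f)
        return out


if __name__ == "__main__":
    layer = WindowAttention(dim=32, window_size=4, num_heads=4)
    feats = torch.randn(2, 32, 16, 16)  # (batch, channels, time, freq)
    print(layer(feats).shape)           # torch.Size([2, 32, 16, 16])
```

With window size w, attention is computed over T·F/w² windows of w² tokens each, so the cost scales roughly as O(T·F·w²) instead of the O((T·F)²) of full self-attention over the spectrogram, which is the complexity reduction the abstract refers to.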