• China Top-Quality Science and Technology Journal
  • Class A Chinese journal recommended by CCF
  • T1-class high-quality science and technology journal in the computing field
Zhang Chenhui, Yuan Zhi'an, Qian Yuhua. Dual-Branch Speech Enhancement Neural Network with Convolutional Enhancement Window Attention[J]. Journal of Computer Research and Development, 2025, 62(4): 852-862. DOI: 10.7544/issn1000-1239.202330751

Dual-Branch Speech Enhancement Neural Network with Convolutional Enhancement Window Attention

Funds: This work was supported by the Key Program of National Natural Science Foundation of China (62136005), the National Science and Technology Major Project (2021ZD0112400), and the Shanxi Provincial Science and Technology Major Special Plan "Unveiled" Project (202201020101006).
More Information
  • Author Bio:

Zhang Chenhui: born in 1999. Master candidate. Student member of CCF. His main research interest is speech enhancement.

    Yuan Zhi'an: born in 1998. PhD candidate. Student member of CCF. His main research interests include signal enhancement and machine learning.

    Qian Yuhua: born in 1976. PhD, professor, PhD supervisor. Member of CCF. His main research interests include artificial intelligence, big data, machine learning, and data mining.

  • Received Date: September 19, 2023
  • Revised Date: November 18, 2024
  • Accepted Date: January 07, 2025
  • Available Online: January 20, 2025
  • In complex environments and under sudden background noise, speech enhancement is extremely challenging because existing methods capture spectrogram features, especially local information, only to a limited extent. Previous Transformer-based models focused primarily on the global information of the audio while neglecting the importance of local information, and many models used only the magnitude of the short-time Fourier transform (STFT) while discarding the phase, so spectrogram features were captured suboptimally and enhancement results were unsatisfactory. To address these issues, we propose a dual-branch speech enhancement neural network with convolutional enhancement window attention. The model adopts a U-Net architecture and models the magnitude and phase of the audio simultaneously through its dual-branch structure, with a complex computation module enabling information interaction between the two branches. A convolutional enhancement window attention module is employed in the skip connections between the encoder and decoder layers; it performs self-attention within non-overlapping windows, which significantly reduces the computational complexity of the model while capturing local contextual information. The proposed model is evaluated on the publicly available VoiceBank-DEMAND dataset. Compared with the baseline models DCUNET-16 and DCUNET-20, it improves the PESQ (perceptual evaluation of speech quality) score by 0.51 and 0.47, respectively, and other evaluation metrics also improve markedly. Compared with various existing speech enhancement models, the proposed model outperforms them across metrics, with particularly notable gains in PESQ.
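The dual-branch design rests on the fact that every time-frequency bin of the STFT is a complex number that factors exactly into magnitude and phase. The following NumPy sketch (illustrative only, not the paper's code) shows the decomposition the two branches model and its lossless recombination:

```python
import numpy as np

# One windowed frame of audio -> complex spectrum via the real FFT.
rng = np.random.default_rng(0)
frame = np.hanning(256) * rng.standard_normal(256)
spectrum = np.fft.rfft(frame)

# The two quantities the branches model: magnitude and phase.
magnitude = np.abs(spectrum)
phase = np.angle(spectrum)

# Recombining them reproduces the complex spectrum exactly, which is
# why a magnitude-only model necessarily loses information.
reconstructed = magnitude * np.exp(1j * phase)
assert np.allclose(reconstructed, spectrum)
```

Because reconstruction is exact, any error a model leaves in the phase branch shows up directly in the re-synthesized waveform, which motivates estimating both quantities jointly.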
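The complexity saving from non-overlapping window attention can be sketched as follows. This is a bare NumPy illustration of Swin-style windowed self-attention (identity Q/K/V projections, no convolutional enhancement), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(x, window_size):
    """Self-attention computed independently inside non-overlapping
    windows of `window_size` frames.

    x: (T, C) array of T spectrogram frames with C channels;
    T must be a multiple of window_size (pad beforehand otherwise).
    Cost is O(T * window_size * C) rather than the O(T^2 * C) of
    full-sequence attention.
    """
    T, C = x.shape
    assert T % window_size == 0
    windows = x.reshape(T // window_size, window_size, C)
    out = np.empty_like(windows)
    for i, w in enumerate(windows):
        # Identity Q/K/V projections keep the sketch minimal.
        scores = w @ w.T / np.sqrt(C)          # (window, window) affinities
        out[i] = softmax(scores, axis=-1) @ w  # attend only within the window
    return out.reshape(T, C)
```

Since each window is processed in isolation, the quadratic attention term grows with the fixed window size instead of the sequence length; stacking such layers (or shifting the windows between layers, as in Swin Transformer) restores cross-window context.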

  • [1] Lim J, Oppenheim A. All-pole modeling of degraded speech[J]. IEEE Transactions on Acoustics, Speech and Signal Processing, 1978, 26(3): 197−210 doi: 10.1109/TASSP.1978.1163086
    [2] Boll S. Suppression of acoustic noise in speech using spectral subtraction[J]. IEEE Transactions on Acoustics, Speech and Signal Processing, 1979, 27(2): 113−120 doi: 10.1109/TASSP.1979.1163209
    [3] Ephraim Y, Van Trees H. A signal subspace approach for speech enhancement[J]. IEEE Transactions on Speech and Audio Processing, 1995, 3(4): 251−266 doi: 10.1109/89.397090
    [4] Shi Wenhua, Ni Yongjing, Zhang Xiongwei, et al. Deep neural network based monaural speech enhancement with sparse nonnegative matrix factorization[J]. Journal of Computer Research and Development, 2018, 55(11): 2430−2438 (in Chinese) doi: 10.7544/issn1000-1239.2018.20170580
    [5] Ali M N, Brutti A, Falavigna D. Speech enhancement using dilated Wave-U-Net: An experimental analysis[C]//Proc of the 27th Conf of Open Innovations Association (FRUCT). Piscataway, NJ: IEEE, 2020: 3−9
    [6] Zhang Q, Nicolson A, Wang M, et al. DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1404−1415
    [7] Park S R, Lee J. A fully convolutional neural network for speech enhancement[J]. arXiv preprint, arXiv: 1609.07132, 2016
    [8] Pandey A, Wang Deliang. Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain[C]//Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 6629−6633
    [9] Pandey A, Wang Deliang. Dual-path self-attention RNN for real-time speech enhancement[J]. arXiv preprint, arXiv: 2010.12713, 2020
    [10] Ye Moujia, Wan Hongjie. Improved transformer-based dual-path network with amplitude and complex domain feature fusion for speech enhancement[J]. Entropy, 2023, 25(2): 228
    [11] Yu Guochen, Li Andong, Zheng Chengshi, et al. Dual-branch attention-in-attention transformer for single-channel speech enhancement[C]//Proc of the 47th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2022: 7847−7851
    [12] Lee J, Kang H G. Real-time neural speech enhancement based on temporal refinement network and channel-wise gating methods[J]. Digital Signal Processing, 2023, 133: 103879
    [13] Kong Z, Ping W, Dantrey A, et al. Speech denoising in the waveform domain with self-attention[C]//Proc of the 47th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2022: 7867−7871
    [14] Baby D, Verhulst S. SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty[C]//Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2019: 106−110
    [15] Hao Xiang, Su Xiangdong, Horaud R, et al. FullSubNet: A full-band and sub-band fusion model for real-time single-channel speech enhancement[C]//Proc of the 46th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2021: 6633−6637
    [16] Tan K, Wang Deliang. A convolutional recurrent neural network for real-time speech enhancement[C]//Proc of INTERSPEECH 2018. Grenoble, France: ISCA, 2018: 3229−3233
    [17] Kim J, El-Khamy M, Lee J. T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement[C]//Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 6649−6653
    [18] Pandey A, Wang Deliang. Dense CNN with self-attention for time-domain speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 1270−1279
    [19] Erdogan H, Hershey J R, Watanabe S, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks[C]//Proc of the 40th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2015: 708−712
    [20] Williamson D S, Wang Yuxuan, Wang Deliang. Complex ratio masking for monaural speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 24(3): 483−492
    [21] Choi H S, Kim J H, Huh J, et al. Phase-aware speech enhancement with deep complex U-Net[J]. arXiv preprint, arXiv: 1903.03107, 2019
    [22] Macartney C, Weyde T. Improved speech enhancement with the Wave-U-Net[J]. arXiv preprint, arXiv: 1811.11307, 2018
    [23] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint, arXiv: 2005.08100, 2020
    [24] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 31st Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 5998−6008
    [25] Liu Ze, Lin Yutong, Cao Yue, et al. Swin Transformer: Hierarchical vision transformer using shifted windows[C]//Proc of the IEEE/CVF Int Conf on Computer Vision (ICCV). Piscataway, NJ: IEEE, 2021: 10012−10022
    [26] Koizumi Y, Harada N, Haneda Y. Trainable adaptive window switching for speech enhancement[C]//Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2019: 616−620
    [27] Parvathala V, Andhavarapu S, Pamisetty G, et al. Neural comb filtering using sliding window attention network for speech enhancement[J]. Circuits, Systems, and Signal Processing, 2023, 42(1): 322−343 doi: 10.1007/s00034-022-02123-2
    [28] Liang Xinyan, Qian Yuhua, Guo Qian, et al. AF: An association-based fusion method for multi-modal classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 44(12): 9236−9254
    [29] Hu Yanxin, Liu Yun, Lv Shubo, et al. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement[J]. arXiv preprint, arXiv: 2008.00264, 2020
    [30] Valentini-Botinhao C, Wang Xin, Takaki S, et al. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech[C]//Proc of the 9th ISCA Speech Synthesis Workshop. Grenoble, France: ISCA, 2016: 146−152
    [31] Peer T, Gerkmann T. Phase-aware deep speech enhancement: It's all about the frame length[J]. JASA Express Letters, 2022, 2(10): 104802
    [32] Wang Zhendong, Cun Xiaodong, Bao Jianmin, et al. Uformer: A general U-shaped transformer for image restoration[C]//Proc of the IEEE/CVF Conf on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: IEEE, 2022: 17683−17693
    [33] Luo Yi, Chen Zhuo, Yoshioka T. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation[C]//Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 46−50
    [34] Thiemann J, Ito N, Vincent E. The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings[J]. Proceedings of Meetings on Acoustics, 2013, 19(1): 035081
    [35] Loizou P C. Speech Enhancement: Theory and Practice[M]. Boca Raton, FL: CRC Press, 2013
    [36] Hu Yi, Loizou P C. Evaluation of objective quality measures for speech enhancement[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 16(1): 229−238
    [37] Pascual S, Bonafonte A, Serra J. SEGAN: Speech enhancement generative adversarial network[J]. arXiv preprint, arXiv: 1703.09452, 2017
    [38] Hsieh Tsun-An, Wang Hsin-Min, Lu Xugang, et al. WaveCRN: An efficient convolutional recurrent neural network for end-to-end speech enhancement[J]. IEEE Signal Processing Letters, 2020, 27: 2149−2153
    [39] Wang Kai, He Bengbeng, Zhu Weiping. TSTNN: Two-stage transformer based neural network for speech enhancement in the time domain[C]//Proc of the 46th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2021: 7098−7102
    [40] Wang Ning, Ma Sihan, Li Jingyuan, et al. Multistage attention network for image inpainting[J]. Pattern Recognition, 2020, 106: 107448
    [41] Fu S W, Yu Cheng, Hsieh T A, et al. MetricGAN+: An improved version of MetricGAN for speech enhancement[J]. arXiv preprint, arXiv: 2104.03538, 2021
    [42] Huang Xiangdong, Chen Honghong, Gan Lin. Speech enhancement method based on frequency-time dilated dense network[J]. Journal of Computer Research and Development, 2023, 60(5): 1628−1638 (in Chinese)
    [43] Yu Guochen, Li Andong, Zheng Chengshi, et al. Dual-branch attention-in-attention transformer for single-channel speech enhancement[C]//Proc of the 47th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2022: 7847−7851
    [44] Yu Guochen, Li Andong, Wang Hui, et al. DBT-Net: Dual-branch federative magnitude and phase estimation with attention-in-attention transformer for monaural speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 2629−2644
