Abstract:
Speech enhancement in complex environments with sudden background noise is extremely challenging because existing methods capture spectrogram features only to a limited extent, particularly the local information of the spectrogram. Previous Transformer-based models focus primarily on the global information of the audio while neglecting the importance of local information. Moreover, many models use only the magnitude information and discard the phase information after the Short-Time Fourier Transform (STFT), which leads to suboptimal capture of spectrogram features and unsatisfactory enhancement results. To address these issues, this paper proposes a dual-branch speech enhancement neural network with convolutional enhancement window attention. The model adopts a U-Net architecture and simultaneously models the magnitude and phase information of the audio through its dual-branch structure, and a complex computation module is introduced for information interaction between the two branches. The convolutional enhancement window attention module is employed in the skip connections between the encoder and decoder layers; it performs self-attention within non-overlapping windows, which significantly reduces the computational complexity of the speech enhancement model while capturing local contextual information. The proposed model is evaluated on the publicly available Voicebank-Demand dataset. Compared with the baseline models DCUNET16 and DCUNET20, it improves the perceptual evaluation of speech quality (PESQ) score by 0.51 and 0.47, respectively, and other evaluation metrics also show significant gains. Compared with a variety of existing speech enhancement models, the proposed model outperforms them across metrics, with particularly notable improvements in PESQ.
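The following is a minimal, illustrative sketch (not the authors' implementation) of the two ideas the abstract highlights: retaining both magnitude and phase after the STFT for a dual-branch model, and restricting self-attention to non-overlapping spectrogram windows so that cost scales with the window size rather than the full time-frequency plane. All shapes, the window size, and the helper/module names are assumptions for illustration.

```python
import torch
import torch.nn as nn


def stft_mag_phase(wave, n_fft=512, hop=128):
    """Hypothetical helper: return magnitude and phase so neither branch's input is discarded."""
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs(), spec.angle()  # magnitude branch input, phase branch input


class WindowSelfAttention(nn.Module):
    """Self-attention computed independently inside non-overlapping win x win patches."""

    def __init__(self, dim, win=8, heads=4):
        super().__init__()
        self.win = win
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, C, F, T) encoder feature map
        B, C, F, T = x.shape
        w = self.win
        # Partition into non-overlapping windows: (B * num_windows, w*w, C)
        x = x.unfold(2, w, w).unfold(3, w, w)              # (B, C, F//w, T//w, w, w)
        x = x.permute(0, 2, 3, 4, 5, 1).reshape(-1, w * w, C)
        out, _ = self.attn(x, x, x)                        # attention only within each window
        # Merge windows back to the original (B, C, F, T) layout
        out = out.reshape(B, F // w, T // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(B, C, F, T)


if __name__ == "__main__":
    wave = torch.randn(1, 16000)                # 1 s of 16 kHz audio (toy input)
    mag, phase = stft_mag_phase(wave)           # inputs for the two branches
    feats = torch.randn(1, 32, 64, 64)          # toy encoder features at a skip connection
    print(WindowSelfAttention(dim=32)(feats).shape)  # torch.Size([1, 32, 64, 64])
```

Because attention is confined to each window, the per-layer cost is linear in the number of windows instead of quadratic in the full number of time-frequency bins, which is the complexity reduction the abstract refers to.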