Dual-Branch Speech Enhancement Neural Network with Convolutional Enhancement Window Attention

张晨辉; 原之安; 钱宇华

doi:10.7544/issn1000-1239.202330751

Abstract

Speech enhancement remains highly challenging in complex environments and under sudden background noise, mainly because existing methods fail to capture spectrogram features effectively, local information in particular. Earlier Transformer-based models focus on the global information of the audio and neglect local information. In addition, after the audio has been processed by the short-time Fourier transform (STFT), most models use only the magnitude and ignore the phase, which further limits how well they capture spectrogram features and degrades enhancement quality. To address these issues, we propose a dual-branch speech enhancement neural network with convolutional enhancement window attention. The model adopts a U-Net architecture and models the magnitude and phase of the audio simultaneously through its two branches. A complex-valued computation module is introduced between the two branches so that they can exchange information. A convolutional enhancement window attention module is placed in the skip connections between the encoder and decoder layers. This module performs self-attention within non-overlapping windows, which captures local contextual information while significantly reducing the computational complexity of the speech enhancement model. The model is evaluated on the public Voicebank-Demand dataset. Compared with the baseline models DCUNET16 and DCUNET20, it improves PESQ (perceptual evaluation of speech quality) by 0.51 and 0.47, respectively, and the other evaluation metrics also improve significantly. Compared with a range of existing speech enhancement models, the proposed model leads on every reported metric, with the most pronounced gain in PESQ.
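To make the complexity claim concrete, the following is a minimal NumPy sketch of self-attention restricted to non-overlapping windows of a time-frequency feature map. It is an illustration under stated assumptions, not the authors' module: the function names, the single head, the identity Q/K/V projections, and the (T, F, C) layout are all hypothetical, and the convolutional enhancement part is omitted. For a map with T x F positions, global attention needs (T·F)² pairwise scores, whereas attention inside w x w windows needs only T·F·w².

```python
import numpy as np

def window_partition(x, win):
    """Split a (T, F, C) feature map into non-overlapping win x win windows.
    Returns an array of shape (num_windows, win * win, C)."""
    T, F, C = x.shape
    assert T % win == 0 and F % win == 0, "T and F must be multiples of win"
    x = x.reshape(T // win, win, F // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-index axes together
    return x.reshape(-1, win * win, C)

def window_reverse(windows, win, T, F):
    """Inverse of window_partition: rebuild the (T, F, C) feature map."""
    C = windows.shape[-1]
    x = windows.reshape(T // win, F // win, win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(T, F, C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(x, win):
    """Single-head self-attention computed independently inside each window.
    Assumption for brevity: Q = K = V = x (no learned projections)."""
    T, F, C = x.shape
    w = window_partition(x, win)                      # (N, M, C), M = win * win
    scores = w @ w.transpose(0, 2, 1) / np.sqrt(C)    # (N, M, M): M*M scores per window
    out = softmax(scores) @ w                         # (N, M, C)
    return window_reverse(out, win, T, F)

if __name__ == "__main__":
    feats = np.random.default_rng(0).standard_normal((8, 16, 4))
    enhanced = window_self_attention(feats, win=4)
    print(enhanced.shape)
```

Because each window is processed independently, a change in one window of the input cannot affect the output of any other window, which is what makes the operation local.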
