Citation: Zhang Chenhui, Yuan Zhi'an, Qian Yuhua. Dual-Branch Speech Enhancement Neural Network with Convolutional Enhancement Window Attention[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202330751

Dual-Branch Speech Enhancement Neural Network with Convolutional Enhancement Window Attention

Funds: This work was supported by the Key Program of National Natural Science Foundation of China (62136005), the National Science and Technology Major Project (2021ZD0112400), and the Shanxi Provincial Science and Technology Major Special Plan "Unveiled" Project (202201020101006).
  • Author Bio:

    Zhang Chenhui: born in 1999. Master candidate. Student member of CCF. His main research interest is speech enhancement

    Yuan Zhi'an: born in 1998. PhD candidate. Student member of CCF. His main research interests include signal enhancement and machine learning

    Qian Yuhua: born in 1976. PhD, professor, PhD supervisor. Member of CCF. His main research interests include artificial intelligence, big data, machine learning, and data mining

  • Received Date: September 19, 2023
  • Revised Date: November 18, 2024
  • Accepted Date: January 07, 2025
  • Available Online: January 20, 2025
  • Abstract: In complex environments and under sudden background noise, speech enhancement is extremely challenging because existing methods capture spectrogram features only to a limited extent, especially the local information of the spectrogram. Previous Transformer-based work focused mainly on the global information of the audio while neglecting the local information, and many models used only the magnitude of the short-time Fourier transform (STFT) while discarding the phase, so spectrogram features were captured suboptimally and enhancement results were unsatisfactory. To address these issues, we propose a dual-branch speech enhancement neural network with convolutional enhancement window attention. The model adopts a U-Net architecture and models the magnitude and phase of the audio simultaneously through a dual-branch structure, with a complex computation module introduced for information interaction between the two branches. A convolutional enhancement window attention module is employed in the skip connections between encoder and decoder layers: it performs self-attention within non-overlapping windows, which captures local contextual information while significantly reducing the model's computational complexity (a minimal illustrative sketch follows this abstract). Evaluated on the public VoiceBank-DEMAND dataset, the proposed model improves the PESQ (perceptual evaluation of speech quality) metric by 0.51 and 0.47 over the baselines DCUNet-16 and DCUNet-20, respectively, shows clear gains on the other evaluation metrics, and outperforms a range of existing speech enhancement models across metrics, most notably in PESQ.
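
The window-attention idea described in the abstract can be illustrated with a short PyTorch sketch: it partitions the time axis of a feature map into non-overlapping windows, runs multi-head self-attention inside each window, and adds a depthwise-convolution branch for extra local context. This is a minimal sketch under stated assumptions: the class name, default window size, and the exact placement of the convolutional branch are illustrative, not the authors' published module.

    import torch
    import torch.nn as nn

    class ConvEnhancedWindowAttention(nn.Module):
        # Hypothetical sketch of window-based self-attention with a
        # convolutional branch for local context. Input/output shape:
        # (batch, time, dim).
        def __init__(self, dim, window_size=8, num_heads=4):
            super().__init__()
            self.window_size = window_size
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Depthwise conv over time supplies the "convolutional enhancement".
            self.local_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

        def forward(self, x):
            b, t, d = x.shape
            w = self.window_size
            pad = (w - t % w) % w          # pad time axis to a multiple of w
            if pad:
                x = nn.functional.pad(x, (0, 0, 0, pad))
            tw = x.shape[1]
            # Partition into non-overlapping windows: (b * tw/w, w, d).
            windows = x.reshape(b, tw // w, w, d).reshape(-1, w, d)
            h = self.norm(windows)
            attn_out, _ = self.attn(h, h, h)   # attention restricted to each window
            windows = windows + attn_out
            # Merge windows back to (b, tw, d).
            x = windows.reshape(b, tw // w, w, d).reshape(b, tw, d)
            # Convolutional branch (Conv1d expects channels first).
            x = x + self.local_conv(x.transpose(1, 2)).transpose(1, 2)
            return x[:, :t]                # drop the padding

    # Example: a 100-frame feature map with 64 channels.
    x = torch.randn(2, 100, 64)
    y = ConvEnhancedWindowAttention(dim=64, window_size=8, num_heads=4)(x)
    print(y.shape)                         # torch.Size([2, 100, 64])

Because attention is confined to windows of length w, the per-layer cost drops from O(t²) to O(t·w), which is the complexity reduction the abstract refers to.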

  • [1]
    Lim J, Oppenheim A. All-pole modeling of degraded speech[J]. IEEE Transactions on Acoustics, Speech and Signal Processing, 1978, 26(3): 197−210 doi: 10.1109/TASSP.1978.1163086
    [2]
    Boll S. Suppression of acoustic noise in speech using spectral subtraction[J]. IEEE Transactions on Acoustics, Speech and Signal Processing, 1979, 27(2): 113−120 doi: 10.1109/TASSP.1979.1163209
    [3]
    Ephraim Y, Van Trees H. A signal subspace approach for speech enhancement[J]. IEEE Transactions on Speech and Audio Processing, 1995, 3(4): 251−266 doi: 10.1109/89.397090
    [4]
    Shi Wenhua, Ni Yongjing, Zhang Xiongwei, et al. Deep neural network based monaural speech enhancement with sparse nonnegative matrix factorization[J]. Journal of Computer Research and Development, 2018, 55(11): 2430−2438 (in Chinese) doi: 10.7544/issn1000-1239.2018.20170580
    [5]
    Ali M N, Brutti A, Falavigna D. Speech enhancement using dilated Wave-U-Net: An experimental analysis[C]//Proc of the 27th Conf of Open Innovations Association (FRUCT). Piscataway, NJ: IEEE, 2020: 3−9
    [6]
    Zhang Q, Nicolson A, Wang M, et al. DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1404−1415
    [7]
    Park S R, Lee J. A fully convolutional neural network for speech enhancement[J]. arXiv preprint, arXiv: 1609.07132, 2016
    [8]
    Pandey A, Wang Deliang. Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain[C]//Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 6629−6633
    [9]
    Pandey A, Wang Deliang. Dual-path self-attention RNN for real-time speech enhancement[J]. arXiv preprint, arXiv: 2010.12713, 2020
    [10]
    Ye Moujia, Wan Hongjie. Improved transformer-based dual-path network with amplitude and complex domain feature fusion for speech enhancement[J]. Entropy, 2023, 25(2): 228
    [11]
    Yu Guochen, Li Andong, Zheng Chengshi, et al. Dual-branch attention-in-attention transformer for single-channel speech enhancement[C]//Proc of the 47th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2022: 7847−7851
    [12]
    Lee J, Kang H G. Real-time neural speech enhancement based on temporal refinement network and channel-wise gating methods[J]. Digital Signal Processing, 2023, 133: 103879
    [13]
    Kong Z, Ping W, Dantrey A, et al. Speech denoising in the waveform domain with self-attention[C]//Proc of the 47th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2022: 7867−7871
    [14]
    Baby D, Verhulst S. SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty[C]//Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2019: 106−110
    [15]
    Hao Xiang, Su Xiangdong, Horaud R, et al. FullSubNet: A full-band and sub-band fusion model for real-time single-channel speech enhancement[C]//Proc of the 46th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2021: 6633−6637
    [16]
    Tan K, Wang Deliang. A convolutional recurrent neural network for real-time speech enhancement[C]//Proc of INTERSPEECH 2018. Grenoble, France: ISCA, 2018: 3229−3233
    [17]
    Kim J, El-Khamy M, Lee J. T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement[C]//Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 6649−6653
    [18]
    Pandey A, Wang Deliang. Dense CNN with self-attention for time-domain speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 1270−1279
    [19]
    Erdogan H, Hershey J R, Watanabe S, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks[C]//Proc of the 40th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2015: 708−712
    [20]
    Williamson D S, Wang Yuxuan, Wang Deliang. Complex ratio masking for monaural speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(3): 483−492
    [21]
    Choi H S, Kim J H, Huh J, et al. Phase-aware speech enhancement with deep complex U-Net[J]. arXiv preprint, arXiv: 1903.03107, 2019
    [22]
    Macartney C, Weyde T. Improved speech enhancement with the Wave-U-Net[J]. arXiv preprint, arXiv: 1811.11307, 2018
    [23]
    Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint, arXiv: 2005.08100, 2020
    [24]
    Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 31st Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 5998−6008
    [25]
    Liu Ze, Lin Yutong, Cao Yue, et al. Swin Transformer: Hierarchical vision transformer using shifted windows[C]//Proc of the 18th IEEE/CVF Int Conf on Computer Vision (ICCV). Piscataway, NJ: IEEE, 2021: 10012−10022
    [26]
    Koizumi Y, Harada N, Haneda Y. Trainable adaptive window switching for speech enhancement[C]//Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2019: 616−620
    [27]
    Parvathala V, Andhavarapu S, Pamisetty G, et al. Neural comb filtering using sliding window attention network for speech enhancement[J]. Circuits, Systems, and Signal Processing, 2023, 42(1): 322−343 doi: 10.1007/s00034-022-02123-2
    [28]
    Liang Xinyan, Qian Yuhua, Guo Qian, et al. AF: An association-based fusion method for multi-modal classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(12): 9236−9254
    [29]
    Hu Yanxin, Liu Yun, Lv Shubo, et al. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement[J]. arXiv preprint, arXiv: 2008.00264, 2020
    [30]
    Valentini-Botinhao C, Wang Xin, Takaki S, et al. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech[C]//Proc of the 9th ISCA Speech Synthesis Workshop. Grenoble, France: ISCA, 2016: 146−152
    [31]
    Peer T, Gerkmann T. Phase-aware deep speech enhancement: It’s all about the frame length[J]. JASA Express Letters, 2022, 2(10): 104802
    [32]
    Wang Zhendong, Cun Xiaodong, Bao Jianming, et al. Uformer: A general U-shaped transformer for image restoration[C]//Proc of the 41st IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 17683−17693
    [33]
    Luo Yi, Chen Zhuo, Yoshioka T. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation[C]//Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 46−50
    [34]
    Thiemann J, Ito N, Vincent E. The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings[J]. Proceedings of Meetings on Acoustics, 2013, 19(1): 035081
    [35]
    Loizou P C. Speech Enhancement: Theory and Practice[M]. Boca Raton: CRC Press, 2013
    [36]
    Hu Yi, Loizou P C. Evaluation of objective quality measures for speech enhancement[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 16(1): 229−238
    [37]
    Pascual S, Bonafonte A, Serra J. SEGAN: Speech enhancement generative adversarial network[J]. arXiv preprint, arXiv: 1703.09452, 2017
    [38]
    Hsieh Tsun-An, Wang Hsin-Min, Lu Xugang, et al. WaveCRN: An efficient convolutional recurrent neural network for end-to-end speech enhancement[J]. IEEE Signal Processing Letters, 2020, 27: 2149−2153
    [39]
    Wang Kai, He Bengbeng, Zhu Weiping. TSTNN: Two-stage transformer based neural network for speech enhancement in the time domain[C]//Proc of the 46th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2021: 7098−7102
    [40]
    Wang Ning, Ma Sihan, Li Jingyuan, et al. Multistage attention network for image inpainting[J]. Pattern Recognition, 2020, 106: 107448
    [41]
    Fu S W, Yu Cheng, Hsieh T A, et al. MetricGAN+: An improved version of metricGAN for speech enhancement[J]. arXiv preprint, arXiv: 2104.03538, 2021
    [42]
    Huang Xiangdong, Chen Honghong, Gan Lin. Speech enhancement method based on frequency-time dilated dense network[J]. Journal of Computer Research and Development, 2023, 60(5): 1628−1638 (in Chinese)
    [43]
    Yu Guochen, Li Andong, Zheng Chengshi, et al. Dual-branch attention-in-attention transformer for single-channel speech enhancement[C]//Proc of the 47th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2022: 7847−7851
    [44]
    Yu Guochen, Li Andong, Wang Hui, et al. DBT-Net: Dual-branch federative magnitude and phase estimation with attention-in-attention transformer for monaural speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 2629−2644
