• China Top-Quality Science and Technology Journal
  • Class A Chinese journal recommended by CCF
  • T1-class high-quality science and technology journal in the computing field
Zhang Chenhui, Yuan Zhi'an, Qian Yuhua. Dual-Branch Speech Enhancement Neural Network with Convolutional Enhancement Window Attention[J]. Journal of Computer Research and Development, 2025, 62(4): 852-862. DOI: 10.7544/issn1000-1239.202330751

Dual-Branch Speech Enhancement Neural Network with Convolutional Enhancement Window Attention

Funds: This work was supported by the Key Program of National Natural Science Foundation of China (62136005), the National Science and Technology Major Project (2021ZD0112400), and the Shanxi Provincial Science and Technology Major Special Plan "Unveiled" Project (202201020101006).
More Information
  • Author Bio:

Zhang Chenhui: born in 1999. Master candidate. Student member of CCF. His main research interest is speech enhancement.

    Yuan Zhi'an: born in 1998. PhD candidate. Student member of CCF. His main research interests include signal enhancement and machine learning.

    Qian Yuhua: born in 1976. PhD, professor, PhD supervisor. Member of CCF. His main research interests include artificial intelligence, big data, machine learning, and data mining.

  • Received Date: September 19, 2023
  • Revised Date: November 18, 2024
  • Accepted Date: January 07, 2025
  • Available Online: January 20, 2025
  • In complex environments and under sudden background noise, speech enhancement is extremely challenging because existing methods capture spectrogram features, especially local information, only to a limited extent. Previous Transformer-based models focused primarily on the global information of the audio while neglecting the importance of local information, and many models used only the magnitude of the short-time Fourier transform (STFT) while discarding the phase, so spectrogram features were captured suboptimally and enhancement results were unsatisfactory. To address these issues, we propose a dual-branch speech enhancement neural network with convolutional enhancement window attention. The model adopts a U-Net architecture and models the magnitude and phase of the audio simultaneously through its dual-branch structure, with a complex computation module enabling information interaction between the two branches. A convolutional enhancement window attention module is employed in the skip connections between the encoder and decoder layers; it performs self-attention within non-overlapping windows, which significantly reduces the computational complexity of the model while capturing local contextual information. The proposed model is evaluated on the publicly available VoiceBank-DEMAND dataset. Compared with the baseline models DCUNET-16 and DCUNET-20, it improves the PESQ (perceptual evaluation of speech quality) score by 0.51 and 0.47, respectively, and other evaluation metrics also improve markedly. Compared with various existing speech enhancement models, the proposed model outperforms them across metrics, with particularly notable gains in PESQ.
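The dual-branch design rests on the fact that every time-frequency bin of the STFT is a complex number that factors exactly into magnitude and phase. The following NumPy sketch (illustrative only, not the paper's code) shows the decomposition the two branches model and its lossless recombination:

```python
import numpy as np

# One windowed frame of audio -> complex spectrum via the real FFT.
rng = np.random.default_rng(0)
frame = np.hanning(256) * rng.standard_normal(256)
spectrum = np.fft.rfft(frame)

# The two quantities the branches model: magnitude and phase.
magnitude = np.abs(spectrum)
phase = np.angle(spectrum)

# Recombining them reproduces the complex spectrum exactly, which is
# why a magnitude-only model necessarily loses information.
reconstructed = magnitude * np.exp(1j * phase)
assert np.allclose(reconstructed, spectrum)
```

Because reconstruction is exact, any error a model leaves in the phase branch shows up directly in the re-synthesized waveform, which motivates estimating both quantities jointly.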
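The complexity saving from non-overlapping window attention can be sketched as follows. This is a bare NumPy illustration of Swin-style windowed self-attention (identity Q/K/V projections, no convolutional enhancement), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(x, window_size):
    """Self-attention computed independently inside non-overlapping
    windows of `window_size` frames.

    x: (T, C) array of T spectrogram frames with C channels;
    T must be a multiple of window_size (pad beforehand otherwise).
    Cost is O(T * window_size * C) rather than the O(T^2 * C) of
    full-sequence attention.
    """
    T, C = x.shape
    assert T % window_size == 0
    windows = x.reshape(T // window_size, window_size, C)
    out = np.empty_like(windows)
    for i, w in enumerate(windows):
        # Identity Q/K/V projections keep the sketch minimal.
        scores = w @ w.T / np.sqrt(C)          # (window, window) affinities
        out[i] = softmax(scores, axis=-1) @ w  # attend only within the window
    return out.reshape(T, C)
```

Since each window is processed in isolation, the quadratic attention term grows with the fixed window size instead of the sequence length; stacking such layers (or shifting the windows between layers, as in Swin Transformer) restores cross-window context.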

  • [1] Lim J, Oppenheim A. All-pole modeling of degraded speech[J]. IEEE Transactions on Acoustics, Speech and Signal Processing, 1978, 26(3): 197−210 doi: 10.1109/TASSP.1978.1163086
    [2] Boll S. Suppression of acoustic noise in speech using spectral subtraction[J]. IEEE Transactions on Acoustics, Speech and Signal Processing, 1979, 27(2): 113−120 doi: 10.1109/TASSP.1979.1163209
    [3] Ephraim Y, Van Trees H. A signal subspace approach for speech enhancement[J]. IEEE Transactions on Speech and Audio Processing, 1995, 3(4): 251−266 doi: 10.1109/89.397090
    [4] Shi Wenhua, Ni Yongjing, Zhang Xiongwei, et al. Deep neural network based monaural speech enhancement with sparse nonnegative matrix factorization[J]. Journal of Computer Research and Development, 2018, 55(11): 2430−2438 (in Chinese) doi: 10.7544/issn1000-1239.2018.20170580
    [5] Ali M N, Brutti A, Falavigna D. Speech enhancement using dilated Wave-U-Net: An experimental analysis[C]//Proc of the 27th Conf of Open Innovations Association (FRUCT). Piscataway, NJ: IEEE, 2020: 3−9
    [6] Zhang Q, Nicolson A, Wang M, et al. DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1404−1415
    [7] Park S R, Lee J. A fully convolutional neural network for speech enhancement[J]. arXiv preprint, arXiv: 1609.07132, 2016
    [8] Pandey A, Wang Deliang. Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain[C]//Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 6629−6633
    [9] Pandey A, Wang Deliang. Dual-path self-attention RNN for real-time speech enhancement[J]. arXiv preprint, arXiv: 2010.12713, 2020
    [10] Ye Moujia, Wan Hongjie. Improved transformer-based dual-path network with amplitude and complex domain feature fusion for speech enhancement[J]. Entropy, 2023, 25(2): 228
    [11] Yu Guochen, Li Andong, Zheng Chengshi, et al. Dual-branch attention-in-attention transformer for single-channel speech enhancement[C]//Proc of the 47th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2022: 7847−7851
    [12] Lee J, Kang H G. Real-time neural speech enhancement based on temporal refinement network and channel-wise gating methods[J]. Digital Signal Processing, 2023, 133: 103879
    [13] Kong Z, Ping W, Dantrey A, et al. Speech denoising in the waveform domain with self-attention[C]//Proc of the 47th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2022: 7867−7871
    [14] Baby D, Verhulst S. SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty[C]//Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2019: 106−110
    [15] Hao Xiang, Su Xiangdong, Horaud R, et al. FullSubNet: A full-band and sub-band fusion model for real-time single-channel speech enhancement[C]//Proc of the 46th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2021: 6633−6637
    [16] Tan K, Wang Deliang. A convolutional recurrent neural network for real-time speech enhancement[C]//Proc of INTERSPEECH 2018. Grenoble, France: ISCA, 2018: 3229−3233
    [17] Kim J, El-Khamy M, Lee J. T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement[C]//Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 6649−6653
    [18] Pandey A, Wang Deliang. Dense CNN with self-attention for time-domain speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 1270−1279
    [19] Erdogan H, Hershey J R, Watanabe S, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks[C]//Proc of the 40th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2015: 708−712
    [20] Williamson D S, Wang Yuxuan, Wang Deliang. Complex ratio masking for monaural speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 24(3): 483−492
    [21] Choi H S, Kim J H, Huh J, et al. Phase-aware speech enhancement with deep complex U-Net[J]. arXiv preprint, arXiv: 1903.03107, 2019
    [22] Macartney C, Weyde T. Improved speech enhancement with the Wave-U-Net[J]. arXiv preprint, arXiv: 1811.11307, 2018
    [23] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint, arXiv: 2005.08100, 2020
    [24] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 31st Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 5998−6008
    [25] Liu Ze, Lin Yutong, Cao Yue, et al. Swin Transformer: Hierarchical vision transformer using shifted windows[C]//Proc of the IEEE/CVF Int Conf on Computer Vision (ICCV). Piscataway, NJ: IEEE, 2021: 10012−10022
    [26] Koizumi Y, Harada N, Haneda Y. Trainable adaptive window switching for speech enhancement[C]//Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2019: 616−620
    [27] Parvathala V, Andhavarapu S, Pamisetty G, et al. Neural comb filtering using sliding window attention network for speech enhancement[J]. Circuits, Systems, and Signal Processing, 2023, 42(1): 322−343 doi: 10.1007/s00034-022-02123-2
    [28] Liang Xinyan, Qian Yuhua, Guo Qian, et al. AF: An association-based fusion method for multi-modal classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 44(12): 9236−9254
    [29] Hu Yanxin, Liu Yun, Lv Shubo, et al. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement[J]. arXiv preprint, arXiv: 2008.00264, 2020
    [30] Valentini-Botinhao C, Wang Xin, Takaki S, et al. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech[C]//Proc of the 9th ISCA Speech Synthesis Workshop. Grenoble, France: ISCA, 2016: 146−152
    [31] Peer T, Gerkmann T. Phase-aware deep speech enhancement: It's all about the frame length[J]. JASA Express Letters, 2022, 2(10): 104802
    [32] Wang Zhendong, Cun Xiaodong, Bao Jianmin, et al. Uformer: A general U-shaped transformer for image restoration[C]//Proc of the IEEE/CVF Conf on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: IEEE, 2022: 17683−17693
    [33] Luo Yi, Chen Zhuo, Yoshioka T. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation[C]//Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 46−50
    [34] Thiemann J, Ito N, Vincent E. The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings[J]. Proceedings of Meetings on Acoustics, 2013, 19(1): 035081
    [35] Loizou P C. Speech Enhancement: Theory and Practice[M]. Boca Raton, FL: CRC Press, 2013
    [36] Hu Yi, Loizou P C. Evaluation of objective quality measures for speech enhancement[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 16(1): 229−238
    [37] Pascual S, Bonafonte A, Serra J. SEGAN: Speech enhancement generative adversarial network[J]. arXiv preprint, arXiv: 1703.09452, 2017
    [38] Hsieh Tsun-An, Wang Hsin-Min, Lu Xugang, et al. WaveCRN: An efficient convolutional recurrent neural network for end-to-end speech enhancement[J]. IEEE Signal Processing Letters, 2020, 27: 2149−2153
    [39] Wang Kai, He Bengbeng, Zhu Weiping. TSTNN: Two-stage transformer based neural network for speech enhancement in the time domain[C]//Proc of the 46th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2021: 7098−7102
    [40] Wang Ning, Ma Sihan, Li Jingyuan, et al. Multistage attention network for image inpainting[J]. Pattern Recognition, 2020, 106: 107448
    [41] Fu S W, Yu Cheng, Hsieh T A, et al. MetricGAN+: An improved version of MetricGAN for speech enhancement[J]. arXiv preprint, arXiv: 2104.03538, 2021
    [42] Huang Xiangdong, Chen Honghong, Gan Lin. Speech enhancement method based on frequency-time dilated dense network[J]. Journal of Computer Research and Development, 2023, 60(5): 1628−1638 (in Chinese)
    [43] Yu Guochen, Li Andong, Zheng Chengshi, et al. Dual-branch attention-in-attention transformer for single-channel speech enhancement[C]//Proc of the 47th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2022: 7847−7851
    [44] Yu Guochen, Li Andong, Wang Hui, et al. DBT-Net: Dual-branch federative magnitude and phase estimation with attention-in-attention transformer for monaural speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 2629−2644
