Citation: Zhou Junzuo, Yi Jiangyan, Tao Jianhua, Ren Yong, Wang Tao. Mel Spectrogram and Squeeze-Excitation-Weighted Quantization for Neural Speech Codec[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440329

Mel Spectrogram and Squeeze-Excitation-Weighted Quantization for Neural Speech Codec

Funds: This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences (XDB0500103) and the National Natural Science Foundation of China (62322120, U21B2010, 62306316, 62206278).
More Information
  • Author Bio:

    Zhou Junzuo: born in 2000. Master candidate. His main research interests include text-to-speech

    Yi Jiangyan: born in 1984. PhD, Master supervisor. Her main research directions include speech information processing, speech generation and identification, and continual learning

    Tao Jianhua: born in 1972. PhD, PhD supervisor. His main research directions include intelligent information fusion and processing, speech processing, affective computing, and big data analysis

    Ren Yong: born in 1998. PhD candidate. His main research interests include text-to-speech

    Wang Tao: born in 1996. PhD. His main research interests include text-to-speech

  • Received Date: May 20, 2024
  • Revised Date: March 26, 2025
  • Accepted Date: April 03, 2025
  • Available Online: April 03, 2025
  • Abstract: End-to-end neural speech codecs, represented by SoundStream, have demonstrated outstanding reconstructed speech quality. However, these methods require extensive convolutional computation, which leads to long encoding times. To address this issue, this paper introduces a neural speech codec based on the Mel spectrogram and squeeze-excitation-weighted quantization, aiming to maintain high perceptual speech quality while reducing computational cost and increasing processing speed, thereby minimizing latency. Specifically, the method takes Mel spectrogram features as input, exploits the temporal compression already performed during Mel spectrogram extraction, and pairs it with a convolutional encoder of fewer layers to simplify computation. In addition, inspired by squeeze-and-excitation networks, excitation weights are extracted for each dimension of the encoder's final-layer output features; these weights serve as per-dimension weighting coefficients when computing codebook distances in the quantizer, enabling the model to learn correlations among feature dimensions and improving quantization performance. Experimental results on the LibriTTS and VCTK datasets show that the method significantly accelerates the encoder and improves reconstructed speech quality at low bit rates (≤3 kbps). For instance, at 1.5 kbps, the real-time factor (RTF) of encoding improves by up to a factor of 4.6. In terms of perceptual quality, at 0.75 kbps, objective metrics such as short-time objective intelligibility (STOI) and virtual speech quality objective listener (ViSQOL) improve by an average of 8.72% over the baseline. Ablation studies further show that the benefit of squeeze-excitation weighting is inversely correlated with bit rate, and that, compared with the periodic Snake activation function, the ReLU activation significantly speeds up processing while maintaining comparable perceptual speech quality.
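
  • To make the core mechanism concrete, below is a minimal PyTorch sketch of the squeeze-excitation-weighted codebook search described in the abstract, together with the Snake activation the ablation compares against ReLU. This is an illustrative sketch under assumed names (snake, SEWeightedQuantizer, reduction), not the authors' implementation; codebook training (e.g., commitment losses or EMA updates) and residual quantization stages are omitted.

    import torch
    import torch.nn as nn

    def snake(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        # Periodic Snake activation (as used in BigVGAN [39]):
        # x + sin^2(alpha * x) / alpha. Its per-element sine is costlier
        # than ReLU's max(0, x), which is the cost the ablation trades away.
        return x + torch.sin(alpha * x) ** 2 / alpha

    class SEWeightedQuantizer(nn.Module):
        """Vector quantizer whose codebook distances are weighted per feature
        dimension by squeeze-and-excitation (SE) gates computed from the
        encoder's final-layer output (hypothetical sketch)."""

        def __init__(self, dim: int, codebook_size: int, reduction: int = 4):
            super().__init__()
            self.codebook = nn.Embedding(codebook_size, dim)
            # SE branch: squeeze over time, then excite each feature dimension.
            self.se = nn.Sequential(
                nn.Linear(dim, dim // reduction),
                nn.ReLU(),
                nn.Linear(dim // reduction, dim),
                nn.Sigmoid(),
            )

        def forward(self, z: torch.Tensor):
            # z: (batch, time, dim) features, e.g. from a shallow convolutional
            # encoder over Mel spectrogram frames.
            w = self.se(z.mean(dim=1, keepdim=True))       # (batch, 1, dim) gates
            diff = z.unsqueeze(2) - self.codebook.weight   # (batch, time, K, dim)
            # Excitation-weighted squared distance to each codebook entry.
            dist = (w.unsqueeze(2) * diff.pow(2)).sum(-1)  # (batch, time, K)
            idx = dist.argmin(dim=-1)                      # nearest code per frame
            return self.codebook(idx), idx

    # Example: quantize 100 frames of 128-dim features with a 1024-entry codebook.
    vq = SEWeightedQuantizer(dim=128, codebook_size=1024)
    z = torch.randn(1, 100, 128)
    z_q, codes = vq(z)  # z_q: (1, 100, 128), codes: (1, 100)

    In a plain VQ-VAE quantizer [26] the distance would be the unweighted ‖z − c‖², i.e., w fixed to all ones; the SE gates let informative feature dimensions dominate the nearest-neighbor search at negligible extra cost, consistent with the larger gains the abstract reports at lower bit rates.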

  • [1]
    De Andrade J F, De Campos M L R, Apolinario J A. Speech privacy for modern mobile communication systems[C]//Proc of the 33rd IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2008: 1777−1780
    [2]
    Haneche H, Ouahabi A, Boudraa B. Compressed sensing-speech coding scheme for mobile communications[J]. Circuits, Systems, and Signal Processing, 2021, 40(10): 5106−5126 doi: 10.1007/s00034-021-01712-x
    [3]
    Budagavi M, Gibson J D. Speech coding in mobile radio communications[J]. Proceedings of the IEEE, 1998, 86(7): 1402−1412 doi: 10.1109/5.681370
    [4]
    Bessette B, Salami R, Lefebvre R, et al. The adaptive multirate wideband speech codec (AMR-WB)[J]. IEEE Transactions on Speech and Audio Processing, 2002, 10(8): 620−636 doi: 10.1109/TSA.2002.804299
    [5]
    Cox R V, Kroon P. Low bit-rate speech coders for multimedia communication[J]. IEEE Communications Magazine, 1996, 34(12): 34−41 doi: 10.1109/35.556484
    [6]
    Huang Yongfeng, Liu Chenghao, Tang Shanyu, et al. Steganography integration into a low-bit rate speech codec[J]. IEEE Transactions on Information Forensics and Security, 2012, 7(6): 1865−1875 doi: 10.1109/TIFS.2012.2218599
    [7]
    Valin J M, Terriberry T B, Montgomery C, et al. A high-quality speech and audio codec with less than 10-ms delay[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2009, 18(1): 58−67
    [8]
    Hicsonmez S, Sencar H T, Avcibas I. Audio codec identification from coded and transcoded audios[J]. Digital Signal Processing, 2013, 23(5): 1720−1730 doi: 10.1016/j.dsp.2013.04.005
    [9]
    Dietz M, Multrus M, Eksler V, et al. Overview of the EVS codec architecture[C]//Proc of the 40th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2015: 5698−5702
    [10]
    Valin J M, Vos K, Terriberry T. Definition of the Opus audio codec[EB/OL]. 2012−09[2024-12-26]. https://datatracker.ietf.org/doc/html/rfc6716
    [11]
    Zeghidour N, Luebs A, Omran A, et al. SoundStream: An end-to-end neural audio codec[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 30: 495−507
    [12]
    Biswas A, Jia D. Audio codec enhancement with generative adversarial networks[C]//Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2020: 356−360
    [13]
    Stimberg F, Narest A, Bazzica A, et al. WaveNetEQ—Packet loss concealment with WaveRNN[C]//Proc of the 54th Asilomar Conference on Signals, Systems, and Computers. Piscataway, NJ: IEEE, 2020: 672−676
    [14]
    Xiao Wei, Liu Wenzhe, Wang Meng, et al. Multi-mode neural speech coding based on deep generative networks[C]//Proc of the 24th Annual Conf of the Int Speech Communication Association. Grenoble, France: ISCA, 2023: 819−823
    [15]
    Wu Yi-Chiao, Gebru I D, Marković D, et al. AudioDec: An open-source streaming high-fidelity neural audio codec[C]//Proc of the 48th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2023: 1−5
    [16]
    Jiang Xue, Peng Xiulian, Zhang Yuan, et al. Disentangled feature learning for real-time neural speech coding[C/OL]//Proc of the 48th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2023[2025-02-05]. https://ieeexplore.ieee.org/document/10094723
    [17]
    Petermann D, Jang I, Kim M. Native multi-band audio coding within hyper-autoencoded reconstruction propagation networks[C/OL]//Proc of the 48th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2023[2025-02-05]. https://ieeexplore.ieee.org/document/10094593
    [18]
    Lim H, Lee J, Kim B H, et al. End-to-end neural audio coding in the MDCT domain[C/OL]//Proc of the 48th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2023[2025-02-05]. https://ieeexplore.ieee.org/document/10096243
    [19]
    Kleijn W B, Lim F S C, Luebs A, et al. Wavenet based low rate speech coding[C]//Proc of the 43rd IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2018: 676−680
    [20]
    Jang Inseon, Yang Haici, Lim W, et al. Personalized neural speech codec[C]//Proc of the 49th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2024: 991−995
    [21]
    Du Zhihao, Zhang Shiliang, Hu Kai, et al. FunCodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec[C]//Proc of the 49th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2024: 591−595
    [22]
    Gârbacea C, Oord A, Li Yazhe, et al. Low bit-rate speech coding with VQ-VAE and a WaveNet decoder[C]//Proc of the 44th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2019: 735−739
    [23]
    Kleijn W B, Storus A, Chinen M, et al. Generative speech coding with predictive variance regularization[C]//Proc of the 45th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2021: 6478−6482
    [24]
    Oord A, Dieleman S, Zen Heiga, et al. WaveNet: A generative model for raw audio[J]. arXiv preprint, arXiv: 1609.03499, 2016
    [25]
    Kankanahalli S. End-to-end optimized speech coding with deep neural networks[C]//Proc of the 43rd IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2018: 2521−2525
    [26]
    Oord A, Vinyals O. Neural discrete representation learning[C/OL]//Proc of the 31st Annual Conf on Neural Information Processing Systems (NIPS). Cambridge, MA: MIT Press, 2017[2025-02-05]. https://dl.acm.org/doi/10.5555/3295222.3295378
    [27]
    Défossez A, Copet J, Synnaeve G, et al. High fidelity neural audio compression[J]. arXiv preprint, arXiv: 2210.13438, 2022
    [28]
    Ratnarajah A, Zhang Shi-Xiong, Yu Dong. M3-AUDIODEC: Multi-channel multi-speaker multi-spatial audio codec[J]. arXiv preprint, arXiv: 2309.07416, 2023
    [29]
    Yang Dongchao, Liu Songxiang, Huang Rongjie, et al. HiFi-Codec: Group-residual vector quantization for high fidelity audio codec[J]. arXiv preprint, arXiv: 2305.02765, 2023
    [30]
    O’Shaughnessy D. Speech Communications: Human and Machine[M]. Piscataway, NJ: IEEE, 1999
    [31]
    Davis S, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1980, 28(4): 357−366 doi: 10.1109/TASSP.1980.1163420
    [32]
    Hasanabadi M R. MFCC-GAN codec: A new AI-based audio coding[J]. arXiv preprint, arXiv: 2310.14300, 2023
    [33]
    Hu Jie, Shen Li, Sun Gang. Squeeze-and-excitation networks[C]//Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: IEEE, 2018: 7132−7141
    [34]
    Zen Heiga, Dang V, Clark R, et al. LibriTTS: A corpus derived from LibriSpeech for text-to-speech[J]. arXiv preprint, arXiv: 1904.02882, 2019
    [35]
    Liu Zhaoyu, Mak B. Cross-lingual multi-speaker text-to-speech synthesis for voice cloning without using parallel corpus for unseen speakers[J]. arXiv preprint, arXiv: 1911.11601, 2019
    [36]
    Taal C H, Hendriks R C, Heusdens R, et al. A short-time objective intelligibility measure for time-frequency weighted noisy speech[C]//Proc of the 35th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2010: 4214−4217
    [37]
    Hines A, Skoglund J, Kokaram A, et al. ViSQOL v3: An open source production ready objective speech and audio metric[J]. arXiv preprint, arXiv: 2004.09584, 2020
    [38]
    Kumar R, Seetharaman P, Luebs A, et al. High-fidelity audio compression with improved RVQGAN[C/OL]//Proc of the 38th Annual Conf on Neural Information Processing Systems (NIPS). Cambridge, MA: MIT Press, 2024[2025-02-05]. https://openreview.net/forum?id=qjnl1QUnFA
    [39]
    Lee S, Ping W, Ginsburg B, et al. BigVGAN: A universal neural vocoder with large-scale training[J]. arXiv preprint, arXiv: 2206.04658, 2022
    [40]
    Dietz M, Multrus M, Eksler V, et al. Overview of the EVS codec architecture[C]//Proc of the 40th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2015: 5698−5702
    [41]
    Gao Weiwei, Shan Mingtao, Song Nan, et al. Detection of microaneurysms in fundus images based on improved YOLOv4 with SENet embedded[J]. Journal of Biomedical Engineering, 2022, 39(4): 713−720 (in Chinese)
    [42]
    Chen Qiang, Liu Li, Han Rui, et al. Image identification method on high speed railway contact network based on YOLO v3 and SENet[C]//Proc of the 38th Chinese Control Conf (CCC). Piscataway, NJ: IEEE, 2019: 8772−8777
    [43]
    Wang Chenglong, Yi Jiangyan, Tao Jianhua, et al. Global and temporal-frequency attention based network in audio deepfake detection[J]. Journal of Computer Research and Development, 2021, 58(7): 1466−1475 (in Chinese) doi: 10.7544/issn1000-1239.2021.20200799
    [44]
    He Kaiming, Zhang Xiangyu, Ren Shaoqing, et al. Deep residual learning for image recognition[C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: IEEE, 2016: 770−778
    [45]
    Kong J, Kim J, Bae J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis[C]//Proc of the 34th Annual Conf on Neural Information Processing Systems (NIPS). Cambridge, MA: MIT Press, 2020: 17022−17033
    [46]
    Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[C/OL]//Proc of the 28th Annual Conf on Neural Information Processing Systems (NIPS). Cambridge, MA: MIT Press, 2014[2025-02-05]. https://www.researchgate.net/publication/263012109_Generative_Adversarial_Networks
    [47]
    Jassim W A, Skoglund J, Chinen M, et al. WARP-Q: Quality prediction for generative neural speech codecs[C]//Proc of the 46th IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ: IEEE, 2021: 401−405
