    Mel Spectrogram and Squeeze-and-Excitation-Weighted Quantization for a Neural Speech Codec

    • Abstract: End-to-end neural speech codecs such as SoundStream achieve excellent perceptual quality in reconstructed speech. However, they rely on extensive convolutional computation, which makes encoding slow. To mitigate this problem, this paper proposes a neural speech codec based on Mel spectrogram input and squeeze-and-excitation-weighted quantization. The method aims to preserve high perceptual quality while lowering computational cost and increasing running speed, thereby reducing latency. Specifically, Mel spectrogram features are used as input: because Mel spectrogram extraction already compresses the signal along the time axis, it can be paired with a shallow convolutional encoder to simplify computation. In addition, drawing on the idea of squeeze-and-excitation networks, excitation weights are extracted for each dimension of the output features of the encoder's final layer. These weights serve as per-dimension coefficients on the compressed features when the quantizer computes codebook distances, which allows the model to learn correlations between features and improves quantization performance. Experimental results on the LibriTTS and VCTK datasets show that the method significantly increases the computational speed of the encoder and improves reconstructed speech quality at low bit rates (≤3 kbps). For example, at 1.5 kbps the real-time factor (RTF) of encoding improves by up to 4.6 times. For perceptual quality, at 0.75 kbps objective metrics such as Short-Time Objective Intelligibility (STOI) and Virtual Speech Quality Objective Listener (ViSQOL) improve by 8.72% on average over the baseline. Ablation studies further show that the benefit of squeeze-and-excitation weighting is inversely correlated with bit rate, and that the ReLU activation function runs substantially faster than the periodic Snake activation function while delivering comparable perceptual quality.
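The weighted codebook search described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: the gate averages features over time and passes them through a ReLU bottleneck and a sigmoid (the usual squeeze-and-excitation pattern), and the quantizer uses a weighted squared Euclidean distance. The names `se_weights`, `weighted_nearest`, `w1`, and `w2` are hypothetical and do not come from the paper.

```python
import math


def se_weights(frames, w1, w2):
    """Squeeze-and-excitation style gate: one weight in (0, 1) per dimension.

    frames: list of T feature vectors of length D (assumed encoder output).
    w1: H x D matrix, w2: D x H matrix (hypothetical learned parameters).
    """
    dim = len(frames[0])
    # Squeeze: average each feature dimension over time.
    squeezed = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def weighted_nearest(vec, codebook, weights):
    """Index of the codeword minimizing sum_d weights[d] * (vec[d] - c[d])**2."""
    def dist(code):
        return sum(w * (v - c) ** 2 for w, v, c in zip(weights, vec, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


# Toy example: the same vector maps to different codewords
# depending on which dimension the weights emphasize.
codebook = [[0.0, 1.0], [1.0, 0.0]]
vec = [0.3, 0.4]
print(weighted_nearest(vec, codebook, [1.0, 1.0]))  # 0 (unweighted distance)
print(weighted_nearest(vec, codebook, [0.1, 0.9]))  # 1 (second dimension dominates)
```

In this toy case the unweighted search picks the first codeword, while weights that emphasize the second dimension flip the choice to the second codeword. This is the kind of effect per-dimension weighting can have on quantization, though how the paper trains and applies its weights may differ.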

       
