Abstract:
At present, end-to-end neural speech codecs, represented by SoundStream, have demonstrated outstanding reconstructed speech quality. However, these methods require extensive convolutional computation, leading to long encoding times. To address this issue, we introduce a neural speech codec based on the Mel spectrogram and squeeze-excitation-weighted quantization. The method aims to maintain high perceptual speech quality while reducing computational cost and increasing processing speed, thereby minimizing latency. Specifically, we use Mel spectrogram features as input, exploit the temporal compression inherent in Mel spectrogram extraction, and pair it with a convolutional encoder with fewer layers to simplify computation. Additionally, inspired by squeeze-and-excitation networks, we extract excitation weights for each dimension of the output features from the encoder's final layer. These weights serve as per-dimension weighting coefficients when computing codebook distances in the quantizer, enabling the model to learn correlations between feature dimensions and improving quantization performance. Experimental results on the LibriTTS and VCTK datasets indicate that this method significantly increases the computational speed of the encoder and improves reconstructed speech quality at lower bitrates (≤3 kbps). For instance, at a bitrate of 1.5 kbps, the real-time factor (RTF) of the encoding computation increases by up to 4.6 times. Regarding perceptual quality, at a bitrate of 0.75 kbps, objective metrics such as short-time objective intelligibility (STOI) and virtual speech quality objective listener (ViSQOL) show an average improvement of 8.72% over the baseline. Furthermore, ablation studies not only demonstrate that the benefit of the squeeze-excitation weighting is inversely correlated with bitrate, but also reveal that, compared with the periodic activation function Snake, the ReLU activation function significantly speeds up processing while maintaining comparable perceptual speech quality.
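To make the core idea concrete, the following is a minimal PyTorch-style sketch of squeeze-excitation-weighted quantization as described above: per-dimension excitation weights are derived from the encoder's final-layer output and used to weight the codebook distance in the vector quantizer. The class name SEWeightedVQ, the reduction factor, and all layer sizes are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class SEWeightedVQ(nn.Module):
    """Sketch: vector quantizer with squeeze-excitation-weighted distances."""

    def __init__(self, codebook_size: int, dim: int, reduction: int = 4):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)  # codebook vectors
        # Squeeze-and-excitation block: squeeze over time, excite per feature dim.
        self.se = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, time, dim) -- encoder output features.
        w = self.se(z.mean(dim=1))        # (batch, dim) per-dimension weights
        w = w.unsqueeze(1)                # broadcast over time
        # Weighted squared distance between each frame and each codebook entry:
        # d(z_t, c_k) = sum_j w_j * (z_{t,j} - c_{k,j})^2
        diff = z.unsqueeze(2) - self.codebook.weight        # (batch, time, K, dim)
        dist = (w.unsqueeze(2) * diff.pow(2)).sum(dim=-1)   # (batch, time, K)
        indices = dist.argmin(dim=-1)                        # nearest codebook entry
        z_q = self.codebook(indices)                         # quantized features
        # Straight-through estimator so gradients still reach the encoder.
        return z + (z_q - z).detach()
```

This is only one plausible realization of the weighted-distance idea; the paper's actual encoder architecture, weight extraction, and training losses may differ.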