Abstract:
End-to-end neural speech codecs, exemplified by SoundStream, have demonstrated outstanding reconstructed speech quality, but their extensive convolutional computation leads to long encoding times. To address this issue, this paper introduces a neural speech codec based on Mel-spectrogram input and squeeze-and-excitation (SE) weighted quantization, which aims to preserve high perceptual speech quality while reducing computational cost and increasing processing speed, thereby minimizing latency. Specifically, the method takes Mel-spectrogram features as input, exploits the temporal compression inherent in Mel-spectrogram extraction, and pairs it with a convolutional encoder with fewer layers to simplify computation. In addition, inspired by squeeze-and-excitation networks, excitation weights are extracted for each dimension of the output features of the encoder's final layer; these weights serve as per-dimension weighting coefficients when computing codebook distances in the quantizer, enabling the model to learn correlations between features and improving quantization performance. Experimental results on the LibriTTS and VCTK datasets show that the method substantially accelerates encoder computation and improves reconstructed speech quality at low bitrates (≤ 3 kbps). For instance, at 1.5 kbps, encoding speed as measured by the real-time factor (RTF) improves by up to a factor of 4.6. Regarding perceptual quality, at 0.75 kbps, objective metrics such as Short-Time Objective Intelligibility (STOI) and Virtual Speech Quality Objective Listener (ViSQOL) improve by an average of 8.72% over the baseline. Ablation studies further demonstrate that the benefit of the SE weighting method is inversely correlated with bitrate, and that, compared with the periodic Snake activation function, the ReLU activation function offers a significant speedup while maintaining comparable perceptual speech quality.
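To make the SE-weighted quantization described above concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the tensor shapes, mean-pooling squeeze, bottleneck gating MLP, and the `SEWeightedQuantizer` name are all assumptions for illustration. It shows per-dimension excitation weights modulating the squared distance used for codebook lookup.

```python
import torch
import torch.nn as nn

class SEWeightedQuantizer(nn.Module):
    """Sketch (hypothetical): vector quantizer whose codebook distance is
    weighted per feature dimension by squeeze-and-excitation (SE) gates."""

    def __init__(self, dim: int, codebook_size: int, reduction: int = 4):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)  # (K, D) code vectors
        # SE branch: squeeze (global average) -> excite (bottleneck MLP + sigmoid)
        self.excite = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: encoder output, shape (batch, time, dim)
        w = self.excite(z.mean(dim=1))                 # (batch, dim), one gate per dimension
        diff = z.unsqueeze(2) - self.codebook.weight   # (batch, time, K, dim)
        # Weighted squared distance: gates emphasize more informative dimensions
        dist = (w[:, None, None, :] * diff.pow(2)).sum(-1)  # (batch, time, K)
        idx = dist.argmin(dim=-1)                      # nearest-code indices
        return self.codebook(idx)                      # quantized features
```

In a complete codec this would be combined with the usual VQ training machinery (e.g. straight-through gradients and residual quantization stages); the sketch only isolates the weighted-distance idea.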