Voice Conversion Spoofing Dataset and Model for Source Speaker Verification

张国富; 田天才; 苏兆品; 王垚飞; 段宇衡; 周康健; 罗永琪

doi:10.7544/issn1000-1239.202550673

Abstract: With the rapid development of deep-learning-based voice conversion (VC), attackers can alter a source speaker's speech signal to mimic the acoustic characteristics of a target speaker, which seriously threatens the security of automatic speaker verification (ASV) systems. Existing anti-spoofing countermeasures focus mainly on distinguishing genuine speech from spoofed speech and make little use of the source speaker identity information that remains in spoofed speech, so they fall short in practical scenarios such as forensic investigation and fraud tracing. Research on verifying and identifying the source speaker is still limited, and existing models have limited generalization ability. To address this, we construct VCSD-SSV (voice conversion spoofing dataset for source speaker verification), a new voice conversion spoofing dataset built specifically for source speaker verification. The dataset covers six VC algorithms (Seed-VC, FACodec, DDDM-VC, Diff-HierVC, TriAAN-VC, and Free-VC) and 400 speakers, for a total of 327,677 speech samples (46,811 genuine and 280,866 converted). On this basis, we propose FGE-SSV (source speaker verification based on feature generalised enhancement), a source speaker verification model that extracts multi-level source speaker timbre features through a feature enhancement module and cascaded instance-batch normalization squeeze-and-excitation blocks. Experimental results show that FGE-SSV maintains a consistently low in-domain equal error rate (EER) of 0.2% to 1.0% across all six VC models in VCSD-SSV and also keeps a relatively low EER on the six VC models in the SSTC2024 dataset. Furthermore, FGE-SSV generalizes well to unseen VC models, achieving a lower EER than the comparison models in every case.
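The results above are reported as EER (equal error rate), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how EER is commonly estimated from verification scores; it is not the authors' evaluation code. The function name `compute_eer`, the convention that a higher score means "same source speaker", and the toy scores are all assumptions made for illustration.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Estimate the equal error rate from two sets of trial scores.

    target_scores: scores of trials where both utterances share the same
        source speaker (should be accepted).
    nontarget_scores: scores of trials with different source speakers
        (should be rejected).
    Assumes higher score = more likely the same speaker (illustrative convention).
    Returns (eer, threshold).
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)

    # Candidate thresholds: every distinct observed score, in ascending order.
    thresholds = np.unique(np.concatenate([target, nontarget]))

    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

if __name__ == "__main__":
    # Toy scores, invented for this example only.
    eer, thr = compute_eer([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.75, 0.4])
    print(f"EER = {eer:.2%} at threshold {thr}")
```

With finitely many trials the two error curves rarely cross exactly, so this sketch averages FAR and FRR at the closest threshold; evaluation toolkits often interpolate along the ROC curve instead, which can give slightly different values.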
