
    Voice Conversion Spoofing Dataset and Model for Source Speaker Verification

    • Abstract: With the rapid development of deep learning-based voice conversion (VC), attackers can alter a source speaker's speech signal so that it mimics the acoustic characteristics of a target speaker, which seriously threatens the security of automatic speaker verification (ASV) systems. Existing anti-spoofing countermeasures focus mainly on distinguishing genuine speech from spoofed speech and make little use of the source speaker identity information that remains hidden in spoofed speech. As a result, they fall short of practical needs in scenarios such as forensic investigation and fraud tracing. Research on verifying and identifying the source speaker is still limited, and existing methods generalize poorly. To address this, this paper constructs a new voice conversion spoofing dataset designed specifically for source speaker verification, named VCSD-SSV (Voice Conversion Spoofing Dataset for Source Speaker Verification). The dataset covers six VC algorithms (Seed-VC, FACodec, DDDM-VC, Diff-HierVC, TriAAN-VC, and Free-VC) and 400 speakers, for a total of 327,677 speech samples (46,811 genuine and 280,866 converted). Building on this dataset, we propose FGE-SSV (Source Speaker Verification Based on Feature Generalised Enhancement), a source speaker verification model based on feature generalization enhancement. The model extracts multi-level timbre features of the source speaker through a feature enhancement module and cascaded instance-batch normalization squeeze-and-excitation blocks. Experimental results show that, in in-domain evaluation, FGE-SSV maintains a consistently low equal error rate (EER) of 0.2% to 1.0% across all six VC methods in VCSD-SSV, and that it also achieves a relatively low EER on the six VC methods in the SSTC2024 dataset. Furthermore, FGE-SSV generalizes well to unseen voice conversion methods, with EER values lower than those of the compared methods.
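
    The abstract reports all results as EER. As a reading aid, the sketch below shows one common way to estimate EER from verification scores; it is not the authors' evaluation code. It assumes (our assumption, not stated in the abstract) that each trial pairs a converted utterance with an enrollment utterance of a candidate source speaker, that a higher score means "same source speaker", and that the function name `equal_error_rate` and the toy scores are purely illustrative.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER (as a fraction) from two sets of trial scores.

    target_scores: scores of trials where the candidate IS the true source speaker.
    nontarget_scores: scores of trials where the candidate is NOT the source speaker.
    Assumption: a trial is accepted when score >= threshold.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([target, nontarget]))

    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    # The EER is read off where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy scores (hypothetical, not taken from the paper).
eer_separable = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
eer_overlap = equal_error_rate([0.9, 0.8, 0.4, 0.7], [0.1, 0.5, 0.2, 0.3])
```

    With perfectly separated scores the estimate is zero; when one target and one non-target score cross, both error rates meet at one in four trials. Published toolkits often interpolate between thresholds instead of taking the nearest point, so values computed this way can differ slightly from reported figures.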

       
