Abstract:
With the rapid development of deep learning-based Voice Conversion (VC) technology, attackers can compromise the security of Automatic Speaker Verification (ASV) systems by tampering with a source speaker's speech signal to mimic a target speaker's acoustic characteristics. Existing anti-spoofing countermeasures focus primarily on distinguishing genuine from spoofed speech, yet they underutilize the implicit source speaker identity information concealed within the spoofed speech. This limitation hinders their ability to meet practical demands in scenarios such as forensic investigation and fraud traceability. Research on source speaker identity verification remains insufficient, and existing methods suffer from limited generalization capability. To address this, this paper constructs a new voice conversion spoofing dataset specifically designed for source speaker verification, named VCSD-SSV (Voice Conversion Spoofing Dataset for Source Speaker Verification). The dataset covers six VC algorithms (Seed-VC, FACodec, DDDM-VC, Diff-HierVC, TriAAN-VC, and Free-VC) and 400 speakers, totaling 327,677 speech samples (46,811 genuine and 280,866 converted). Building upon this dataset, we propose FGE-SSV (Source Speaker Verification based on Feature Generalised Enhancement), a source speaker verification model that extracts multi-level source speaker timbre features through a feature enhancement module and cascaded instance-batch normalization squeeze-and-excitation blocks. Experimental results demonstrate that FGE-SSV maintains consistently low EER (0.2% to 1.0%) in within-domain evaluation across all six VC methods in VCSD-SSV, and also achieves relatively low EER across the six VC methods in the SSTC2024 dataset.
Furthermore, FGE-SSV generalizes well to unseen voice conversion methods, achieving lower EER than the comparative methods.