    Citation: Zhu Minghang, Liu Xin, Yu Zhenning, Xu Xing, Zheng Shukai. Cross Face-Voice Matching Method via Bi-Pseudo Label Based Self-Supervised Learning[J]. Journal of Computer Research and Development, 2023, 60(11): 2638-2649. DOI: 10.7544/issn1000-1239.202220411

    Cross Face-Voice Matching Method via Bi-Pseudo Label Based Self-Supervised Learning

    • Abstract: Neurocognitive research shows that the human brain often draws on face information, through cross-modal interaction analysis, when perceiving speech. However, existing cross-modal face-voice association methods remain sensitive to complex samples, lack supervisory information, and capture insufficient semantic correlation, mainly because they do not mine the latent semantics shared by the two modalities. To address these problems, we propose a cross-modal learning architecture based on bi-pseudo label self-supervised learning for face-voice association learning and matching. First, we build a cross-modal weighted residual network to learn a shared face-voice embedding. We then propose a novel bi-pseudo label association method for self-supervised learning, in which the latent semantic information of one modality supervises the feature learning of the other; this interactive cross-modal self-supervision uncovers tighter face-voice associations. To make the mined supervisory information more discriminative, we further construct two auxiliary losses that pull face and voice features of the same identity closer together and push features of different identities apart. Extensive experiments show that the proposed method outperforms existing methods across face-voice cross-modal matching tasks.
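    The idea of one modality supervising the other can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how pseudo-labels are produced or how the loss is defined, so the sketch assumes that each modality outputs scores over K hypothetical pseudo-classes, takes the arg-max class of one modality as the pseudo-label, and applies cross-entropy to the other modality in both directions. The names `bi_pseudo_label_loss`, `face_logits` and `voice_logits` are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    return -math.log(softmax(logits)[label])


def bi_pseudo_label_loss(face_logits, voice_logits):
    """Assumed bidirectional pseudo-label loss (illustrative only).

    face_logits[i] and voice_logits[i] are scores over K hypothetical
    pseudo-classes for the i-th paired face-voice sample.
    """
    total = 0.0
    for f, v in zip(face_logits, voice_logits):
        # Pseudo-label = arg-max class of each modality (an assumption).
        face_label = max(range(len(f)), key=f.__getitem__)
        voice_label = max(range(len(v)), key=v.__getitem__)
        total += cross_entropy(v, face_label)   # face supervises voice
        total += cross_entropy(f, voice_label)  # voice supervises face
    return total / (2 * len(face_logits))


# A pair whose modalities agree on the pseudo-class incurs a lower loss
# than a pair whose modalities disagree.
agree = bi_pseudo_label_loss([[4.0, 0.0]], [[4.0, 0.0]])
disagree = bi_pseudo_label_loss([[4.0, 0.0]], [[0.0, 4.0]])
```

    Under these assumptions, minimizing the loss pushes the two modalities toward consistent pseudo-class assignments for paired samples. A trainable version would also need to treat the pseudo-labels as fixed targets, and the two auxiliary losses mentioned in the abstract are not modelled here.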

       
