Liu Xin, Wang Rui, Zhong Bineng, Wang Nannan. Cross Face-Voice Matching via Double-Stream Networks and Bi-Quintuple Loss[J]. Journal of Computer Research and Development, 2022, 59(3): 694-705. DOI: 10.7544/issn1000-1239.20200547

Cross Face-Voice Matching via Double-Stream Networks and Bi-Quintuple Loss

Funds: This work was supported by the National Natural Science Foundation of China (61673185, 61922066, 61972167), the Project of State Key Laboratory of Integrated Services Networks (ISN20-11), the Natural Science Foundation of Fujian Province (2020J01084), and the Zhejiang Laboratory (2021KH0AB01).
  • Published Date: February 28, 2022
Abstract: Facial information and voice cues are among the most natural and flexible channels in human-computer interaction, and recent research has paid increasing attention to intelligent cross-modal perception between the face and voice modalities. Nevertheless, most existing methods fail to perform well on challenging cross-modal face-voice matching tasks, mainly due to the combined effects of the semantic gap and modality heterogeneity. In this paper, we propose an efficient cross-modal face-voice matching network built on double-stream networks and a bi-quintuple loss, whose derived feature representations can be readily adapted to four challenging cross-modal matching tasks between faces and voices. First, we introduce a novel modality-shared multi-modal weighted residual network to model the face-voice association, embedding it on the top layer of our double-stream network. Then, a bi-quintuple loss is newly proposed to significantly improve data utilization while enhancing the generalization ability of the network model. Further, we learn to predict the identity (ID) of each person during training, which supervises the discriminative feature learning process. As a result, discriminative cross-modal representations can be learned for the different matching tasks. In four different cross-modal matching tasks, extensive experiments show that the proposed approach outperforms state-of-the-art methods by a large margin of up to 5%.
  • Cited by

    Periodical cited type (1)

    1. 陈亚颐, 尹权, 再开日亚·安尼娃尔, 张乐. Simulation of a cross-system mobile multi-source heterogeneous data integration algorithm. Computer Simulation (计算机仿真), 2025(01): 410-414.

    Other cited types (2)
