ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2022, Vol. 59, Issue (3): 694-705. doi: 10.7544/issn1000-1239.20200547


Cross Face-Voice Matching via Double-Stream Networks and Bi-Quintuple Loss

Liu Xin1,2,3, Wang Rui1,3, Zhong Bineng4, Wang Nannan2   

  1(College of Computer Science and Technology, Huaqiao University, Xiamen, Fujian 361021); 2(State Key Laboratory of Integrated Services Networks (Xidian University), Xi’an 710071); 3(Xiamen Key Laboratory of Computer Vision and Pattern Recognition (Huaqiao University), Xiamen, Fujian 361021); 4(School of Computer Science and Information Engineering, Guangxi Normal University, Guilin, Guangxi 541004)
  • Online: 2022-03-07
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61673185, 61922066, 61972167), the Project of State Key Laboratory of Integrated Services Networks (ISN20-11), the Natural Science Foundation of Fujian Province (2020J01084), and the Zhejiang Laboratory (2021KH0AB01).

Abstract: Facial information and voice cues are among the most natural and flexible channels for human-computer interaction, and recent research has paid increasing attention to intelligent cross-modal perception between the face and voice modalities. Nevertheless, most existing methods perform poorly on some challenging cross-modal face-voice matching tasks, mainly because the semantic gap and modality heterogeneity are intertwined in complex ways. In this paper, we propose an efficient cross-modal face-voice matching network based on double-stream networks and a bi-quintuple loss; the learned feature representations can be applied to four challenging cross-modal matching tasks between faces and voices. First, we introduce a novel modality-shared multi-modal weighted residual network, embedded on the top layer of our double-stream network, to model the face-voice association. Second, we propose a bi-quintuple loss that significantly improves data utilization while enhancing the generalization ability of the network model. Third, we predict the identity (ID) of each person during training, which supervises the discriminative feature learning process. As a result, discriminative cross-modal representations can be learned for the different matching tasks. Extensive experiments on four cross-modal matching tasks show that the proposed approach outperforms state-of-the-art methods by margins of up to 5%.
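
The abstract does not spell out the form of the bi-quintuple loss, so the following is only an illustrative sketch under stated assumptions: a quintuple is taken to be one anchor, one matching sample from the other modality, and three non-matching samples; the loss is a margin-based hinge over Euclidean distances; and "bi" means the loss is applied in both the face-to-voice and voice-to-face directions. The function names, the margin value, and the choice of distance are hypothetical and may differ from the formulation in the paper.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quintuple_loss(anchor, positive, negatives, margin=0.5):
    """Hinge ranking loss over one quintuple (assumed structure).

    anchor    : embedding from one modality (e.g. a face)
    positive  : embedding of the same identity from the other modality
    negatives : three embeddings of other identities from the other modality
    The margin of 0.5 is an arbitrary illustrative value.
    """
    if len(negatives) != 3:
        raise ValueError("a quintuple needs exactly three negatives")
    d_pos = l2_distance(anchor, positive)
    # Each negative should be at least `margin` farther away than the positive.
    return sum(
        max(0.0, margin + d_pos - l2_distance(anchor, neg))
        for neg in negatives
    )


def bi_quintuple_loss(face, voice, neg_voices, neg_faces, margin=0.5):
    """Sum of the face-anchored and voice-anchored quintuple losses."""
    face_to_voice = quintuple_loss(face, voice, neg_voices, margin)
    voice_to_face = quintuple_loss(voice, face, neg_faces, margin)
    return face_to_voice + voice_to_face
```

Under these assumptions, each matched face-voice pair contributes six ranking constraints rather than the single constraint of a one-directional triplet, which is one plausible reading of the abstract's claim about improved data utilization.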

Key words: face-voice associations, cross-modal perception, double-stream networks, bi-quintuple loss, weighted residual network
