Liu Xin, Wang Rui, Zhong Bineng, Wang Nannan. Cross Face-Voice Matching via Double-Stream Networks and Bi-Quintuple Loss[J]. Journal of Computer Research and Development, 2022, 59(3): 694-705. DOI: 10.7544/issn1000-1239.20200547
3(Xiamen Key Laboratory of Computer Vision and Pattern Recognition (Huaqiao University), Xiamen, Fujian 361021)
4(School of Computer Science and Information Engineering, Guangxi Normal University, Guilin, Guangxi 541004)
Funds: This work was supported by the National Natural Science Foundation of China (61673185, 61922066, 61972167), the Project of State Key Laboratory of Integrated Services Networks (ISN20-11), the Natural Science Foundation of Fujian Province (2020J01084), and the Zhejiang Laboratory (2021KH0AB01).
Facial appearance and voice are among the most natural and flexible cues in human-computer interaction, and cross-modal perception between the face and voice modalities has recently attracted increasing research attention. Nevertheless, most existing methods often fail to perform well on challenging cross-modal face-voice matching tasks, mainly due to the combined difficulty of the semantic gap and modality heterogeneity. In this paper, we propose an efficient cross-modal face-voice matching network built on double-stream networks and a bi-quintuple loss, whose derived feature representations adapt well to four challenging cross-modal matching tasks between faces and voices. First, we introduce a novel modality-shared multi-modal weighted residual network to model the face-voice association, embedding it on the top layer of our double-stream network. Then, a bi-quintuple loss is newly proposed to significantly improve data utilization while enhancing the generalization ability of the network model. Further, we learn to predict the identity (ID) of each person during training, which supervises the discriminative feature learning process. As a result, discriminative cross-modal representations can be learned for different matching tasks. Extensive experiments on four different cross-modal matching tasks show that the proposed approach outperforms state-of-the-art methods by a large margin, reaching up to 5%.
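To make the bi-quintuple idea concrete, the following is a minimal PyTorch sketch, not the authors' released code. It assumes each quintuple consists of one anchor, one cross-modal positive, and three cross-modal negatives, ranked with a margin loss in both matching directions (face-to-voice and voice-to-face, hence "bi-"); the exact quintuple construction, the margin value, and all function names here are illustrative assumptions.

```python
# Hypothetical sketch of a bi-directional quintuple margin loss.
# The quintuple layout (anchor, positive, three negatives) is an assumption
# for illustration; the paper's exact formulation may differ.
import torch
import torch.nn.functional as F

def quintuple_loss(anchor, pos, negs, margin=0.6):
    """Margin ranking of one anchor/positive pair against three negatives.

    anchor, pos: (B, D) embedding batches; negs: list of three (B, D) tensors.
    """
    d_pos = F.pairwise_distance(anchor, pos)        # (B,) anchor-positive distance
    loss = torch.zeros((), device=anchor.device)
    for neg in negs:
        d_neg = F.pairwise_distance(anchor, neg)    # (B,) anchor-negative distance
        # Push each negative farther from the anchor than the positive, by a margin.
        loss = loss + F.relu(d_pos - d_neg + margin).mean()
    return loss / len(negs)

def bi_quintuple_loss(face, voice, neg_faces, neg_voices, margin=0.6):
    # "Bi" = the same quintuple structure applied in both directions:
    # face anchors ranked against voice negatives, and vice versa.
    f2v = quintuple_loss(face, voice, neg_voices, margin)
    v2f = quintuple_loss(voice, face, neg_faces, margin)
    return f2v + v2f
```

In training, this ranking term would be combined with the ID-prediction (classification) loss described above, so that each quintuple reuses every sample as both an anchor and a negative, which is the data-utilization benefit the abstract refers to.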