ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 (Journal of Computer Research and Development) ›› 2022, Vol. 59 ›› Issue (3): 694-705. doi: 10.7544/issn1000-1239.20200547

• Artificial Intelligence •




Cross Face-Voice Matching via Double-Stream Networks and Bi-Quintuple Loss

Liu Xin1,2,3, Wang Rui1,3, Zhong Bineng4, Wang Nannan2   

  1. 1(College of Computer Science and Technology, Huaqiao University, Xiamen, Fujian 361021);2(State Key Laboratory of Integrated Services Networks (Xidian University), Xi’an 710071);3(Xiamen Key Laboratory of Computer Vision and Pattern Recognition (Huaqiao University), Xiamen, Fujian 361021);4(School of Computer Science and Information Engineering, Guangxi Normal University, Guilin, Guangxi 541004)
  • Online: 2022-03-07
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61673185, 61922066, 61972167), the Project of State Key Laboratory of Integrated Services Networks (ISN20-11), the Natural Science Foundation of Fujian Province (2020J01084), and the Zhejiang Laboratory (2021KH0AB01).

Abstract: Facial visual information and voice are the most direct and flexible channels in human-computer interaction, and intelligent cross-modal perception between faces and voices has therefore attracted wide attention from researchers at home and abroad. However, owing to the heterogeneity of face-voice samples and the semantic gap between the modalities, existing methods cannot handle some of the more difficult cross-modal face-voice matching tasks well. This paper proposes a cross-modal face-voice feature learning framework that combines a double-stream network with a bidirectional quintuple (bi-quintuple) loss; the learned features can be used directly for four different cross-modal face-voice matching tasks. First, a new weight-sharing multi-modal weighted residual network is introduced at the top of the double-stream deep network to mine the semantic associations between the face and voice modalities. Next, a bi-quintuple loss that fuses multiple sample-pair construction strategies is designed, greatly improving data utilization and the generalization performance of the model. Finally, ID classification learning is performed during model training to ensure the separability of the cross-modal representations. Experimental results show that, compared with existing methods, the proposed approach achieves consistent improvements on four different cross-modal face-voice matching tasks, with some evaluation metrics improving by nearly 5%.

Keywords: face-voice association, cross-modal perception, double-stream network, bi-quintuple loss, weighted residual network

Abstract: Facial information and voice cues are among the most natural and flexible channels in human-computer interaction, and recent research has paid increasing attention to intelligent cross-modal perception between the face and voice modalities. Nevertheless, most existing methods fail to perform well on challenging cross-modal face-voice matching tasks, mainly due to the semantic gap and the heterogeneity between face and voice samples. In this paper, we present an efficient cross-modal face-voice matching framework built on double-stream networks and a bi-quintuple loss, whose derived feature representations can be directly applied to four challenging cross-modal matching tasks between faces and voices. First, we introduce a novel weight-shared multi-modal weighted residual network, embedded on top of the double-stream network, to model the face-voice association. Then, a newly proposed bi-quintuple loss significantly improves data utilization while enhancing the generalization ability of the network model. Further, we learn to predict the identity (ID) of each person during training, which supervises the discriminative feature learning process. As a result, discriminative cross-modal representations are well learned for the different matching tasks. Extensive experiments on four cross-modal matching tasks show that the proposed approach outperforms state-of-the-art methods, with some evaluation metrics improving by nearly 5%.
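The abstract does not spell out the bi-quintuple loss, so the sketch below is a rough illustration only. It assumes each quintuple holds one anchor embedding, one matching (positive) embedding from the other modality, and three non-matching (negative) embeddings, with a margin-based hinge applied in both the face-to-voice and voice-to-face directions. The function names, the number of negatives, and the Euclidean-distance choice are assumptions, not the paper's exact formulation.

```python
import numpy as np

def quintuple_loss(anchor, positive, negatives, margin=0.2):
    """Hinge-style loss: pull the anchor toward its positive and push it
    at least `margin` (in Euclidean distance) away from every negative."""
    d_pos = np.linalg.norm(anchor - positive)
    return sum(max(0.0, margin + d_pos - np.linalg.norm(anchor - neg))
               for neg in negatives)

def bi_quintuple_loss(face, voice_pos, voice_negs,
                      voice, face_pos, face_negs, margin=0.2):
    # "bi" = the loss is applied in both retrieval directions:
    # face anchor -> voice candidates, and voice anchor -> face candidates.
    return (quintuple_loss(face, voice_pos, voice_negs, margin) +
            quintuple_loss(voice, face_pos, face_negs, margin))
```

Constructing quintuples in both directions is one plausible reading of how the loss "improves data utilization": each sampled identity contributes ranking constraints for both face-to-voice and voice-to-face retrieval.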

Key words: face-voice associations, cross-modal perception, double-stream networks, bi-quintuple loss, weighted residual network