    Huang Xuejian, Ma Tinghuai, Wang Gensheng. Multimodal Learning Method Based on Intra- and Inter-Sample Cooperative Representation and Adaptive Fusion[J]. Journal of Computer Research and Development, 2024, 61(5): 1310-1324. DOI: 10.7544/issn1000-1239.202330722

    Multimodal Learning Method Based on Intra- and Inter-Sample Cooperative Representation and Adaptive Fusion


      Abstract: Multimodal machine learning is an emerging paradigm in artificial intelligence that combines multiple modalities with intelligent processing algorithms to achieve higher performance. Multimodal representation and multimodal fusion are two key tasks in multimodal machine learning. Currently, multimodal representation methods rarely consider collaboration between samples, which leaves feature representations lacking robustness, and most multimodal feature fusion methods are sensitive to noisy data. Therefore, for multimodal representation, a method based on intra-sample and inter-sample multimodal collaboration is proposed to fully learn intra- and inter-modality interactions and improve the robustness of feature representations. First, text, speech, and visual features are extracted with pre-trained BERT, Wav2vec 2.0, and Faster R-CNN, respectively. Second, to exploit the complementarity and consistency of multimodal data, two kinds of encoders, modality-specific and modality-shared, are constructed to learn modality-specific and shared feature representations. Then, an intra-sample collaboration loss is formulated using central moment discrepancy and an orthogonality constraint, and an inter-sample collaboration loss is built with contrastive learning. Finally, the representation learning objective is defined from the intra-sample collaboration error, the inter-sample collaboration error, and the sample reconstruction error. For multimodal fusion, because each modality may play a different role and carry a different level of noise at different times, an adaptive multimodal feature fusion method based on an attention mechanism and gated neural networks is designed. Experimental results on the multimodal intent recognition dataset MIntRec and the sentiment datasets CMU-MOSI and CMU-MOSEI show that the proposed multimodal learning method outperforms baseline methods on multiple evaluation metrics.
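
    The abstract names the concrete building blocks of the method: central moment discrepancy plus an orthogonality constraint for intra-sample collaboration, contrastive learning for inter-sample collaboration, and attention plus gating for adaptive fusion. The following is a minimal PyTorch-style sketch of how these pieces could be wired together; it is an illustrative reconstruction from the abstract only, and every name, tensor shape, and hyper-parameter below (cmd_loss, GatedAttentionFusion, k=5, temperature=0.07, etc.) is hypothetical rather than the authors' implementation.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        def cmd_loss(x, y, k=5):
            # Central moment discrepancy: match the mean and the first k central
            # moments of two batches of shared representations (consistency term).
            mx, my = x.mean(dim=0), y.mean(dim=0)
            loss = torch.norm(mx - my)
            cx, cy = x - mx, y - my
            for i in range(2, k + 1):
                loss = loss + torch.norm(cx.pow(i).mean(dim=0) - cy.pow(i).mean(dim=0))
            return loss

        def orthogonality_loss(specific, shared):
            # Push each sample's modality-specific vector to be orthogonal to its
            # modality-shared vector (complementarity term).
            specific = F.normalize(specific, dim=-1)
            shared = F.normalize(shared, dim=-1)
            return (specific * shared).sum(dim=-1).pow(2).mean()

        def inter_sample_contrastive(anchor, positive, temperature=0.07):
            # InfoNCE-style loss: the same sample seen in two modalities is a
            # positive pair, all other samples in the batch act as negatives.
            anchor = F.normalize(anchor, dim=-1)
            positive = F.normalize(positive, dim=-1)
            logits = anchor @ positive.t() / temperature
            labels = torch.arange(anchor.size(0), device=anchor.device)
            return F.cross_entropy(logits, labels)

        class GatedAttentionFusion(nn.Module):
            # Adaptive fusion: a sigmoid gate suppresses noisy components of each
            # modality, then attention weights decide how much each modality
            # contributes to the fused representation.
            def __init__(self, dim):
                super().__init__()
                self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
                self.attn = nn.Linear(dim, 1)

            def forward(self, feats):                    # feats: (batch, n_modalities, dim)
                gated = feats * self.gate(feats)         # per-modality gating
                weights = torch.softmax(self.attn(gated), dim=1)
                return (weights * gated).sum(dim=1)      # (batch, dim)

        # Toy usage with random "text / audio / visual" features.
        b, d = 8, 128
        text_sh, audio_sh, vis_sh = torch.randn(b, d), torch.randn(b, d), torch.randn(b, d)
        text_sp = torch.randn(b, d)
        intra = cmd_loss(text_sh, audio_sh) + orthogonality_loss(text_sp, text_sh)
        inter = inter_sample_contrastive(text_sh, audio_sh)
        fused = GatedAttentionFusion(d)(torch.stack([text_sh, audio_sh, vis_sh], dim=1))
        # A full objective would also add a reconstruction error and the task loss,
        # e.g. L = L_task + a * intra + b * inter + c * L_recon (weights hypothetical).

    In this sketch the per-sample gate down-weights a noisy modality before attention assigns the cross-modal weights, which is one way to realize the abstract's requirement that fusion adapt to modality-dependent roles and noise levels.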

       
