Abstract:
Multimodal machine learning is a paradigm in artificial intelligence that leverages multiple modalities and dedicated processing algorithms to achieve better performance than unimodal methods. Multimodal representation and multimodal fusion are two pivotal tasks in this field. Most existing multimodal representation methods pay little attention to inter-sample collaboration, which weakens the robustness of the learned feature representations; in addition, most multimodal fusion methods are sensitive to noisy data. To address these issues, a multimodal representation approach based on both intra-sample and inter-sample collaboration is proposed to capture interactions within and between modalities and thereby improve the robustness of feature representation. First, text, speech, and visual features are extracted with the pre-trained models BERT, Wav2vec 2.0, and Faster R-CNN, respectively. Then, to exploit the complementarity and consistency of multimodal data, two kinds of encoders, modality-specific and modality-shared, are constructed to learn modality-specific and modality-shared feature representations. Intra-sample collaboration loss functions are formulated using central moment discrepancy and orthogonality constraints, while inter-sample collaboration loss functions are formulated using contrastive learning. Finally, the overall representation learning objective combines the intra-sample collaboration, inter-sample collaboration, and sample reconstruction losses. For multimodal fusion, an adaptive feature fusion method based on attention mechanisms and gated neural networks is designed, accounting for the fact that each modality may contribute differently and carry different levels of noise at different times. Experimental results on the multimodal intent recognition dataset MIntRec and the emotion recognition datasets CMU-MOSI and CMU-MOSEI show that the proposed approach outperforms baseline methods on multiple evaluation metrics.
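To make the loss terms and the fusion step summarized above concrete, the following PyTorch sketch shows one plausible formulation of a central moment discrepancy (CMD) loss and an orthogonality loss for intra-sample collaboration, an InfoNCE-style contrastive loss for inter-sample collaboration, and a gated attention fusion of modality features. The function names, moment order, temperature, and projection layers are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (assumed design, not the paper's exact implementation) of the
# loss terms and adaptive fusion step described in the abstract.
import torch
import torch.nn.functional as F


def cmd_loss(x, y, k=5):
    """Central moment discrepancy between two feature batches x, y of shape (N, d)."""
    mx, my = x.mean(dim=0), y.mean(dim=0)
    loss = (mx - my).norm(p=2)                      # match the means
    cx, cy = x - mx, y - my
    for order in range(2, k + 1):                   # match higher-order central moments
        loss = loss + (cx.pow(order).mean(dim=0) - cy.pow(order).mean(dim=0)).norm(p=2)
    return loss


def orthogonality_loss(specific, shared):
    """Push modality-specific and modality-shared features toward orthogonality."""
    specific = F.normalize(specific, dim=-1)
    shared = F.normalize(shared, dim=-1)
    return (specific.t() @ shared).pow(2).sum()     # squared Frobenius norm of the overlap


def info_nce_loss(anchor, positive, temperature=0.07):
    """Inter-sample contrastive loss: matching rows of the two batches are positives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature    # (N, N) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)


def gated_attention_fusion(features, attn_proj, gate_proj):
    """Adaptive fusion: attention weights across modalities plus per-modality gates.

    features: list of M tensors of shape (N, d);
    attn_proj: nn.Linear(d, 1); gate_proj: nn.Linear(d, d).
    """
    stacked = torch.stack(features, dim=1)                        # (N, M, d)
    attn = torch.softmax(attn_proj(stacked).squeeze(-1), dim=1)   # (N, M) modality weights
    gates = torch.sigmoid(gate_proj(stacked))                     # (N, M, d) suppress noisy dims
    return (attn.unsqueeze(-1) * gates * stacked).sum(dim=1)      # (N, d) fused feature
```

In this sketch, the overall representation-learning objective would be a weighted sum of the task loss, the CMD and orthogonality terms, the contrastive term, and a reconstruction term; the weighting coefficients are hyperparameters and are not specified by the abstract.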