Citation: Huang Xuejian, Ma Tinghuai, Wang Gensheng. Multimodal Learning Method Based on Intra- and Inter-Sample Cooperative Representation and Adaptive Fusion[J]. Journal of Computer Research and Development, 2024, 61(5): 1310-1324. DOI: 10.7544/issn1000-1239.202330722
Multimodal machine learning is a new paradigm in artificial intelligence that combines multiple modalities with intelligent processing algorithms to achieve better performance. Multimodal representation and multimodal fusion are two of its pivotal tasks. Most existing multimodal representation methods pay little attention to collaboration between samples, which weakens the robustness of the learned features, and most multimodal fusion methods are sensitive to noisy data. For multimodal representation, an approach based on both intra-sample and inter-sample multimodal collaboration is therefore proposed, so that interactions within and between modalities are captured jointly and the feature representation becomes more robust. First, text, speech, and visual features are extracted with the pre-trained models BERT, Wav2vec 2.0, and Faster R-CNN, respectively. Then, to exploit the complementarity and consistency of multimodal data, two kinds of encoders, modality-specific and modality-shared, are built to learn modality-specific and modality-shared feature representations. Next, intra-sample collaboration losses are formulated with central moment discrepancy and orthogonality constraints, and an inter-sample collaboration loss is formulated with contrastive learning. Finally, the representation learning objective combines the intra-sample collaboration, inter-sample collaboration, and sample reconstruction losses. For multimodal fusion, an adaptive feature fusion method based on attention mechanisms and gated neural networks is designed, which accounts for the fact that each modality may play a different role and carry a different level of noise at different times. Experimental results on the multimodal intent recognition dataset MIntRec and the sentiment analysis datasets CMU-MOSI and CMU-MOSEI show that the proposed multimodal learning method outperforms baseline methods on multiple evaluation metrics.
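The pipeline outlined in the abstract, modality-specific and modality-shared encoders trained with central-moment-discrepancy, orthogonality, and contrastive losses, followed by attention-plus-gating fusion, can be illustrated with a short example. The PyTorch sketch below is illustrative only: the layer sizes, loss forms, the `GatedAttentionFusion` module, and the omission of the reconstruction term are assumptions made for exposition, not the authors' actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cmd_loss(x, y, k=3):
    """Intra-sample collaboration: central moment discrepancy pulls the shared
    representations of different modalities toward the same distribution."""
    mx, my = x.mean(dim=0), y.mean(dim=0)
    loss = (mx - my).norm(p=2)
    cx, cy = x - mx, y - my
    for order in range(2, k + 1):
        loss = loss + ((cx ** order).mean(dim=0) - (cy ** order).mean(dim=0)).norm(p=2)
    return loss


def orthogonality_loss(specific, shared):
    """Intra-sample collaboration: keep each sample's modality-specific and
    modality-shared representations (nearly) orthogonal."""
    specific = F.normalize(specific, dim=-1)
    shared = F.normalize(shared, dim=-1)
    return (specific * shared).sum(dim=-1).pow(2).mean()


def inter_sample_contrastive_loss(z1, z2, temperature=0.1):
    """Inter-sample collaboration (InfoNCE): two views of the same sample are
    positives, all other samples in the batch act as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)


class GatedAttentionFusion(nn.Module):
    """Adaptive fusion: attention weighs how much each modality contributes,
    while a sigmoid gate can suppress noisy dimensions of a modality."""

    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feats):                                # feats: (batch, n_modalities, dim)
        weights = torch.softmax(self.attn(feats), dim=1)     # attention over modalities
        gated = feats * self.gate(feats)                     # element-wise gating
        return (weights * gated).sum(dim=1)                  # fused (batch, dim) vector


if __name__ == "__main__":
    torch.manual_seed(0)
    batch, dim = 8, 64
    # Stand-ins for encoder outputs (in the paper these would come from BERT,
    # Wav2vec 2.0, and Faster R-CNN followed by the two kinds of encoders).
    text_specific, text_shared = torch.randn(batch, dim), torch.randn(batch, dim)
    audio_shared = torch.randn(batch, dim)

    representation_loss = (cmd_loss(text_shared, audio_shared)
                           + orthogonality_loss(text_specific, text_shared)
                           + inter_sample_contrastive_loss(text_shared, audio_shared))
    fused = GatedAttentionFusion(dim)(torch.stack([text_shared, audio_shared], dim=1))
    print(fused.shape, representation_loss.item())
```

In this sketch the representation objective is simply the sum of the three collaboration terms; in practice one would add the reconstruction error and weight the terms, as the abstract describes.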