Citation: Huang Xuejian, Ma Tinghuai, Wang Gensheng. Multimodal Learning Method Based on Intra- and Inter-Sample Cooperative Representation and Adaptive Fusion[J]. Journal of Computer Research and Development, 2024, 61(5): 1310-1324. DOI: 10.7544/issn1000-1239.202330722
Multimodal machine learning is a new paradigm in artificial intelligence that combines multiple modalities with intelligent processing algorithms to achieve better performance. Multimodal representation and multimodal fusion are two of its pivotal tasks. Most existing multimodal representation methods pay little attention to collaboration between samples, which weakens the robustness of the learned features, and most multimodal fusion methods are sensitive to noisy data. For multimodal representation, an approach based on both intra-sample and inter-sample multimodal collaboration is therefore proposed, so that interactions within and between modalities are captured jointly and the feature representation becomes more robust. First, text, speech, and visual features are extracted with the pre-trained models BERT, Wav2vec 2.0, and Faster R-CNN, respectively. Then, to exploit the complementarity and consistency of multimodal data, two kinds of encoders, modality-specific and modality-shared, are built to learn modality-specific and modality-shared feature representations. Next, intra-sample collaboration losses are formulated with central moment discrepancy and orthogonality constraints, and an inter-sample collaboration loss is formulated with contrastive learning. Finally, the representation learning objective combines the intra-sample collaboration, inter-sample collaboration, and sample reconstruction losses. For multimodal fusion, an adaptive feature fusion method based on attention mechanisms and gated neural networks is designed, which accounts for the fact that each modality may play a different role and carry a different level of noise at different times. Experimental results on the multimodal intent recognition dataset MIntRec and the sentiment analysis datasets CMU-MOSI and CMU-MOSEI show that the proposed multimodal learning method outperforms baseline methods on multiple evaluation metrics.
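The pipeline outlined in the abstract, modality-specific and modality-shared encoders trained with central-moment-discrepancy, orthogonality, and contrastive losses, followed by attention-plus-gating fusion, can be illustrated with a short example. The PyTorch sketch below is illustrative only: the layer sizes, loss forms, the `GatedAttentionFusion` module, and the omission of the reconstruction term are assumptions made for exposition, not the authors' actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cmd_loss(x, y, k=3):
    """Intra-sample collaboration: central moment discrepancy pulls the shared
    representations of different modalities toward the same distribution."""
    mx, my = x.mean(dim=0), y.mean(dim=0)
    loss = (mx - my).norm(p=2)
    cx, cy = x - mx, y - my
    for order in range(2, k + 1):
        loss = loss + ((cx ** order).mean(dim=0) - (cy ** order).mean(dim=0)).norm(p=2)
    return loss


def orthogonality_loss(specific, shared):
    """Intra-sample collaboration: keep each sample's modality-specific and
    modality-shared representations (nearly) orthogonal."""
    specific = F.normalize(specific, dim=-1)
    shared = F.normalize(shared, dim=-1)
    return (specific * shared).sum(dim=-1).pow(2).mean()


def inter_sample_contrastive_loss(z1, z2, temperature=0.1):
    """Inter-sample collaboration (InfoNCE): two views of the same sample are
    positives, all other samples in the batch act as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)


class GatedAttentionFusion(nn.Module):
    """Adaptive fusion: attention weighs how much each modality contributes,
    while a sigmoid gate can suppress noisy dimensions of a modality."""

    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feats):                                # feats: (batch, n_modalities, dim)
        weights = torch.softmax(self.attn(feats), dim=1)     # attention over modalities
        gated = feats * self.gate(feats)                     # element-wise gating
        return (weights * gated).sum(dim=1)                  # fused (batch, dim) vector


if __name__ == "__main__":
    torch.manual_seed(0)
    batch, dim = 8, 64
    # Stand-ins for encoder outputs (in the paper these would come from BERT,
    # Wav2vec 2.0, and Faster R-CNN followed by the two kinds of encoders).
    text_specific, text_shared = torch.randn(batch, dim), torch.randn(batch, dim)
    audio_shared = torch.randn(batch, dim)

    representation_loss = (cmd_loss(text_shared, audio_shared)
                           + orthogonality_loss(text_specific, text_shared)
                           + inter_sample_contrastive_loss(text_shared, audio_shared))
    fused = GatedAttentionFusion(dim)(torch.stack([text_shared, audio_shared], dim=1))
    print(fused.shape, representation_loss.item())
```

In this sketch the representation objective is simply the sum of the three collaboration terms; in practice one would add the reconstruction error and weight the terms, as the abstract describes.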