Abstract:
Multimodal sentiment analysis uses subjective information from multiple modalities to analyze sentiment. In some scenarios, the sentiment expressed in different modalities is inconsistent or even contradictory, which weakens the effectiveness of multimodal collaborative decision-making. In this paper, a multimodal learning method is proposed to learn modality feature representations with consistent sentiment semantics. To improve the common feature representations of the different modalities and capture dynamic inter-modal interactions without distorting the original information, we first learn a common feature representation for each modality, and then apply cross attention so that each modality can effectively obtain auxiliary information from the common feature representations of the other modalities. For multimodal fusion, we propose a multimodal attention mechanism that produces a weighted concatenation of the modality feature representations, amplifying the contribution of informative modalities and suppressing the influence of weak ones. On the sentiment analysis datasets MOSI, MOSEI, and CH-SIMS, the proposed method outperforms the comparison models, demonstrating the necessity and value of addressing sentiment semantic inconsistency in multimodal sentiment analysis.
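To make the fusion scheme summarized above more concrete, the following PyTorch sketch illustrates the two ingredients named in the abstract: cross attention in which each modality queries the common representations of the other modalities, and a modality-level attention that weights the representations before concatenation. This is a minimal illustration under assumed settings (a shared feature dimension of 128, three modalities, mean pooling over time, and hypothetical names such as `CrossModalFusion` and `modality_score`); it is not the authors' implementation.

```python
# Minimal sketch (not the authors' code): cross attention over per-modality
# common representations, followed by attention-weighted concatenation.
# Dimensions, module names, and pooling choices are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4, num_modalities: int = 3):
        super().__init__()
        # Project each modality's features into a shared ("common") space.
        self.common_proj = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_modalities)]
        )
        # One cross-attention block per modality: that modality queries the
        # common representations of the remaining modalities.
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_modalities)]
        )
        # "Multimodal attention": a scalar score per modality used for the
        # weighted concatenation at fusion time.
        self.modality_score = nn.Linear(dim, 1)

    def forward(self, feats: list) -> torch.Tensor:
        # feats[i]: (batch, seq_len_i, dim) features of modality i.
        common = [proj(x) for proj, x in zip(self.common_proj, feats)]
        enhanced = []
        for i, attn in enumerate(self.cross_attn):
            # The other modalities, concatenated along time, serve as key/value.
            others = torch.cat([c for j, c in enumerate(common) if j != i], dim=1)
            out, _ = attn(query=common[i], key=others, value=others)
            # Pool over time to obtain one vector per modality.
            enhanced.append(out.mean(dim=1))                          # (batch, dim)
        stacked = torch.stack(enhanced, dim=1)                        # (batch, M, dim)
        # Softmax over modalities: larger weights for contributing modalities,
        # smaller weights suppress weak ones.
        weights = torch.softmax(self.modality_score(stacked), dim=1)  # (batch, M, 1)
        weighted = stacked * weights
        return weighted.flatten(start_dim=1)                          # weighted concatenation


if __name__ == "__main__":
    fusion = CrossModalFusion()
    text = torch.randn(2, 20, 128)    # e.g., text features
    audio = torch.randn(2, 50, 128)   # e.g., audio features
    video = torch.randn(2, 30, 128)   # e.g., visual features
    print(fusion([text, audio, video]).shape)  # torch.Size([2, 384])
```

The fused vector would then feed a downstream sentiment regression or classification head; the exact projections, pooling, and scoring functions in the paper may differ from this sketch.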