Abstract:
Most existing research on sentiment analysis relies on either textual or visual data alone and does not achieve satisfactory results. Because multi-modal data provide richer information, multi-modal sentiment analysis is attracting increasing attention and has become an active research topic. Owing to the strong semantic correlation between visual data and the co-occurring textual data in social media, mixed data of texts and images offers a new perspective for learning better classifiers for social media sentiment classification. We propose a hierarchical deep correlative fusion network framework that jointly learns textual and visual sentiment representations from training samples for sentiment classification. To alleviate the problem of fine-grained semantic matching between image and text, we apply both mid-level semantic features of images and deep multi-modal discriminative correlation analysis to learn the most relevant visual and semantic feature representations, while keeping both representations linearly discriminable. Motivated by the successful use of attention mechanisms, we further propose a multi-modal attention fusion network that incorporates the visual and semantic feature representations to train the sentiment classifier. Experiments on real-world datasets collected from social networks show that the proposed method achieves more accurate predictions for multi-modal sentiment analysis by hierarchically capturing the internal relations between text and image.
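The abstract gives no implementation details, so the following is only a minimal sketch, assuming PyTorch, of how attention-based fusion of a textual and a visual feature vector could feed a sentiment classifier. The module name `AttentionFusion`, the feature dimensions, and the softmax-gated weighting over modalities are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch (not the authors' code): attention-weighted fusion of
# textual and visual feature vectors followed by a sentiment classifier.
# Module names, dimensions, and the softmax-gated fusion are assumptions.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, text_dim=300, visual_dim=2048, hidden_dim=256, num_classes=2):
        super().__init__()
        # Project both modalities into a shared hidden space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Score each modality; a softmax over the scores yields attention weights.
        self.attn = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_feat, visual_feat):
        t = torch.tanh(self.text_proj(text_feat))      # (B, hidden_dim)
        v = torch.tanh(self.visual_proj(visual_feat))  # (B, hidden_dim)
        modalities = torch.stack([t, v], dim=1)        # (B, 2, hidden_dim)
        weights = torch.softmax(self.attn(modalities), dim=1)  # (B, 2, 1)
        fused = (weights * modalities).sum(dim=1)      # (B, hidden_dim)
        return self.classifier(fused)                  # sentiment logits


# Usage with a batch of 4 samples and assumed feature sizes.
model = AttentionFusion()
logits = model(torch.randn(4, 300), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 2])
```

In the paper's framework, the inputs to such a fusion step would presumably be the correlated visual and semantic representations learned by the deep multi-modal discriminative correlation analysis stage; here they are stand-in random tensors.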