Abstract:
This paper presents a language modeling approach to the sentiment classification of text. Characterizing the semantic orientation of a text as "thumbs up" or "thumbs down" provides semantic information beyond the topic captured in a text summary. The motivation is simple: "thumbs up" and "thumbs down" language models are likely to be substantially different, since they reflect different language habits. This divergence is exploited to classify test documents effectively. The method therefore proceeds in two stages: first, the two sentiment language models are estimated from training data; second, a test document is classified by comparing the Kullback-Leibler divergence between the language model estimated from it and each of the two trained sentiment models. Word unigrams and bigrams are employed as model parameters, which are estimated by maximum likelihood together with smoothing techniques. Compared with two other classifiers, SVMs and Naïve Bayes, on a movie review corpus, the language modeling approach performs better when training data is limited, and it also demonstrates robustness in sentiment classification. Future work may focus on estimating better language models, in particular higher-order n-gram models and more powerful smoothing methods.
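As a minimal sketch of the decision rule described above (notation introduced here for illustration, not taken verbatim from the paper): let $p(w \mid d)$ denote the language model estimated from a test document $d$, and $p(w \mid c)$ a trained sentiment model with $c \in \{\mathrm{pos}, \mathrm{neg}\}$. The document is assigned the label of the closer model:

$$
\mathrm{KL}\left(p_d \,\|\, p_c\right) = \sum_{w} p(w \mid d)\,\log \frac{p(w \mid d)}{p(w \mid c)},
\qquad
\hat{c} = \operatorname*{arg\,min}_{c \in \{\mathrm{pos},\,\mathrm{neg}\}} \mathrm{KL}\left(p_d \,\|\, p_c\right).
$$

Smoothing serves here to keep $p(w \mid c)$ nonzero for words of the test document unseen in training, so that the divergence remains finite.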