Abstract:
This paper presents a language modeling approach to the sentiment classification of text. Characterizing the semantic orientation of a text as "thumbs up" or "thumbs down" provides semantic information beyond the topic captured in a text summary. The motivation is simple: "thumbs up" and "thumbs down" language models are likely to be substantially different, since they reflect different language habits. This divergence is exploited to classify test documents effectively. The method therefore proceeds in two stages: first, the two sentiment language models are estimated from training data; second, a test document is classified by comparing the Kullback-Leibler divergence between the language model estimated from it and each of the two trained sentiment models. Word unigrams and bigrams are employed as model parameters, which are estimated by maximum likelihood together with smoothing techniques. Compared with two other classifiers, SVMs and Naïve Bayes, on a movie review corpus, the language modeling approach performs better when training data is limited, and it also demonstrates robustness in sentiment classification. Future work may focus on estimating better language models, in particular higher-order n-gram models and more powerful smoothing methods.
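As a minimal sketch of the decision rule described above (notation introduced here for illustration, not taken verbatim from the paper): let $p(w \mid d)$ denote the language model estimated from a test document $d$, and $p(w \mid c)$ a trained sentiment model with $c \in \{\mathrm{pos}, \mathrm{neg}\}$. The document is assigned the label of the closer model:

$$
\mathrm{KL}\left(p_d \,\|\, p_c\right) = \sum_{w} p(w \mid d)\,\log \frac{p(w \mid d)}{p(w \mid c)},
\qquad
\hat{c} = \operatorname*{arg\,min}_{c \in \{\mathrm{pos},\,\mathrm{neg}\}} \mathrm{KL}\left(p_d \,\|\, p_c\right).
$$

Smoothing serves here to keep $p(w \mid c)$ nonzero for words of the test document unseen in training, so that the divergence remains finite.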