Cross-Domain Text Sentiment Classification Based on Grouping-AdaBoost Ensemble

Zhao Chuanjun; Wang Suge; Li Deyu; Li Xin

doi:10.7544/issn1000-1239.2015.20140156

Abstract

In cross-domain sentiment classification, labeled data in the target domain is often scarce. To address this problem, this paper proposes a cross-domain text sentiment classification method based on a grouping-AdaBoost ensemble, which combines semi-supervised learning, Bootstrapping, data grouping, AdaBoost, and ensemble learning. First, a small amount of manually labeled target-domain data is used to generate a number of virtual samples with the synthetic minority over-sampling technique. On this basis, Bootstrapping is used to obtain more target-domain data with high-confidence labels. To construct the classifier, the labeled source-domain data is partitioned into groups of equal size, and each group is combined with the labeled target-domain data to form a combined data set. On each combined data set, multiple classifiers are trained by boosting with AdaBoost, and these classifiers are then linearly integrated into a single ensemble classifier. Experiments on sentiment data sets from four domains of Amazon shopping reviews show that the proposed method improves the accuracy of cross-domain text sentiment classification to a certain extent.
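The classifier-construction step (equal partition of the source data, AdaBoost on each combined data set, linear integration) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the weak learner or the combination weights, so decision stumps and equal weights are assumed here, and all function names and the toy data are hypothetical. The over-sampling and Bootstrapping steps are omitted.

```python
import math

def split_equal(source, k):
    """Partition labeled source data into k groups of (near-)equal size."""
    return [source[i::k] for i in range(k)]

def stump_predict(stump, x):
    """A decision stump: threshold test on one feature, output +1 or -1."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity

def best_stump(data, weights):
    """Exhaustively search for the stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for f in range(len(data[0][0])):
        for threshold in sorted({x[f] for x, _ in data}):
            for polarity in (1, -1):
                stump = (f, threshold, polarity)
                err = sum(w for (x, y), w in zip(data, weights)
                          if stump_predict(stump, x) != y)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def train_adaboost(data, rounds=10):
    """Discrete AdaBoost over stumps (assumed weak learner); labels are +1/-1."""
    n = len(data)
    weights = [1.0 / n] * n
    model = []  # list of (alpha, stump) pairs
    for _ in range(rounds):
        stump, err = best_stump(data, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        if err <= 1e-10:  # training data already separated perfectly
            break
    return model

def boosted_score(model, x):
    """Real-valued output of one boosted classifier."""
    return sum(alpha * stump_predict(stump, x) for alpha, stump in model)

def train_grouped_ensemble(source, target, k=2, rounds=10):
    """Train one boosted classifier per (source group + target data) block."""
    return [train_adaboost(group + target, rounds)
            for group in split_equal(source, k)]

def predict(models, x):
    """Linear combination of the group classifiers (equal weights assumed)."""
    total = sum(boosted_score(m, x) for m in models)
    return 1 if total >= 0 else -1

# Toy data (made up): feature vectors with labels +1 (positive) / -1 (negative).
source = [([0.9, 0.2], 1), ([0.8, 0.7], 1), ([0.7, 0.1], 1), ([0.6, 0.9], 1),
          ([0.3, 0.8], -1), ([0.2, 0.3], -1), ([0.1, 0.6], -1), ([0.4, 0.4], -1)]
target = [([0.75, 0.5], 1), ([0.25, 0.5], -1)]

models = train_grouped_ensemble(source, target, k=2)
positive_guess = predict(models, [0.85, 0.3])
negative_guess = predict(models, [0.15, 0.9])
```

Because every group is paired with the same target-domain data, each boosted classifier sees the target domain while only a slice of the source domain, which is one plausible reading of why the grouping is described as limiting the influence of source-domain data.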
