ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development (计算机研究与发展) ›› 2015, Vol. 52 ›› Issue (3): 629-638. doi: 10.7544/issn1000-1239.2015.20140156

• Information Processing •

Cross-Domain Text Sentiment Classification Based on Grouping-AdaBoost Ensemble

Zhao Chuanjun (赵传君)1, Wang Suge (王素格)1,2, Li Deyu (李德玉)1,2, Li Xin (李欣)1

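  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006; 2. Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education, Taiyuan 030006 (wsg@sxu.edu.cn)
  • Published: 2015-03-01
  • Supported by: 
    National Natural Science Foundation of China (61175067, 61272095, 61405109); National High Technology Research and Development Program of China (863 Program) (2015AA015407); Research Project for Returned Overseas Scholars of Shanxi Province (2013-014); Natural Science Foundation of Shanxi Province (2013011066-4); Science and Technology Key Project of Shanxi Province (20110321027-02)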

Cross-Domain Text Sentiment Classification Based on Grouping-AdaBoost Ensemble

Zhao Chuanjun1, Wang Suge1,2, Li Deyu1,2, Li Xin1   

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006; 2. Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education, Taiyuan 030006
  • Online: 2015-03-01

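Abstract: To address the scarcity of labeled data in the target domain, this paper proposes a cross-domain text sentiment classification method based on a grouping-AdaBoost ensemble, combining semi-supervised learning, Bootstrapping, data grouping, AdaBoost, and ensemble learning. The method first uses a small amount of manually labeled target-domain data to generate a certain number of virtual samples with the synthetic minority over-sampling technique. On this basis, Bootstrapping is used to obtain more high-confidence labeled data in the target domain. To build the classifier, the labeled source-domain data are first split into equal-sized parts, and each part is combined with the labeled target-domain data. On each combined data block, multiple classifiers are trained by boosting with AdaBoost, and these classifiers are linearly integrated into a single classifier. Experiments on sentiment data sets from four domains of the Amazon shopping website show that the method improves the accuracy of cross-domain text sentiment classification to some extent.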

Key words: sentiment classification, cross-domain, synthetic minority over-sampling technique, grouping-AdaBoost, ensemble classifier

Abstract: In cross-domain sentiment classification, labeled data in the target domain are often scarce. To address this problem, this paper proposes a grouping-AdaBoost ensemble classifier that combines semi-supervised learning, Bootstrapping, data grouping, AdaBoost, and ensemble learning. First, a small amount of labeled target-domain data is used to generate virtual samples with the synthetic minority over-sampling technique. On this basis, Bootstrapping is used to obtain more target-domain data with high-confidence labels. To construct the classifier, the labeled source-domain data are partitioned into equal-sized parts, and each part is combined with the labeled target-domain data to form a combined data set. On each combined data set, classifiers are trained and boosted with AdaBoost. Finally, the classifiers for all combined data sets are linearly integrated into an ensemble classifier. Experimental results on four data sets from Amazon online shopping review corpora indicate that the proposed method improves the accuracy of cross-domain sentiment classification to some extent.

Key words: sentiment classification, cross-domain, synthetic minority over-sampling technique, grouping-AdaBoost, ensemble classifier
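
The pipeline summarized in the abstract can be illustrated with a minimal, hedged sketch in plain Python. It is not the authors' implementation: the function names (`smote_like`, `adaboost`, `grouping_adaboost`, `ensemble_predict`), the decision-stump base learner, the nearest-neighbour (k = 1) interpolation, the uniform averaging used for the "linear" integration, and the toy 2-D data are all assumptions made for illustration. The Bootstrapping (self-training) step is omitted for brevity.

```python
import math
import random

def smote_like(samples, n_new, rng):
    """Create virtual points by interpolating toward the nearest neighbour (k=1 SMOTE-style)."""
    virtual = []
    for _ in range(n_new):
        x = rng.choice(samples)
        others = [s for s in samples if s is not x]
        nn = min(others, key=lambda s: sum((a - b) ** 2 for a, b in zip(x, s)))
        gap = rng.random()
        virtual.append([a + gap * (b - a) for a, b in zip(x, nn)])
    return virtual

def train_stump(X, y, w):
    """Exhaustively pick the threshold stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (pol if xi[j] >= t else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, pol)
    return best

def stump_predict(stump, x):
    _, j, t, pol = stump
    return pol if x[j] >= t else -pol

def adaboost(X, y, rounds=5):
    """Discrete AdaBoost over decision stumps; labels must be -1 or +1."""
    w = [1.0 / len(X)] * len(X)
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_score(model, x):
    return sum(alpha * stump_predict(stump, x) for alpha, stump in model)

def grouping_adaboost(source, target, n_groups, rounds=5):
    """Split source data into equal groups, add the target data to each, boost per group."""
    size = len(source) // n_groups
    models = []
    for g in range(n_groups):
        block = source[g * size:(g + 1) * size] + target
        X = [features for features, _ in block]
        y = [label for _, label in block]
        models.append(adaboost(X, y, rounds))
    return models

def ensemble_predict(models, x):
    """Uniformly weighted linear combination of the per-group boosted scores."""
    score = sum(adaboost_score(m, x) for m in models) / len(models)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): a few labeled target points, enlarged with virtual samples.
rng = random.Random(0)
target_pos = [[1.0, 1.2], [1.5, 0.8], [0.9, 1.1]]
target_neg = [[-1.0, -1.1], [-1.3, -0.7], [-0.8, -1.2]]
virtual_pos = smote_like(target_pos, 4, rng)
virtual_neg = smote_like(target_neg, 4, rng)
target = ([(x, 1) for x in target_pos + virtual_pos]
          + [(x, -1) for x in target_neg + virtual_neg])
source = [([2.0, 2.0], 1), ([-2.0, -2.0], -1), ([1.8, 2.2], 1), ([-2.1, -1.9], -1),
          ([2.3, 1.7], 1), ([-1.7, -2.4], -1), ([2.1, 2.5], 1), ([-2.2, -2.0], -1)]
models = grouping_adaboost(source, target, n_groups=2)
print(ensemble_predict(models, [1.0, 1.0]), ensemble_predict(models, [-1.0, -1.0]))
```

In this sketch every group sees the full (augmented) target set but only a slice of the source data, so the scarce target-domain labels influence every member of the ensemble; whether the paper weights the members uniformly or otherwise is not stated in the abstract.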

CLC number: