Research on Deceptive Review Identification Based on a PU Learning Algorithm

Ren Yafeng (任亚峰); Ji Donghong (姬东鸿); Zhang Hongbin (张红斌); Yin Lan (尹兰)

doi:10.7544/issn1000-1239.2015.20131473

Deceptive Reviews Detection Based on Positive and Unlabeled Learning

Abstract

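Abstract (Chinese version): Identifying deceptive reviews has both theoretical significance and practical value. Previous work has concentrated on heuristic strategies and traditional fully supervised learning algorithms. Recent research shows that humans cannot effectively identify deceptive reviews using prior knowledge, so manually annotated datasets inevitably contain a certain number of mislabeled examples; simply applying traditional fully supervised learning to identify deceptive reviews is therefore not well founded. Examples that are easily mislabeled are called spy examples, and how their class labels are determined directly affects classifier performance. Based on a small number of truthful reviews and a large number of unlabeled reviews, a novel PU (positive and unlabeled) learning framework is proposed to identify deceptive reviews. First, a small number of highly reliable negative examples are identified from the unlabeled dataset. Second, multiple representative positive and negative examples are computed by integrating LDA (latent Dirichlet allocation) and K-means. Next, all spy examples are clustered with a Dirichlet process mixture model (DPMM), and population-nature and individual-nature strategies are mixed to determine the class labels of the spy examples. Finally, a multiple kernel learning algorithm is used to train the final classifier. Numerical experiments confirm the effectiveness of the proposed algorithm, which outperforms current baselines.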

Abstract: Identifying deceptive reviews has both theoretical significance and practical value. Previous work has focused on heuristic rules or traditional supervised methods. Recent research has shown, however, that humans cannot reliably identify deceptive reviews from prior knowledge alone, so human-annotated datasets inevitably contain some mislabeled examples. Because supervised learning depends on such labels, the problem remains highly challenging. Some ambiguous reviews, which we call spy examples, are easily mislabeled, and the key to identifying deceptive reviews is how these spy examples are handled. Based on some truthful reviews and a large number of unlabeled reviews, a novel approach, called mixing population and individual nature PU learning, is proposed. First, some reliable negative examples are identified from the unlabeled dataset. Second, representative positive and negative examples are generated by integrating latent Dirichlet allocation and K-means. Third, all spy examples are clustered into groups with a Dirichlet process mixture model, and two schemes (population nature and individual nature) are mixed to determine the category labels of the spy examples. Finally, multiple kernel learning is used to build the final classifier. Experimental results demonstrate that the proposed method can effectively identify deceptive reviews and outperforms current baselines.
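The first step of the framework, picking reliable negatives out of the unlabeled pool, can be illustrated with a small sketch. The abstract does not say how reliability is scored, so the sketch assumes a simple stand-in: unlabeled reviews are ranked by cosine similarity to the centroid of the known positive (truthful) reviews, and the least similar fraction is taken as reliable negatives. The function names, the whitespace tokenizer, and the `fraction` parameter are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term-count vector from lowercased, whitespace-split text (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors; 0.0 if either is empty."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean term vector of a non-empty list of term-count vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def reliable_negatives(positives, unlabeled, fraction=0.3):
    """Indices of the unlabeled reviews least similar to the positive centroid.

    Assumption: low similarity to known truthful reviews is used as a proxy
    for "reliable negative"; the paper's actual criterion may differ.
    """
    pos_centroid = centroid([bag_of_words(p) for p in positives])
    ranked = sorted(
        range(len(unlabeled)),
        key=lambda i: cosine(bag_of_words(unlabeled[i]), pos_centroid),
    )
    k = max(1, int(len(unlabeled) * fraction))
    return sorted(ranked[:k])


# Toy data, invented purely for illustration.
truthful = [
    "the room was clean and the staff were friendly",
    "clean room friendly staff good breakfast",
]
pool = [
    "friendly staff and clean room",
    "my husband and i loved this amazing luxury experience",
    "good breakfast and clean room",
]
negatives = reliable_negatives(truthful, pool, fraction=0.34)
spies = [i for i in range(len(pool)) if i not in negatives]
print("reliable negatives:", negatives)
print("remaining (spy) examples:", spies)
```

In this sketch, the reviews not selected as reliable negatives play the role of the spy examples, which the later clustering and labeling steps would have to resolve.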
