Chinese Text Deception Detection Based on Ensemble Learning

张虎; 谭红叶; 钱宇华; 李茹; 陈千

doi:10.7544/issn1000-1239.2015.20131552

Abstract

Deception detection is an important research topic in information security. Existing studies suggest that about one third of interpersonal communication involves potential deception, and deceptive messages are widespread across communication media. When deception threatens people's lives, the survival of enterprises, or national stability, overlooking it can cause incalculable losses. In massive volumes of Web information, deceptive texts are usually far fewer than non-deceptive ones, and existing methods do not yet detect deception accurately and efficiently under this imbalance, so an efficient automated method for flagging possibly deceptive messages is needed. To handle massive, imbalanced Web data, this paper proposes a deception detection method based on ensemble learning. First, an improved bisecting k-means partitioning method decomposes the training sample set. An independent classifier is then trained on each pair of positive and negative sample subsets, and each classifier computes a class output value for the sample under test. Finally, a min-max modular approach that incorporates the classification accuracy of the individual classifiers integrates the individual decisions. Experimental results verify the effectiveness of the method.
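The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses standard bisecting k-means rather than the improved variant, a hypothetical nearest-centroid scorer (`train_centroid_scorer`) in place of the unspecified base classifier, and the plain min-max combination rule without the accuracy weighting, whose form the abstract does not give. All function names and the toy 2-D data are assumptions made for illustration.

```python
import math
import random

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def two_means(points, rng, iters=20):
    """Split points into two clusters with plain 2-means."""
    c0, c1 = rng.sample(points, 2)
    for _ in range(iters):
        left = [p for p in points if dist(p, c0) <= dist(p, c1)]
        right = [p for p in points if dist(p, c0) > dist(p, c1)]
        if not left or not right:
            break
        c0, c1 = mean(left), mean(right)
    return left, right

def bisecting_kmeans(points, k, seed=0):
    """Standard bisecting k-means: keep splitting the largest cluster."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        if len(largest) < 2:          # nothing left to split
            clusters.append(largest)
            break
        left, right = two_means(largest, rng)
        if not left or not right:     # degenerate split: fall back to halving
            half = len(largest) // 2
            left, right = largest[:half], largest[half:]
        clusters += [left, right]
    return clusters

def train_centroid_scorer(pos, neg):
    """Hypothetical base classifier: score in [0, 1], higher = more positive."""
    cp, cn = mean(pos), mean(neg)
    def score(x):
        dp, dn = dist(x, cp), dist(x, cn)
        return 0.5 if dp + dn == 0 else dn / (dp + dn)
    return score

def train_modules(pos, neg, k_pos, k_neg, seed=0):
    """Train one scorer per (positive subset, negative subset) pair."""
    pos_parts = bisecting_kmeans(pos, k_pos, seed)
    neg_parts = bisecting_kmeans(neg, k_neg, seed)
    return [[train_centroid_scorer(p, n) for n in neg_parts] for p in pos_parts]

def min_max_score(modules, x):
    """Plain min-max rule: MIN over negative subsets, MAX over positive subsets."""
    return max(min(clf(x) for clf in row) for row in modules)

# Toy imbalanced data: few "deceptive" (positive) vs. more "truthful" samples.
deceptive = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]]
truthful = [[5.0, 5.0], [5.2, 4.9], [4.8, 5.1],
            [5.0, -5.0], [5.1, -4.8], [4.9, -5.2]]
modules = train_modules(deceptive, truthful, k_pos=1, k_neg=2)
print(min_max_score(modules, [0.1, 0.1]))   # near the deceptive samples
print(min_max_score(modules, [5.0, 5.0]))   # near the truthful samples
```

In this sketch only the larger, negative class is partitioned, so each module sees a roughly balanced pair of subsets. A sample is scored as positive only if every module in at least one row agrees, which is the role the MIN and MAX units play in a min-max modular combination.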
