ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2015, Vol. 52 ›› Issue (5): 1005-1013. doi: 10.7544/issn1000-1239.2015.20131552

• Information Processing •

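Chinese Text Deception Detection Based on Ensemble Learning

Zhang Hu, Tan Hongye, Qian Yuhua, Li Ru, Chen Qian

  1. (School of Computer & Information Technology, Shanxi University, Taiyuan 030006) (zhanghu@sxu.edu.cn)
  • Published: 2015-05-01
  • Online: 2015-05-01
  • Supported by:
    the National Natural Science Foundation of China (61005053, 61100138, 61373082, 61322211); the National High Technology Research and Development Program of China (863 Program) (2015AA015407); the Program for New Century Excellent Talents in University (20121401110013); the Research Project Supported by Shanxi Scholarship Council of China (2013-022); the Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi (2015104); the Open Project Foundation of the Information Security Evaluation Center of Civil Aviation University of China (CAAC-ISECCA-201402)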

Abstract: Deception detection is an important research topic in information security. Existing studies indicate that about one third of interpersonal communication involves potential deception, and large amounts of deceptive messages circulate across all kinds of communication media. When deception threatens people's lives, the survival of enterprises, or the stability of a country, overlooking it may cause incalculable losses. In massive Web data, deceptive texts are usually far fewer than non-deceptive texts, and existing methods do not detect them accurately or efficiently under this imbalance, so an automated method that helps people flag possibly deceptive messages is desirable. This paper proposes a deception detection method based on ensemble learning for such imbalanced data. First, an improved bisecting k-means method partitions the training sample set into subsets. Next, an independent classifier is trained on each pair of positive and negative sample subsets. Each classifier then computes a class output value for the test sample, and finally a min-max modular approach that incorporates the classification accuracy of the individual classifiers integrates the individual decisions. Experimental results verify the effectiveness of the method.

Key words: deception, deception detection, ensemble learning, sample partitioning, min-max modular support vector machine (M3-SVM)
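
The pipeline summarized in the abstract (partition the training set, train one classifier per pair of positive and negative subsets, combine the outputs with a min-max rule) can be illustrated with a minimal Python sketch. The abstract does not specify how the bisecting k-means step is modified or how classifier accuracy enters the combination, so the sketch makes several assumptions: it uses plain bisecting k-means that always splits the largest cluster, a nearest-centroid scorer as a stand-in for the SVM modules, and the standard min-max rule without accuracy weighting. All function names and the toy data are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def two_means(points, iters=20):
    """Split points into two clusters with plain 2-means (deterministic init)."""
    c0 = centroid(points)
    a = max(points, key=lambda p: math.dist(p, c0))  # farthest from the mean
    b = max(points, key=lambda p: math.dist(p, a))   # farthest from a
    left, right = [], []
    for _ in range(iters):
        left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
        if not left or not right:
            break
        a, b = centroid(left), centroid(right)
    return left, right


def bisecting_kmeans(points, k):
    """Plain bisecting k-means: bisect the largest cluster until k clusters exist.
    (Assumption: the paper's improved variant is not reproduced here.)"""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        if len(biggest) < 2:
            clusters.append(biggest)
            break
        left, right = two_means(biggest)
        if not left or not right:  # cluster cannot be split further
            clusters.append(biggest)
            break
        clusters += [left, right]
    return clusters


def train_module(pos, neg):
    """Stand-in for an SVM module: score > 0 means x is closer to the
    positive centroid than to the negative centroid."""
    cp, cn = centroid(pos), centroid(neg)
    return lambda x: math.dist(x, cn) - math.dist(x, cp)


def train_m3(pos, neg, k_pos, k_neg):
    """Train one module per (positive subset, negative subset) pair."""
    pos_parts = bisecting_kmeans(pos, k_pos)
    neg_parts = bisecting_kmeans(neg, k_neg)
    return [[train_module(p, n) for n in neg_parts] for p in pos_parts]


def m3_score(modules, x):
    """Min-max combination: MIN over negative subsets, then MAX over
    positive subsets. No accuracy weighting is applied in this sketch."""
    return max(min(f(x) for f in row) for row in modules)


# Toy imbalanced data: few "deceptive" (positive) points, more "truthful" ones.
deceptive = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)]
truthful = [(3.0, 3.0), (3.2, 2.9), (2.9, 3.1),
            (-3.0, 3.0), (-3.1, 2.8), (-2.8, 3.2)]
modules = train_m3(deceptive, truthful, k_pos=1, k_neg=2)

for sample in [(0.1, 0.1), (3.0, 3.0), (-3.0, 3.0)]:
    label = "deceptive" if m3_score(modules, sample) > 0 else "truthful"
    print(sample, label)
```

The MIN step requires a sample to look positive against every negative subset before a positive subset claims it, and the MAX step accepts the sample if any positive subset does so. Splitting the majority class into several subsets is what lets each module train on a more balanced pair.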
