Structured Ensemble Learning for Email Spam Filtering

刘伍颖; 王挺

Abstract

To resolve the conflict between low computational complexity and high classification accuracy in email spam filtering algorithms, a structured ensemble learning approach is proposed within the multi-field learning framework. It combines the results of multiple base classifiers according to document structure in order to achieve higher classification performance. Multiple lightweight base classifiers are built from string features of email documents, and labeled data are stored in a string-frequency index, so that each update and each query takes constant time. Exploiting the multi-field structure of email documents, two linear combination weights are proposed: one based on the historical classification effectiveness of the field classifiers, the other on the current classification contribution of the field documents. A compound linear combination weight that takes both factors into account is also proposed, and it improves overall classification accuracy. Experimental results on the TREC spam filtering task with immediate full feedback show that structured ensemble learning with the compound linear combination weight completes the filtering task in a short time (47.24 min), and that its overall performance, measured by 1-ROCA, reaches that of the best filter (0.0055) among the participants in the TREC 2007 spam track.
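The two mechanisms named in the abstract, a string-frequency index with constant-time update and query, and a linear combination of per-field classifier outputs, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class and function names, the use of character 4-grams as string features, the Laplace-smoothed score, and the plain numeric weights are all assumptions made for illustration; the abstract does not give the formulas for the historical, current, or compound weights, so the weights here are simply passed in by the caller.

```python
from collections import defaultdict


class StringFrequencyIndex:
    """Hypothetical index: maps a string feature to [spam_count, ham_count]."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, feature, is_spam):
        # One hash-table write per feature: average-case constant time.
        self.counts[feature][0 if is_spam else 1] += 1

    def spam_score(self, feature):
        # One hash-table read per feature; unseen features score 0.5.
        # Laplace smoothing is an assumption, not taken from the paper.
        spam, ham = self.counts.get(feature, (0, 0))
        return (spam + 1) / (spam + ham + 2)


def char_ngrams(text, n=4):
    # Overlapping character n-grams; n=4 is an arbitrary illustrative choice.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def field_score(index, text):
    # Base classifier for one field: mean spam score of its string features.
    grams = char_ngrams(text)
    if not grams:
        return 0.5
    return sum(index.spam_score(g) for g in grams) / len(grams)


def combine(scores, weights):
    # Normalized linear combination of per-field scores.
    # The weights are placeholders for whatever weighting scheme is used.
    total = sum(weights[f] for f in scores)
    if total == 0:
        return 0.5
    return sum(weights[f] * scores[f] for f in scores) / total
```

In an immediate-full-feedback setting, one such index per field (for example header, subject, body) would be queried to score each incoming message and then updated with that message's true label before the next message arrives.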
