Abstract:
To resolve the conflict between low computational complexity and high classification accuracy in email spam filtering algorithms, a structured ensemble learning approach is proposed within the multi-field learning framework. It combines the results of multiple base classifiers according to document structure to achieve higher classification performance. Multiple lightweight base classifiers are generated from string features of email documents, and a string-frequency index is used to store labeled data, so that each update and each lookup takes constant time. Based on the multi-field structure of email documents, two linear combination weights are proposed separately: one based on the historical classification effectiveness of the field classifiers, and one based on the current classification contribution of the field documents. Taking both factors into account, a compound linear combination weight is then proposed, which can improve overall classification accuracy. Experimental results on the TREC spam filtering task with immediate full feedback show that the structured ensemble learning method based on the compound linear combination weight completes the filtering task at high speed (47.24 min), and that its overall performance, measured by 1-ROCA, is comparable to the best result (0.0055) among the participants in the TREC 2007 spam track.
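
To make the combination scheme concrete, the following Python sketch shows one way a per-field string-frequency index and a compound weight could fit together. It is an illustration under stated assumptions, not the method's actual implementation: the class names `FieldClassifier` and `StructuredEnsemble`, the character 4-gram features, the smoothed spam ratio, the accuracy-based historical weight, and the length-based current weight are all hypothetical choices, since the abstract does not give the exact formulas.

```python
from collections import defaultdict

NGRAM = 4  # assumed string-feature length; not specified in the abstract


def ngrams(text, n=NGRAM):
    """Overlapping character n-grams, a stand-in for the string features."""
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 0))]


class FieldClassifier:
    """Lightweight classifier for one email field (hypothetical design)."""

    def __init__(self):
        # String-frequency index: feature -> [spam count, ham count].
        # A hash map gives constant average time per update or lookup.
        self.index = defaultdict(lambda: [0, 0])
        self.correct = 0  # past predictions that matched the feedback
        self.seen = 0     # past predictions in total

    def score(self, text):
        """Mean smoothed spam ratio of the field's features (0.5 = unsure)."""
        feats = ngrams(text)
        if not feats:
            return 0.5
        total = 0.0
        for f in feats:
            spam, ham = self.index.get(f, (0, 0))
            total += (spam + 1) / (spam + ham + 2)
        return total / len(feats)

    def historical_weight(self):
        """Assumed historical effectiveness: smoothed running accuracy."""
        return (self.correct + 1) / (self.seen + 2)

    def update(self, text, is_spam):
        """Immediate feedback: record accuracy, then index the features."""
        feats = ngrams(text)
        if not feats:
            return
        self.correct += int((self.score(text) > 0.5) == is_spam)
        self.seen += 1
        for f in feats:
            self.index[f][0 if is_spam else 1] += 1


class StructuredEnsemble:
    """Combines field classifiers with a compound linear weight."""

    def __init__(self, fields):
        self.classifiers = {name: FieldClassifier() for name in fields}

    def score(self, email):
        total_len = sum(len(email.get(n, "")) for n in self.classifiers) or 1
        num = den = 0.0
        for name, clf in self.classifiers.items():
            text = email.get(name, "")
            # Assumed current contribution: the field's share of the text.
            current = len(text) / total_len
            weight = clf.historical_weight() * current  # compound weight
            num += weight * clf.score(text)
            den += weight
        return num / den if den else 0.5

    def train(self, email, is_spam):
        for name, clf in self.classifiers.items():
            clf.update(email.get(name, ""), is_spam)


# Online usage: classify each message first, then learn from its label.
ens = StructuredEnsemble(["subject", "body"])
stream = [
    ({"subject": "cheap pills", "body": "buy cheap pills now"}, True),
    ({"subject": "meeting notes", "body": "agenda for the meeting tomorrow"}, False),
] * 3
for email, is_spam in stream:
    predicted_spam = ens.score(email) > 0.5
    ens.train(email, is_spam)
```

In this sketch the compound weight is simply the product of the two individual weights, normalized across fields. Any other way of combining historical effectiveness with current contribution would slot into `StructuredEnsemble.score` in the same place.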