    A Text Classification Method Based on Term Frequency Classifier Ensemble

    • Abstract: This paper proposes a text classification method based on an ensemble of term frequency classifiers. A term frequency classifier is a simple classifier obtained by counting the words in the corpus and the frequency with which each word occurs in each text. Although its generalization ability on its own is weak, it is cheap to compute and easy to update when training samples, or even new classes, are added, and the generalization ability of the overall learning system can be improved through the ensemble mechanism. The term frequency classifier is therefore well suited to serve as the base classifier in ensemble learning. The ensemble is built with an improved AdaBoost algorithm that adds a forced weight-redistribution mechanism to keep the algorithm from stopping prematurely, which makes it better suited to text classification. Experiments on the standard Reuters-21578 corpus show that the method achieves good results.
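
The abstract does not spell out the scoring rule or the exact weight-redistribution step, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a base classifier that scores a text by summing weighted per-class relative term frequencies, and an AdaBoost-style loop that, instead of stopping when the weighted error reaches 0.5, resets the sample weights to uniform and continues. All function names (`train_tf_classifier`, `tf_predict`, `boost`, `ensemble_predict`), the reset-to-uniform choice, and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def train_tf_classifier(docs, labels, weights):
    """Build per-class relative term frequencies, weighting each document."""
    freq = defaultdict(Counter)
    for doc, label, w in zip(docs, labels, weights):
        for term, count in Counter(doc.split()).items():
            freq[label][term] += w * count
    model = {}
    for label, counts in freq.items():
        total = sum(counts.values())
        model[label] = {t: c / total for t, c in counts.items()}
    return model


def tf_predict(model, doc):
    """Pick the class whose term frequencies best match the document (assumed rule)."""
    terms = Counter(doc.split())
    scores = {
        label: sum(n * rel.get(t, 0.0) for t, n in terms.items())
        for label, rel in model.items()
    }
    return max(sorted(scores), key=scores.get)


def boost(docs, labels, n_rounds=10):
    """AdaBoost-style loop with an assumed forced weight reset instead of early stop."""
    n = len(docs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        model = train_tf_classifier(docs, labels, weights)
        wrong = [tf_predict(model, d) != y for d, y in zip(docs, labels)]
        err = sum(w for w, bad in zip(weights, wrong) if bad)
        if err >= 0.5:
            # Assumption: redistribute weights uniformly and keep going,
            # where standard AdaBoost would stop here.
            weights = [1.0 / n] * n
            continue
        err = max(err, 1e-10)  # avoid division by zero for a perfect round
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, model))
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        weights = [w * math.exp(alpha if bad else -alpha)
                   for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def ensemble_predict(ensemble, doc):
    """Weighted vote over the base classifiers (expects a non-empty ensemble)."""
    votes = defaultdict(float)
    for alpha, model in ensemble:
        votes[tf_predict(model, doc)] += alpha
    return max(sorted(votes), key=votes.get)


# Toy data, invented for illustration only.
docs = ["wheat grain harvest", "grain export wheat",
        "oil crude barrel", "crude oil price"]
labels = ["grain", "grain", "crude", "crude"]
ensemble = boost(docs, labels, n_rounds=5)
print(ensemble_predict(ensemble, "wheat harvest"))
```

Because the base model is just a table of weighted counts, adding a new sample or class only requires adding its counts to the table, which is the kind of cheap updating the abstract attributes to term frequency classifiers.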

       
