A Feature Weighting Scheme for Text Categorization Based on Feature Importance

Liu He (刘赫); Liu Dayou (刘大有); Pei Zhili (裴志利); Gao Ying (高滢)

Abstract

Text categorization is a key research area in text mining, and feature weighting is one of its central problems. This paper proposes a feature weighting scheme for text categorization based on feature importance. Feature importance is defined using real-valued rough set theory, and through it the decision information that a feature carries for categorization is incorporated into that feature's weight. Experiments are carried out on two standard text datasets, Reuters-21578 Top10 and WebKB. Computing the total within-class scatter and the between-class scatter used in Fisher linear discriminant analysis shows that, on both datasets, the proposed scheme decreases the total within-class scatter and increases the between-class scatter; that is, samples of the same class become more compact and samples of different classes become more widely separated. The scheme thus improves the distribution of samples in the feature space and simplifies the mapping from samples to classes. Finally, the scheme is evaluated on the two datasets with Naive Bayes, kNN and SVM classifiers. The results show that it improves the precision, recall and F1 value of categorization.
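The scatter comparison described in the abstract can be illustrated with a minimal sketch. It computes the traces of the Fisher within-class and between-class scatter matrices before and after per-feature weighting. The function names, the toy data and the weights are hypothetical and chosen only for illustration; the abstract does not give the paper's feature importance formula, so the weights here are simply assumed. Because scaling features also changes absolute scatter values, the between-to-within ratio is the easier quantity to compare on raw vectors.

```python
from collections import defaultdict


def scatter_traces(samples, labels):
    """Return (within, between): traces of the Fisher within-class and
    between-class scatter matrices for the given feature vectors."""
    dim = len(samples[0])

    # Group the samples by class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    def mean(vectors):
        return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    global_mean = mean(samples)
    within = 0.0
    between = 0.0
    for members in by_class.values():
        class_mean = mean(members)
        # trace(S_w): squared distances of samples to their own class mean.
        within += sum(sq_dist(x, class_mean) for x in members)
        # trace(S_b): class size times squared distance from class mean to global mean.
        between += len(members) * sq_dist(class_mean, global_mean)
    return within, between


def apply_weights(samples, weights):
    """Scale each feature by its weight (assumed weights, not the paper's formula)."""
    return [[w * v for w, v in zip(weights, x)] for x in samples]


# Toy data: feature 0 separates the classes, feature 1 is noise.
samples = [[1.0, 0.0], [1.0, 4.0], [3.0, 0.0], [3.0, 4.0]]
labels = ["a", "a", "b", "b"]

print(scatter_traces(samples, labels))                              # (16.0, 4.0)
print(scatter_traces(apply_weights(samples, [1.0, 0.5]), labels))   # (4.0, 4.0)
```

In this toy case, down-weighting the noisy feature raises the between-to-within ratio from 0.25 to 1.0, which is the kind of effect the abstract reports for the two datasets.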
