Abstract:
Text categorization is a key research area in text mining, and feature weighting is one of its central problems. This paper proposes a feature weighting scheme for text categorization in which feature importance is defined on the basis of real rough set theory. Through this concept, the decision-making information that a feature carries for categorization is incorporated into the feature's weight. Experiments are performed on two standard international text datasets, Reuters-21578 Top10 and WebKB. Computing the total within-class scatter and the between-class scatter of the Fisher linear discriminant verifies that the proposed scheme decreases the total within-class scatter and increases the between-class scatter on both datasets; that is, it makes samples of the same class more compact and samples of different classes more widely separated. The proposed scheme thereby improves the spatial distribution of the samples and simplifies the mapping from samples to classes. Finally, the scheme is evaluated on the two datasets with Naive Bayes, kNN, and SVM classifiers. The experimental results show that it improves the precision, recall, and F1 value of categorization.
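The scatter comparison described above can be illustrated with a minimal sketch. It assumes the usual Fisher definitions and summarizes each scatter matrix by its trace; the abstract does not state the exact formulas used in the paper, so the function name `scatter_traces` and the toy data are hypothetical and serve only as an illustration.

```python
import numpy as np

def scatter_traces(X, y):
    """Return (within, between): traces of the Fisher scatter matrices.

    Assumed definitions (not taken from the paper):
      within  = sum over classes c, samples x in c, of ||x - mean_c||^2
      between = sum over classes c of n_c * ||mean_c - overall_mean||^2
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    within = 0.0
    between = 0.0
    for label in np.unique(y):
        Xc = X[y == label]              # samples belonging to this class
        class_mean = Xc.mean(axis=0)
        within += np.sum((Xc - class_mean) ** 2)
        between += Xc.shape[0] * np.sum((class_mean - overall_mean) ** 2)
    return float(within), float(between)

# Toy document vectors (rows) with two features and two classes.
X_toy = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
y_toy = np.array([0, 0, 1, 1])
within_toy, between_toy = scatter_traces(X_toy, y_toy)
```

Under these assumptions, a weighting scheme with the effect reported here would lower the first value and raise the second when the function is applied to the reweighted document vectors. Because the traces change with the overall scale of the weights, a comparison between two schemes is only meaningful after consistent normalization, for example unit-length document vectors.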