Abstract:
Text categorization (TC) is an important research direction in text mining. It aims to assign one or more predefined category labels to a text document, and it provides efficient methods for document management and information retrieval. A major problem in automatic text categorization is how to select the best feature subset from the original high-dimensional feature space, so that the categorization algorithm works efficiently and achieves higher precision. In this paper, feature selection methods and weight adjustment techniques are discussed and analyzed, and their influence on the precision and efficiency of text classification is pointed out. Furthermore, the TEF-WA (term evaluation function-weight adjustment) technique is introduced: a new weight function that incorporates a feature weight evaluation function and adjusts the effect of each feature term in the classifier according to the term's strength. To evaluate the TEF-WA method, experiments are carried out on training document collections of several different scales, using various term evaluation functions such as document frequency, information gain, expected cross entropy, CHI, the weight of evidence for text, and the term frequency or document frequency formulas. The experimental results show that the TEF-WA technique is effective in improving classification precision and reducing computational complexity.
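As a rough illustration of two of the term evaluation functions named above, document frequency and CHI, the following minimal sketch scores terms on a toy labeled corpus. The corpus, the function names `document_frequency` and `chi_square`, and the 2x2 contingency formulation of CHI are assumptions made for illustration only. The sketch does not implement TEF-WA, whose weight function is not specified in this abstract.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))  # count each term once per document
    return df


def chi_square(docs, term, category):
    """CHI statistic of a term for a category, from a 2x2 contingency table.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category that lack the term
    d: docs outside the category that lack the term
    """
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is absent or universal: no evidence
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical toy corpus: (tokens, category label) pairs.
docs = [
    (["ball", "goal", "team"], "sport"),
    (["team", "match", "goal"], "sport"),
    (["stock", "market", "team"], "finance"),
    (["market", "price", "stock"], "finance"),
]

df = document_frequency(docs)
scores = {term: chi_square(docs, term, "sport") for term in df}
# Keep the two highest-scoring terms for the "sport" category.
selected = sorted(scores, key=scores.get, reverse=True)[:2]
```

In this toy setting, a term that appears in every document of one category and in no other document receives the highest CHI score, whereas a term spread across categories scores lower even when its document frequency is high. That difference is the kind of trade-off a feature selection step has to weigh.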