Abstract:
With the rapid growth of textual information on the Internet, text categorization has already been one of the key research directions in data mining. Text categorization is a supervised learning process, defined as automatically distributing free text into one or more predefined categories. At the present, text categorization is necessary for managing textual information and has been applied into many fields. However, text categorization has two characteristics: high dimensionality of feature space and high level of feature redundancy. For the two characteristics, χ\+2 is used to deal with high dimensionality of feature space, and information novelty is used to deal with high level of feature redundancy. According to the definition of maximal marginal relevance, a feature selection method based on maximal marginal relevance is proposed, which can reduce redundancy between features in the process of feature selection. Furthermore, the experiments are carried out on two text data sets, namely, Reuters-21578 Top10 and OHSCAL. The results indicate that the feature selection method based on maximal marginal relevance is more efficient than χ\+〗2 and information gain. Moveover it can improve the performance of three different categorizers, namely, nave Bayes, Rocchio and kNN.