Research on the Algorithm of Feature Selection Based on Gini Index for Text Categorization
Abstract
With the rapid development of the World Wide Web, large numbers of documents are available on the Internet, and automatic text categorization has become a key technology for organizing and processing large amounts of text data. For most classifiers that use the vector space model (VSM), text preprocessing has become the bottleneck of categorization: the high dimensionality of the feature space is prohibitive for many classifiers. Adopting an appropriate text feature selection algorithm to reduce the dimensionality of the feature space therefore plays a key role. Many text feature selection algorithms exist at present. This paper does not discuss all of them in detail; instead, it presents a new text feature selection method, the Gini index. An improved Gini index is used for text feature selection, and a measure function is constructed on the basis of it. The experimental results show that text feature selection based on the Gini index can further improve categorization performance, and that its computational complexity is low.