A Study on Constraints for Feature Selection in Text Categorization

Xu Yan; Li Jintao; Wang Bin; Sun Chunming; Zhang Sen

Xu Yan, Li Jintao, Wang Bin, Sun Chunming, Zhang Sen. A Study on Constraints for Feature Selection in Text Categorization[J]. Journal of Computer Research and Development, 2008, 45(4): 596-602.

Abstract

Text categorization (TC) is the process of assigning texts to one or more predefined categories based on their content. Due to the increased availability of documents in digital form and the rapid growth of online information, TC has become a key technique for handling and organizing text data. One of the most important issues in TC is feature selection (FS). Many FS methods have been put forward and widely used in the TC field, such as information gain (IG), document frequency thresholding (DF) and mutual information (MI). Empirical studies show that some of these (e.g. IG, DF) produce better categorization performance than others (e.g. MI). A basic research question is why these FS methods perform differently. Many existing works seek to answer this question through empirical studies. In this paper, a theoretical performance evaluation function for FS methods in text categorization is put forward. Some basic desirable constraints that any reasonable FS function should satisfy are defined, and some popular FS methods, including IG, DF and MI, are then checked against these constraints. It is found that IG satisfies these constraints, that there are strong statistical correlations between DF and the constraints, and that MI does not satisfy the constraints. Experimental results on the Reuters-21578 and OHSUMED corpora show that the empirical performance of a feature selection method is closely related to how well it satisfies these constraints.
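For readers unfamiliar with the three FS methods named in the abstract, the following is a minimal sketch of how DF, IG and MI scores are commonly computed for a single term. It assumes the widely used textbook formulations (document counts, entropy reduction over categories, and pointwise mutual information between a term and a category), which may differ in detail from the definitions used in the paper. The function name `feature_scores` and the toy corpus are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (base 2) of a category distribution given raw counts."""
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total)
                for k in counts.values() if k > 0)


def feature_scores(docs, labels, term):
    """Return (DF, IG, MI-per-category) for one term.

    docs   : list of sets of tokens, one set per document (assumed input shape)
    labels : list of category labels, one per document
    """
    n = len(docs)
    cat_counts = Counter(labels)
    # Category counts restricted to documents that contain the term.
    with_term = Counter(lab for d, lab in zip(docs, labels) if term in d)
    without_term = Counter({c: cat_counts[c] - with_term[c] for c in cat_counts})

    # DF: number of documents containing the term.
    df = sum(with_term.values())

    # IG: category entropy minus entropy conditioned on term presence/absence.
    h_cond = ((df / n) * entropy(with_term, df)
              + ((n - df) / n) * entropy(without_term, n - df))
    ig = entropy(cat_counts, n) - h_cond

    # MI(t, c) = log2( P(t, c) / (P(t) * P(c)) ), estimated from counts.
    mi = {}
    for c, n_c in cat_counts.items():
        a = with_term[c]  # documents in c that contain the term
        mi[c] = math.log2(a * n / (df * n_c)) if a > 0 else float("-inf")
    return df, ig, mi


# Hypothetical toy corpus, not taken from Reuters-21578 or OHSUMED.
toy_docs = [{"wheat", "price"}, {"wheat", "farm"},
            {"oil", "price"}, {"oil", "barrel"}]
toy_labels = ["grain", "grain", "energy", "energy"]

wheat_scores = feature_scores(toy_docs, toy_labels, "wheat")
price_scores = feature_scores(toy_docs, toy_labels, "price")
```

In this toy corpus "wheat" and "price" have the same DF, but only "wheat" separates the two categories, so IG distinguishes them while DF alone does not. This only illustrates how the scores behave on made-up data; it is not the constraint analysis carried out in the paper.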
