Xu Yan, Li Jintao, Wang Bin, Sun Chunming, Zhang Sen. A Study on Constraints for Feature Selection in Text Categorization[J]. Journal of Computer Research and Development, 2008, 45(4): 596-602.

A Study on Constraints for Feature Selection in Text Categorization

  • Published Date: April 14, 2008
  • Text categorization (TC) is the process of assigning texts to one or more predefined categories based on their content. With the increased availability of documents in digital form and the rapid growth of online information, TC has become a key technique for handling and organizing text data. One of the most important issues in TC is feature selection (FS). Many FS methods have been proposed and are widely used in TC, such as information gain (IG), document frequency thresholding (DF), and mutual information (MI). Empirical studies show that some of these methods (e.g., IG and DF) produce better categorization performance than others (e.g., MI). A basic research question is why these FS methods perform differently. Most existing work tries to answer this question empirically. This paper puts forward a theoretical performance evaluation function for FS methods in text categorization. Some basic desirable constraints that any reasonable FS function should satisfy are defined, and several popular FS methods, including IG, DF, and MI, are then checked against them. IG is found to satisfy these constraints, DF shows strong statistical correlations with them, and MI does not satisfy them. Experimental results on the Reuters-21578 and OHSUMED corpora show that the empirical performance of an FS method is closely tied to how well it satisfies these constraints.
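
The abstract names IG, DF, and MI but does not give their formulas or state the constraints. As a rough illustration only, the sketch below scores a single term against a single category using definitions commonly found in the TC literature: document frequency as a raw count, MI as the pointwise form log P(t, c) / (P(t) P(c)), and IG as the reduction in category entropy from knowing whether the term is present. The function names `entropy` and `fs_scores`, the 2x2 count layout, and the toy counts are assumptions made for this example; they are not taken from the paper and do not implement its constraints.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(k / total * math.log2(k / total) for k in counts if k > 0)


def fs_scores(a, b, c, d):
    """Score one term for one category from a 2x2 table of document counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("contingency table is empty")

    # DF: number of documents that contain the term.
    df = a + b

    # MI (pointwise form, natural log): log P(t, c) / (P(t) * P(c)).
    mi = math.log(a * n / ((a + b) * (a + c))) if a > 0 else float("-inf")

    # IG: H(C) minus H(C | term present or absent).
    h_c = entropy([a + c, b + d])
    h_c_given_t = 0.0
    if a + b > 0:
        h_c_given_t += (a + b) / n * entropy([a, b])
    if c + d > 0:
        h_c_given_t += (c + d) / n * entropy([c, d])
    ig = h_c - h_c_given_t

    return {"DF": df, "MI": mi, "IG": ig}


# Toy counts (hypothetical): 100 documents, 50 in the category.
rare = fs_scores(1, 0, 49, 50)      # term occurs once, inside the category
common = fs_scores(40, 10, 10, 40)  # term occurs often, mostly inside it

for name, scores in (("rare", rare), ("common", common)):
    print(name, scores)
```

On these toy counts the pointwise MI score ranks the once-occurring term above the frequent one, while IG and DF rank them the other way round. This only shows that the three scores can disagree on the same data; it says nothing about which of the paper's constraints each method satisfies.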