A Feature Selection Method Based on Maximal Marginal Relevance

刘赫; 张相洪; 刘大有; 李燕军; 尹立军

Abstract

With the rapid growth of textual information on the Internet, text categorization has become one of the key research directions in data mining. Text categorization is a supervised learning process, defined as automatically assigning free text to one or more predefined categories. At present, it is necessary for managing textual information and has been applied in many fields. However, text categorization has two characteristics: a high-dimensional feature space and a high level of feature redundancy. To address these two characteristics, the χ² statistic is used to deal with the high dimensionality of the feature space, and the idea of information novelty is used to deal with the high level of feature redundancy. Following the definition of maximal marginal relevance, the two are combined into a feature selection method based on maximal marginal relevance, which can remove a large number of redundant features during feature selection. Experiments are carried out on two text data sets, Reuters-21578 Top10 and OHSCAL. The results indicate that the proposed method is more efficient than the χ² statistic and information gain as feature selection methods, and that it improves the performance of three different classifiers: naïve Bayes, Rocchio, and kNN.
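The abstract does not state the scoring formula, so the following is only a minimal sketch of how a χ² relevance score and a redundancy penalty could be combined greedily in the spirit of maximal marginal relevance. It assumes binary term-occurrence features, a binary class label, cosine similarity between feature columns as the redundancy measure, and a trade-off weight `lam`. The function names `chi2_scores` and `mmr_select`, the demo data, and all of these choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def chi2_scores(X, y):
    """Chi-square statistic of each binary feature against a binary label.

    X: (n_docs, n_features) array of 0/1 term occurrences.
    y: (n_docs,) array of 0/1 class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    a = X.T @ y              # term present, class positive
    b = X.T @ (1 - y)        # term present, class negative
    c = y.sum() - a          # term absent, class positive
    d = (1 - y).sum() - b    # term absent, class negative
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    # Features that are constant over the corpus get a score of 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def mmr_select(X, y, k, lam=0.5):
    """Greedily pick k features, trading relevance against redundancy.

    Assumed score: lam * relevance - (1 - lam) * max similarity to any
    already selected feature. lam=1.0 reduces to plain chi-square ranking.
    """
    X = np.asarray(X, dtype=float)
    rel = chi2_scores(X, y)
    if rel.max() > 0:
        rel = rel / rel.max()            # scale relevance to [0, 1]
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0              # avoid division by zero
    sim = (X.T @ X) / np.outer(norms, norms)   # cosine similarity of columns

    n_features = X.shape[1]
    selected = []
    max_sim = np.zeros(n_features)       # redundancy w.r.t. selected set
    available = np.ones(n_features, dtype=bool)
    for _ in range(min(k, n_features)):
        score = lam * rel - (1 - lam) * max_sim
        score[~available] = -np.inf      # never pick a feature twice
        best = int(np.argmax(score))
        selected.append(best)
        available[best] = False
        max_sim = np.maximum(max_sim, sim[best])
    return selected


# Toy corpus (made up for illustration): features 0 and 1 are identical
# copies of the label, feature 2 mostly marks the negative class, and
# feature 3 is unrelated to the label.
y_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0])
X_demo = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
]).T

print(mmr_select(X_demo, y_demo, k=2, lam=1.0))  # relevance only
print(mmr_select(X_demo, y_demo, k=2, lam=0.5))  # relevance and novelty
```

On this toy corpus, ranking by relevance alone keeps both copies of the duplicated feature, while the penalized score replaces the second copy with a feature that carries new information. This is the kind of redundancy reduction the abstract describes.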
