Abstract:
In data analysis, feature selection can reduce redundancy among features, improve the comprehensibility of models, and reveal hidden structure in high-dimensional data. In this paper, we propose UFS-MI, a novel unsupervised feature selection approach based on mutual information. UFS-MI evaluates the importance of each feature with a selection criterion, UmRMR, that takes both relevance and redundancy into account. Both quantities are measured with mutual information: relevance is the dependence of a feature on the latent class, and redundancy is the dependence between features. Features are selected or ranked stepwise, one at a time, by estimating how well each candidate feature decreases the uncertainty of the other features (i.e., how well it retains the information they contain). The effectiveness of UFS-MI is supported by a theoretical proof showing that it can select features highly correlated with the latent class. An empirical comparison between UFS-MI and several traditional feature selection methods is also conducted on popular data sets; the results show that UFS-MI attains better or comparable performance and is applicable to both numerical and non-numerical features.
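
The stepwise, relevance-minus-redundancy idea described above can be illustrated with a minimal Python sketch. It is not the UmRMR criterion itself, whose exact form is not given in this abstract. As an assumption, the sketch uses the average mutual information between a feature and all features as a stand-in for relevance to the unobserved latent class, and the average mutual information with already selected features as redundancy. The function names `mutual_information` and `rank_features` are hypothetical, and features are assumed to be discrete (numerical features would need to be discretized first).

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum over observed value pairs: p(a,b) * log(p(a,b) / (p(a) * p(b))).
    return sum(
        (c / n) * log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def rank_features(columns):
    """Greedy stepwise ranking of features, one at a time.

    columns: list of equal-length sequences, one sequence per feature.
    Returns feature indices in the order they were selected.
    """
    m = len(columns)
    # Pairwise mutual information; the diagonal holds each feature's entropy.
    mi = [[mutual_information(columns[i], columns[j]) for j in range(m)]
          for i in range(m)]
    # ASSUMPTION: relevance proxy = average MI with all features (self included),
    # standing in for dependence on the latent class, which is unobserved.
    relevance = [sum(row) / m for row in mi]

    selected, remaining = [], list(range(m))
    while remaining:
        def score(i):
            if not selected:
                return relevance[i]
            # Redundancy = average MI with the features already selected.
            redundancy = sum(mi[i][j] for j in selected) / len(selected)
            return relevance[i] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Under these assumptions, a feature that duplicates an already selected one receives a large redundancy penalty and is pushed toward the end of the ranking, while a feature carrying independent information is picked earlier.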