Feature Selection Based on the Measurement of Correlation Information Entropy

Dong Hongbin; Teng Xuyang; Yang Xue

doi:10.7544/issn1000-1239.2016.20160172

Abstract

Feature selection aims to choose a smaller feature subset from the original feature set that delivers comparable or better performance in data mining and machine learning tasks. Because the physical meaning of the features is left unchanged, a smaller feature set also makes the data easier to interpret. Traditional information-theoretic methods tend to assess feature relevance and redundancy separately, and so cannot evaluate the combined effect of a feature subset as a whole. This paper applies correlation information entropy, a technique from the data fusion field, to feature selection and uses it to measure the degree of independence and redundancy among features. A feature correlation matrix is constructed from the mutual information between features and class labels together with pairwise feature combinations; computing the eigenvalues of this matrix takes into account the multivariate relationships among the different features in the subset. On this basis, a feature ranking method is proposed and, combined with a parameter analysis, an adaptive feature subset selection method. Experimental results show that the proposed methods are effective and efficient on classification tasks.
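To make the eigenvalue step more concrete, the following is a minimal sketch of how a correlation information entropy can be computed from a correlation matrix. The abstract does not state the formula, so the definition used here is an assumption: the entropy, taken with logarithm base N, of the eigenvalues of an N×N symmetric correlation matrix after normalizing them to sum to 1. The function name `correlation_information_entropy` and the use of NumPy are illustrative choices, and the construction of the matrix from mutual information is not reproduced.

```python
import numpy as np


def correlation_information_entropy(R):
    """Entropy (log base N) of the normalized eigenvalues of an N x N matrix.

    Assumption: R is symmetric and positive semidefinite with a unit
    diagonal, so its eigenvalues are non-negative and sum to N.
    Returns a value near 1 when the variables are independent (R = I)
    and near 0 when they are fully redundant (all entries equal to 1).
    """
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    if R.ndim != 2 or R.shape[1] != n or n < 2:
        raise ValueError("R must be a square matrix with at least 2 rows")

    # eigvalsh is intended for symmetric matrices and returns real eigenvalues.
    eigvals = np.linalg.eigvalsh(R)

    # Clip tiny negative values caused by floating-point round-off,
    # then normalize so the eigenvalues form a probability distribution.
    eigvals = np.clip(eigvals, 0.0, None)
    p = eigvals / eigvals.sum()

    # Drop exact zeros: the term p * log(p) tends to 0 as p tends to 0.
    p = p[p > 0]

    # Dividing by log(n) converts the natural logarithm to base n.
    return float(-np.sum(p * np.log(p)) / np.log(n))
```

Under this assumed definition, a higher value indicates that the features in the subset are closer to mutually independent, while a lower value indicates more redundancy. Because the score depends on the whole eigenvalue spectrum, it reflects the subset jointly rather than one feature pair at a time.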
