ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (8): 1684-1695. doi: 10.7544/issn1000-1239.2016.20160172

Special Topic: 2016 Special Issue on Frontier Technologies in Data Mining

• Artificial Intelligence •

Feature Selection Based on the Measurement of Correlation Information Entropy

Dong Hongbin, Teng Xuyang, Yang Xue

  1. (College of Computer Science and Technology, Harbin Engineering University, Harbin 150001) (donghongbin@hrbeu.edu.cn)
  • Online: 2016-08-01
  • Supported by:
    National Natural Science Foundation of China (61472095, 61502116); Open Fund of the Key Laboratory of Intelligent Education and Information Engineering, Heilongjiang Provincial Department of Education




Abstract: Feature selection aims to select a smaller feature subset from the original feature set, one that provides performance in data mining and machine learning tasks approximating or exceeding that of the original set. Since it does not transform the physical meaning of the features, a smaller feature set also makes the data more interpretable. Traditional information-theoretic methods tend to measure feature relevance and redundancy separately, ignoring the combination effect of the feature subset as a whole. In this paper, correlation information entropy, a technique from the data fusion field, is applied to feature selection to measure the degree of independence and redundancy among features. A correlation matrix is constructed from the mutual information between features and their class labels together with combinations of feature pairs, and computing its eigenvalues takes the multivariable relations among the different features in a subset fully into account. On this basis, a feature ranking method is proposed and, combined with a parameter analysis, an adaptive feature subset selection method. Experimental results demonstrate the effectiveness and efficiency of the proposed methods on classification tasks.
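The correlation information entropy at the core of the method has a standard definition in the data fusion literature: for the N×N correlation matrix R of a candidate feature subset, with eigenvalues λ_i, H_R = -Σ_{i=1}^{N} (λ_i/N) log_N(λ_i/N). H_R equals 1 when the N features are mutually independent (R = I) and 0 when they are fully correlated. The sketch below assumes this standard definition only; the paper's exact mutual-information-based construction of R is not specified in the abstract, so the function takes an already-built correlation matrix as input.

```python
import numpy as np

def correlation_information_entropy(R):
    """Correlation information entropy of an N x N symmetric correlation matrix R.

    H_R = -sum_i (lam_i / N) * log_N(lam_i / N), where lam_i are the
    eigenvalues of R. Returns ~1.0 for fully independent features (R = I)
    and ~0.0 for fully correlated features (R all ones).
    """
    N = R.shape[0]
    lam = np.linalg.eigvalsh(R)       # eigenvalues of the symmetric matrix
    p = np.clip(lam, 0.0, None) / N   # normalize; clip tiny negative roundoff
    nz = p > 1e-12                    # convention: 0 * log 0 = 0
    return float(-np.sum(p[nz] * (np.log(p[nz]) / np.log(N))))
```

Feature ranking and adaptive subset selection could then score each candidate feature by the change in H_R when it is added to the current subset, though the exact scoring rule is given in the paper itself, not in this abstract.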

Key words: feature selection, correlation information entropy, group effect, multivariable correlation, correlation matrix
