ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2019, Vol. 56 ›› Issue (12): 2578-2588.doi: 10.7544/issn1000-1239.2019.20180541

• 人工智能 • 上一篇    下一篇

EasiFFRA:一种基于邻域粗糙集的属性快速约简算法

王念1,2,彭政红1,2,崔莉1   

  1. 1(中国科学院计算技术研究所 北京 100190);2(中国科学院大学 北京 100190) (wangnian@ict.ac.cn)
  • 出版日期: 2019-12-01
  • 基金资助: 
    国家自然科学基金项目(61672498);国家重点研发计划项目(2016YFC0302300)

EasiFFRA: A Fast Feature Reduction Algorithm Based on Neighborhood Rough Set

Wang Nian1,2, Peng Zhenghong1,2, Cui Li1   

  1. 1(Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190);2(University of Chinese Academy of Sciences, Beijing 100190)
  • Online: 2019-12-01

摘要: 从高维异构感知信息中提取有效特征是支撑物联网系统预测与识别的基础.物联网场景中通常包括多个多种感知节点,系统通常会从感知数据中提取大量特征,其中不乏部分无关和冗余特征.这些无关及冗余特征会降低系统的运行速度,引入冗余计算,更会影响后续的分类及预测等机器学习操作的性能.因而高效识别并提取低维有效的特征子集是物联网数据分析所面临的一大挑战.邻域粗糙集方法能够在保持数据集可分性的前提下,识别和去除无关及冗余特征子集,从而达到降维效果.但由于现有基于邻域粗糙集的特征约简算法的计算开销大、运行时间长,故而并未得到广泛应用.提出了一种基于邻域关系对称性及决策值过滤策略的特征快速约简算法EasiFFRA.EasiFFRA可通过改进的散列分桶方法加速正域样本计算,可检验并过滤冗余决策值样本,从而降低现有方法中由于重复距离评估所带来的冗余计算.实验结果表明:EasiFFRA在实际采集的水质数据集和多个不同样本量及维度的公开数据集中平均加快75.45%的特征约简时间,其约简结果和已有邻域粗糙集特征约简算法等效,可有效解决物联网数据分析中由冗余及无关特征导致的分类及预测精度下降问题,有重要应用价值.

关键词: 邻域粗糙集, 特征约简, 对称机制, 过滤机制, 散列分桶

Abstract: Extracting effective features from the high-dimensional and heterogeneous feature set is significant, which is the basis for the prediction and classification of Internet of things (IoT) applications. There are usually multiple sensors deployed in the system and quite a few features are extracted to make full use of the environment information. The high dimensional features always contain redundant and unrelated features, which reduces not only the speed of system, but also the performance of the classification. It’s necessary to recognize and delete them. Neighborhood rough set (NRS) is a popular method for dimensionality reduction, which deletes the unrelated and redundant features while keeping the separability of dataset. However, the NRS method has not been widely applied because of the huge computing cost. In this paper, a Easi fast feature reduction algorithm (EasiFFRA) is proposed based on the symmetry of adjacent domain relationships and the decision attribute filtering mechanism, which reduces the redundant computing by preferentially traversing the buckets with relatively concentrated neighbor samples distribution, and stores the samples into a Hash table that cannot belong to the positive region under the current feature subset. Furthermore, this method can reduce the number of distance calculation significantly through filtering the samples which have the same label with the current sample. Moreover, the algorithm validity is verified by a real world dataset, and 12 open datasets are used. The results show that compared with FHARA, EasiFFRA reduces the computing time by 75.45%. EasiFFRA algorithm reduces the effect of unrelated and redundant features on the results of classification and prediction, and enhances the real-time performance of the neighborhood rough set based features reduction method, which has important application value.

Key words: neighborhood rough set, feature reduction, symmetry mechanism, filtration mechanism, Hash buckets

中图分类号: