ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2018, Vol. 55 ›› Issue (5): 986-993. doi: 10.7544/issn1000-1239.2018.20170077

• Artificial Intelligence •

Selective Ensemble Classification Integrated with Affinity Propagation Clustering

Meng Jun, Zhang Jing, Jiang Dingling, He Xinyu, Li Lishuang

  1. (School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023) (mengjun@dlut.edu.cn)
  • Published online: 2018-05-01
  • Supported by:
    the National Natural Science Foundation of China (61472061, 61672126)

Abstract: Extracting valuable information from massive gene microarray data is an active research topic in bioinformatics. Gene microarray data are characterized by high dimensionality, small sample size, and high redundancy. A gene selection method based on the intersection neighborhood rough set is therefore proposed to select key genes for classifying microarray data. First, pathway knowledge is used to preselect genes, with each pathway unit corresponding to one gene subset. A rough-set-based attribute reduction method then removes redundancy, leaving the key genes. Because there are many pathway knowledge units, a correspondingly large number of base classifiers is generated, so a subset of the base classifiers must be selected to further improve their diversity and the efficiency of the ensemble. Affinity propagation (AP) clustering does not require the number of clusters or the starting points to be set in advance, and it can cluster more quickly and accurately. AP clustering is therefore used to group the base classifiers into clusters that differ substantially from one another, and one classifier is selected from each cluster to build the ensemble classifier. Experimental results on three Arabidopsis thaliana biotic and abiotic stress response microarray datasets show that the proposed method improves accuracy by up to 12% over existing ensemble methods.
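
The classifier-selection step summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not specify the similarity measure, the AP preference value, the rule for picking a classifier within a cluster, or the combination rule, so the choices below (negative count of disagreeing predictions, median preference, most accurate member per cluster, majority voting) are assumptions, and all function names and the toy data are hypothetical.

```python
from collections import Counter
from statistics import median


def affinity_propagation(S, damping=0.5, max_iter=200):
    """Plain affinity propagation on an n x n similarity matrix S (list of lists).

    S[k][k] is the preference of item k to become an exemplar.
    Returns a list that maps each item to the index of its exemplar.
    """
    n = len(S)
    if n == 1:
        return [0]
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')], damped
        for i in range(n):
            for k in range(n):
                best_other = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) = sum over i' != k of max(0, r(i',k)), both damped
        for k in range(n):
            pos = [max(0.0, R[j][k]) for j in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:  # fallback: keep the single strongest candidate
        exemplars = [max(range(n), key=lambda k: R[k][k] + A[k][k])]
    return [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
            for i in range(n)]


def select_ensemble(preds, y_val):
    """Cluster base classifiers by their validation predictions and keep one per cluster."""
    n = len(preds)
    # Assumed similarity: negative number of validation samples on which two classifiers disagree.
    S = [[-sum(a != b for a, b in zip(preds[i], preds[j])) for j in range(n)]
         for i in range(n)]
    # Assumed preference: median of the off-diagonal similarities.
    pref = median(S[i][j] for i in range(n) for j in range(n) if i != j)
    for k in range(n):
        S[k][k] = pref
    labels = affinity_propagation(S)
    acc = [sum(p == t for p, t in zip(preds[i], y_val)) / len(y_val) for i in range(n)]
    # Assumed selection rule: the most accurate classifier of each cluster.
    selected = [max((i for i in range(n) if labels[i] == c), key=lambda i: acc[i])
                for c in sorted(set(labels))]
    return labels, selected


def majority_vote(preds, selected):
    """Combine the selected classifiers by majority vote (ties go to the first label seen)."""
    return [Counter(preds[i][s] for i in selected).most_common(1)[0][0]
            for s in range(len(preds[0]))]


# Toy validation labels and base-classifier predictions (made up for illustration).
y_val = [1, 1, 1, 1, 0, 0, 0, 0]
preds = [
    [0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 1, 1],
    [1, 1, 1, 1, 0, 1, 1, 1],
]
labels, selected = select_ensemble(preds, y_val)
ensemble_pred = majority_vote(preds, selected)
print("cluster exemplar per classifier:", labels)
print("selected classifiers:", selected)
print("ensemble prediction:", ensemble_pred)
```

In this sketch the number of clusters is not fixed in advance; it emerges from the preference value, which is the property of AP clustering that the abstract highlights. A lower preference would generally yield fewer clusters and hence a smaller ensemble.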

Key words: selective ensemble, affinity propagation, pathway, intersection neighborhood, gene microarray data

CLC number: