Abstract:
Mining useful knowledge from gene expression data is an active research topic in bioinformatics. Gene microarray data are characterized by high dimensionality, small sample size, and high redundancy. We therefore present a gene selection method based on the intersection neighborhood rough set to select important genes for the classification of microarray data. First, pathway knowledge is used to preselect genes, with each pathway unit corresponding to a gene subset. A rough-set-based attribute reduction method is then applied to select important, non-redundant genes for classification. Because there are many pathway knowledge units, many base classifiers are generated. To further improve the diversity among the base classifiers and the efficiency of the ensemble model, only a subset of the base classifiers should be retained. Affinity propagation (AP) clustering does not require the number of clusters or the initial points to be specified in advance, and it can obtain clusters quickly and accurately. The AP clustering algorithm is therefore used to group the base classifiers into clusters with significant diversity among them, and one classifier is then selected from each cluster to build the final ensemble classifier. Experimental results on three Arabidopsis thaliana biotic and abiotic stress response datasets show that the proposed method improves accuracy by more than 12% over existing ensemble methods.
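
The classifier selection step can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work. It assumes that the similarity between two base classifiers is the negative number of samples on which their predictions disagree, that the AP preference is the median off-diagonal similarity, and that the exemplar of each cluster is the classifier kept for the ensemble. The function names, the damping value, the fixed iteration count, and the toy predictions are all hypothetical.

```python
from statistics import median


def disagreement_similarity(preds):
    """Similarity of two classifiers = negative count of disagreeing predictions (assumed measure)."""
    n = len(preds)
    return [[-sum(a != b for a, b in zip(preds[i], preds[k])) for k in range(n)]
            for i in range(n)]


def affinity_propagation(sim, damping=0.5, n_iter=200):
    """Plain affinity propagation; returns (exemplar indices, exemplar label per item)."""
    n = len(sim)
    # Assumption: the shared preference is the median off-diagonal similarity.
    pref = median(sim[i][k] for i in range(n) for k in range(n) if i != k)
    S = [[pref if i == k else sim[i][k] for k in range(n)] for i in range(n)]
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(n_iter):
        # Responsibility: how well k suits i compared with the best competing candidate.
        for i in range(n):
            for k in range(n):
                best_other = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # Availability: accumulated positive support for k from the other items.
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]  # support from every item except k itself
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:  # no exemplar emerged; signal this instead of guessing
        return [], [-1] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


# Toy predictions of six hypothetical base classifiers on eight samples.
preds = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
]
exemplars, labels = affinity_propagation(disagreement_similarity(preds))
print("selected classifiers:", exemplars)
print("cluster label per classifier:", labels)
```

In this sketch the number of clusters is not supplied by the caller; it emerges from the preference value, which is the property of AP clustering that motivates its use here. Keeping one exemplar per cluster reduces the ensemble to classifiers whose predictions differ from one another.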