Abstract:
Microarray data has been widely and successfully applied to cancer classification, where the purpose is to classify and predict the diagnostic category of a sample from its gene expression profile. A typical microarray dataset consists of expression levels for a large number (usually thousands or tens of thousands) of genes on a relatively small number (often fewer than one hundred) of samples. Of these tens of thousands of genes, only a small number contribute to cancer classification. As a consequence, one basic and important question associated with cancer classification is how to identify a small subset of informative genes that contribute the most to the classification task. This procedure is usually called gene selection. Gene selection has been widely studied in statistical pattern recognition, machine learning, and data mining. The authors attempt to review the field of gene selection based on their earlier work, introduce the background and the two basic concepts of gene selection (gene relevance and relevance measure), categorize the existing gene selection methods from the statistics, machine learning, and data mining areas, demonstrate the performance of several representative gene selection algorithms through an empirical study using public microarray data, identify the existing problems of gene selection, and point out current trends and future directions.