ISSN 1000-1239 CN 11-1777/TP

• Paper •

### Classification Method for Class Imbalance and Its Application in Bioinformatics

1. (School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001) (zouquan@xmu.edu.cn)
• Published: 2010-08-15

### A Classification Method for Class-Imbalanced Data and Its Application on Bioinformatics

Zou Quan, Guo Maozu, Liu Yang, and Wang Jun

1. (School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
• Online: 2010-08-15

Abstract: A classification method is proposed for class-imbalanced data, which are common in bioinformatics tasks such as identifying snoRNA, distinguishing real microRNA precursors from pseudo ones, and mining SNPs from EST sequences. The method builds on ensemble learning. First, the majority class is randomly divided into several equal-sized subsets such that each subset, combined with the minority class, forms a class-balanced training set. Then several classifiers based on different mechanisms are selected and trained on these balanced training sets. Once the classifiers are built, they vote on the final prediction for each new sample. The training phase uses a strategy similar to AdaBoost: the samples misclassified by each classifier are added to the training sets of the next two classifiers. The training sets are modified repeatedly until a classifier predicts its own training set accurately or the maximum number of repetitions is reached. This strategy can improve the performance of weak classifiers through voting. Experiments on five UCI data sets and on the three bioinformatics tasks mentioned above demonstrate the performance of the method. Furthermore, a software program named LibID, which can be used in much the same way as LibSVM, has been developed for researchers in bioinformatics and other fields.
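The data-handling steps described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the LibID implementation: the function names (`balanced_training_sets`, `propagate_errors`, `vote`), the rule for choosing the number of subsets, and the choice not to wrap around from the last classifier to the first are all hypothetical, since the abstract does not specify them. Samples are represented as `(features, label)` tuples.

```python
import random
from collections import Counter


def balanced_training_sets(majority, minority, seed=0):
    """Randomly split the majority class into roughly equal subsets and
    pair each subset with the whole minority class.

    Assumption: the number of subsets is len(majority) // len(minority),
    so each subset is about as large as the minority class.
    """
    if not minority:
        raise ValueError("minority class must not be empty")
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    k = max(1, len(shuffled) // len(minority))
    # Striding by k gives k subsets whose sizes differ by at most one.
    subsets = [shuffled[i::k] for i in range(k)]
    return [subset + list(minority) for subset in subsets]


def propagate_errors(training_sets, index, misclassified, spread=2):
    """Add the samples misclassified by classifier `index` to the training
    sets of the next `spread` classifiers.

    Assumption: no wrap-around, so the last classifiers pass their errors
    to fewer (or no) successors.
    """
    last = min(index + 1 + spread, len(training_sets))
    for j in range(index + 1, last):
        training_sets[j].extend(misclassified)


def vote(predictions):
    """Return the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Toy data: 12 majority samples (label 0) and 4 minority samples (label 1).
majority = [(i, 0) for i in range(12)]
minority = [(100 + i, 1) for i in range(4)]

sets = balanced_training_sets(majority, minority)
propagate_errors(sets, 0, [(999, 1)])  # pretend classifier 0 missed one sample
final_label = vote([1, 0, 1])          # three classifiers vote on a new sample
```

In a full pipeline, each training set would be handed to a different kind of classifier, and the error-propagation step would be repeated until a classifier fits its own training set or an iteration limit is reached.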