Abstract:
Binary classification with imbalanced data arises in many fields and remains an open problem. Beyond predicting the class label directly, many applications also need the probability that an observation belongs to a given class. However, most existing research focuses on classification performance and neglects probability estimation. The aim of this paper is to improve class probability estimation (CPE) while maintaining classification performance. We propose a new regression approach that adopts the generalized linear model as its basic framework and uses the calibration loss function as the optimization objective. Because the generalized extreme value (GEV) distribution is asymmetric and flexible, we use it to formulate the link function, which makes the model better suited to binary classification with imbalanced data. For model estimation, since the shape parameter strongly influences modeling precision, we propose two methods for estimating the shape parameter of the GEV distribution. Experiments on synthetic datasets demonstrate the accuracy of the shape parameter estimation. Experimental results on real data further suggest that the proposed approach, compared with three other commonly used regression algorithms, performs well on both classification and CPE. In addition, the proposed algorithm outperforms other optimization algorithms in terms of computational efficiency.
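To make the role of the GEV link concrete, the following is a minimal sketch of how a linear predictor could be mapped to a class probability through the standard GEV cumulative distribution function. The abstract does not state the paper's exact parameterization, so the form p = F(eta; xi), the function name `gev_inverse_link`, and the tolerance used for the Gumbel limit are all assumptions made for illustration.

```python
import math


def gev_inverse_link(eta: float, xi: float) -> float:
    """Map a linear predictor eta to a probability with the standard GEV CDF.

    Assumed form (not taken from the paper): p = exp(-(1 + xi*eta)^(-1/xi)),
    with the Gumbel limit p = exp(-exp(-eta)) when xi is numerically zero.
    """
    if abs(xi) < 1e-12:
        # Gumbel limit: log of the inner term is simply -eta.
        log_term = -eta
    else:
        t = 1.0 + xi * eta
        if t <= 0.0:
            # eta lies outside the support of the distribution:
            # below the lower bound when xi > 0, above the upper bound when xi < 0.
            return 0.0 if xi > 0 else 1.0
        # log of (1 + xi*eta)^(-1/xi), computed in log space for stability.
        log_term = -math.log(t) / xi
    # Cap the exponent so math.exp cannot overflow; the result then underflows to 0.
    return math.exp(-math.exp(min(log_term, 700.0)))


if __name__ == "__main__":
    # The same |eta| on either side of zero gives probabilities that do not
    # sum to one, which is the asymmetry a logistic link lacks.
    for xi in (-0.3, 0.0, 0.3):
        lo, hi = gev_inverse_link(-1.0, xi), gev_inverse_link(1.0, xi)
        print(f"xi={xi:+.1f}  p(-1)={lo:.4f}  p(+1)={hi:.4f}  sum={lo + hi:.4f}")
```

In this sketch the shape parameter xi controls how skewed the response curve is, which is consistent with the abstract's point that the shape parameter has a significant influence on modeling precision.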