高级检索

    一种基于广义极值分布的非平衡数据分类算法

    A GEV-Based Classification Algorithm for Imbalanced Data

    • 摘要: 在许多业务应用中,非平衡数据分类问题都会频繁出现,然而这个问题仍未得到很好的解决.除了直接预测数据对应的分类标签,许多应用还可能关心这个预测的准确性有多少.然而,已有的许多研究都主要集中在分类准确度上而忽略分类概率预测值的准确度.为了解决这个问题,提出了一种新的线性回归算法,该算法在广义线性模型的框架下,结合广义极值(generalized extreme value, GEV)分布作为链接函数以及校准损失函数作为目标优化函数,形成凸优化问题,利用广义极值分布的非对称性解决非平衡数据分类问题.另外,由于广义极值分布的形状参数对建模精度有较大影响,还提出了2种参数寻优方法.在实验部分,人工数据集和真实数据集均表明所提算法有着优异的分类性能以及准确的分类概率预测.

       

      Abstract: The problem of binary classification with imbalanced data appears in many fields and is still not completely solved. In addition to predicting the classification label directly, many applications also care about the probability that data belongs to a certain class. However, much of the existing research is mainly focused on the classification performance but neglects the probability estimation. The aim of this paper is to improve the performance of class probability estimation (CPE) and ensure the classification performance. A new approach of regression is proposed by adopting the generalized linear model as the basic framework and using the calibration loss function as the objective optimization function. Considering the asymmetry and the flexibility of the generalized extreme value (GEV) distribution, we use it to formulate the link function, which contributes to binary classification with imbalanced data. As to the model estimation, because of the significant influence of the shape parameter on modeling precision, two methods to estimate the shape parameter in GEV distribution are proposed. Experiments on synthetic datasets prove the accuracy of the shape parameter estimation. Besides, experimental results on real data suggest that our proposed approach, compared with other three commonly used regression algorithms, performs well on the classification performance as well as CPE. In addition, the proposed algorithm also outperforms other optimization algorithms in terms of the computational efficiency.

       

    /

    返回文章
    返回