Abstract:
In many real applications, such as Web search, medical diagnosis, and earthquake identification, obtaining labeled negative examples for learning is difficult or expensive. This renders traditional classification techniques ineffective, because their precondition that every class has labeled instances is not met. Semi-supervised learning from positive and unlabeled data has therefore become an active research topic. Many methods have been proposed in past years, but they cope poorly with imbalanced classification, especially when the number of hidden negative examples in the unlabeled set is small or the distribution of training examples differs sharply between classes. In this paper, a novel KL divergence-based semi-supervised classification algorithm, named LiKL (semi-supervised learning from imbalanced data based on KL divergence), is proposed to tackle this problem. The proposed approach first identifies likely positive examples in the unlabeled set, then identifies likely negative ones, and finally applies an enhanced logistic regression classifier to label the remaining unlabeled data. Experiments show that, compared with prior work, the proposed approach not only improves precision and recall but is also highly robust.
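The abstract only sketches the LiKL pipeline, so the following is an illustrative toy version under our own simplifying assumptions: feature vectors are normalized into discrete distributions and scored by KL divergence against a positive-class prototype (the paper's actual KL-based criterion may differ), the lowest-divergence unlabeled examples are taken as likely positives and the highest as likely negatives, and a plain SGD logistic regression stands in for the paper's enhanced classifier. All data and thresholds here are made up for the demonstration.

```python
import math

def normalize(v, eps=1e-9):
    """Turn a non-negative feature vector into a discrete distribution."""
    s = sum(v) + eps
    return [x / s for x in v]

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=300):
    """Plain SGD logistic regression (a stand-in, not the paper's enhanced version)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi  # gradient of the log-loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Labeled positives and an unlabeled pool (toy count-like features).
positives = [[8, 1, 1], [7, 2, 1], [9, 1, 2]]
unlabeled = [[8, 2, 1], [1, 1, 8], [7, 1, 2], [2, 1, 9]]

# Step 1: score unlabeled examples by KL divergence to a positive "prototype".
prototype = normalize([sum(col) for col in zip(*positives)])
scored = sorted(unlabeled, key=lambda x: kl_divergence(normalize(x), prototype))

# Step 2: lowest-divergence examples become likely positives,
# highest-divergence ones become likely negatives.
likely_pos, likely_neg = scored[:2], scored[-2:]

# Step 3: train the final classifier on positives plus the extracted examples.
X = positives + likely_pos + likely_neg
y = [1] * (len(positives) + len(likely_pos)) + [0] * len(likely_neg)
w, b = train_logistic(X, y)

print(predict(w, b, [9, 1, 1]))  # positive-like point
print(predict(w, b, [1, 2, 9]))  # negative-like point
```

The two-step extraction matters for imbalanced PU data: by seeding the classifier with both likely positives and likely negatives, the final model sees examples of both classes even when true negatives are scarce in the unlabeled pool.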