Abstract:
Class imbalance is one of the problems plaguing practitioners in the data mining community. First, existing strategies for dealing with this problem are reviewed. When the training set is skewed, the popular kNN text classifier tends to mislabel instances of rare categories as common ones, degrading macro F1. To alleviate this, a novel concept, the critical point (CP) of the text training set, is proposed. The properties of CP are then explored, and an algorithm for evaluating its lower approximation (LA) and upper approximation (UA) is given. Traditional kNN is then adapted by integrating LA or UA and the training-set size into its decision functions; this version is called the self-adaptive kNN classifier with weight adjustment. To verify its feasibility, two groups of experiments are carried out. The first group compares the performance of different shrink factors, which can be viewed as a comparison with Tan's work, and shows that at LA or UA the classifier achieves better macro F1. The second group compares against random re-sampling, with traditional kNN as a baseline. Experiments on four corpora show that the self-adaptive kNN text classifier with weight adjustment outperforms random re-sampling, improving macro F1 markedly. The proposed method is, to some extent, similar to cost-sensitive learning.
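The weight-adjustment idea can be illustrated with a minimal sketch of a vote-weighted kNN text classifier. The weighting scheme below (inverse class frequency raised to a hypothetical `shrink` exponent) is an assumption for illustration only; the paper's actual decision functions, built on LA/UA of the critical point, are not reproduced here.

```python
# Illustrative sketch only: a kNN text classifier whose neighbor votes are
# down-weighted for common categories. The inverse-class-frequency weighting
# and the "shrink" exponent are assumptions for illustration, not the paper's
# exact decision function.
from collections import Counter
import math

def knn_predict(train_vecs, train_labels, query, k=5, shrink=1.0):
    """Return the label with the highest weighted vote among the k nearest neighbors."""
    # Class sizes, used to derive per-class weights (rarer class -> larger weight).
    sizes = Counter(train_labels)
    n = len(train_labels)
    weight = {c: (n / sizes[c]) ** shrink for c in sizes}

    # Cosine similarity between sparse term->tf-idf dict vectors.
    def cos(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Keep the k most similar training documents.
    sims = sorted(((cos(v, query), y) for v, y in zip(train_vecs, train_labels)),
                  reverse=True)[:k]

    # Similarity votes, scaled by the class weight of each neighbor.
    votes = Counter()
    for s, y in sims:
        votes[y] += s * weight[y]
    return votes.most_common(1)[0][0]
```

With `shrink=0` the classifier reduces to plain similarity-weighted kNN, which on a skewed training set favors the common class; a positive `shrink` boosts rare-class neighbors, which is the behavior the abstract attributes to weight adjustment.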