Abstract:
The co-training algorithm is constrained by the assumption that the feature set can be split into two subsets that are both compatible and independent. In real-world applications, however, this assumption is usually violated to some degree. The authors propose two methods to evaluate mutual independence, using conditional mutual information or conditional CHI statistics, and present a method to construct a mutual independence model (MID-Model) for the initial feature set. Based on the MID-Model, two novel feature partition algorithms, PMID-MI and PMID-CHI, are developed. The former uses conditional mutual information to evaluate the mutual independence between two features; the latter uses conditional CHI statistics. As a result, a feature set can be divided into two conditionally independent subsets using PMID-MI or PMID-CHI. Compared with the random splitting method, both PMID-MI and PMID-CHI achieve better performance. In addition, the conditional independence between the two subsets is verified by several diversity measures, such as the Q statistic, the correlation coefficient ρ, disagreement, double fault, and the integrative measure DM. Then, combining the MID-Model and diversity measures, an improved semi-supervised categorization algorithm named SC-PMID is developed. Two classifiers can be co-trained on a pair of independent subsets; their independence reduces the chance of both classifiers agreeing on an erroneous label for an unlabeled example. Experimental results show that the SC-PMID algorithm can significantly improve semi-supervised categorization precision.
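As a rough illustration (not the authors' implementation), the conditional mutual information I(Xi; Xj | C) between two discrete features Xi, Xj given the class label C can be estimated from co-occurrence counts; a value near zero suggests the two features are conditionally independent and may safely be placed in different subsets. All function and variable names below are hypothetical:

```python
# Sketch of a count-based estimator for I(Xi; Xj | C), the independence
# score that a PMID-MI-style feature partition could use. Assumes
# discrete feature values and labels given as equal-length sequences.
from collections import Counter
from math import log2

def conditional_mutual_information(xi, xj, c):
    """Estimate I(Xi; Xj | C) in bits; 0 indicates conditional independence."""
    n = len(c)
    n_c = Counter(c)                 # counts of each class value
    n_xc = Counter(zip(xi, c))       # joint counts of (xi, class)
    n_yc = Counter(zip(xj, c))       # joint counts of (xj, class)
    n_xyc = Counter(zip(xi, xj, c))  # joint counts of (xi, xj, class)
    cmi = 0.0
    for (x, y, cls), count in n_xyc.items():
        # I(Xi;Xj|C) = sum P(x,y,c) * log[ P(x,y,c)P(c) / (P(x,c)P(y,c)) ];
        # the 1/n normalizations cancel inside the ratio.
        ratio = (count * n_c[cls]) / (n_xc[(x, cls)] * n_yc[(y, cls)])
        cmi += (count / n) * log2(ratio)
    return cmi
```

With such a score, a greedy partition can repeatedly assign each feature to whichever subset minimizes its summed dependence on the features already placed there.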