A Data Anonymization Approach Based on Impurity Gain and Hierarchical Clustering
-
Graphical Abstract
-
Abstract
Data anonymization is one of the important solutions to preserve privacy in data publishing. The basic concept of data anonymization and the application models are introduced, and the requirements that an anonymized dataset should meet are discussed. To resist the background knowledge attack, a new data anonymization approach based on impurity gain and hierarchical clustering is brought out. The impurity of a cluster is used to measure the randomicity of sensitive attributes, and the clusters' combination process is controlled by the restrictions that the information loss caused by generalization should be minimized and the impurity gain should be maximized. With the method, the anonymization results of a dataset can meet the requirements of k-anonymity model and l-diversity model, meanwhile, the information loss is minimized and the values of the sensitive attributes in each cluster has a uniform distribution. An evaluation method is provided in the experiment section, which compares anonymized dataset with the original one to evaluate the quality by calculating the average information loss and impurity. The experimental results validate the availability of the method.
-
-