Abstract:
The t-closeness model is effective at protecting data sets against skewness attacks and similarity attacks. However, the earth mover's distance (EMD), which t-closeness uses to measure the distance between distributions, does not adequately account for the stability between distributions, so it cannot fully measure the distance between them. When the stability between distributions is too large, the privacy risk increases greatly. To address these limitations and measure the distance between distributions accurately, we propose the SABuk t-closeness model, which builds on traditional t-closeness and combines EMD with KL divergence to construct a new distance measure. In addition, following the hierarchy of the sensitive attribute (SA), the model partitions a table into buckets based on the semantic similarity of SA values, and then uses a greedy algorithm to generate the minimum groups that satisfy the requirement on the distance between distributions. Finally, it adopts the k-nearest neighbour algorithm to choose similar quasi-identifier (QI) values. Experimental results indicate that the SABuk t-closeness model reduces information loss at a small cost in running time, and that it preserves the privacy of sensitive data well while maintaining high data utility.
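To make the idea of combining the two measures concrete, the following is a minimal sketch and not the formula used by SABuk t-closeness, which the abstract does not specify. It assumes a categorical sensitive attribute with equal ground distance (under which EMD reduces to half the L1 distance) and a hypothetical weighted sum with weight `alpha`; the names `combined_distance` and `alpha` and the example distributions are illustrative assumptions.

```python
import math

def emd_equal_distance(p, q):
    """EMD between two discrete distributions when every pair of
    distinct values is at ground distance 1 (half the L1 distance)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats. Terms with p_i == 0 contribute 0;
    q_i is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

def combined_distance(p, q, alpha=0.5):
    """Hypothetical combination (an assumption, not the paper's formula):
    a weighted sum of EMD and KL divergence."""
    return alpha * emd_equal_distance(p, q) + (1 - alpha) * kl_divergence(p, q)

# Illustrative global SA distribution and two group distributions.
q_global = [0.5, 0.3, 0.2]
p_group1 = [0.6, 0.2, 0.2]
p_group2 = [0.5, 0.4, 0.1]

for name, p in (("group1", p_group1), ("group2", p_group2)):
    print(name,
          round(emd_equal_distance(p, q_global), 4),
          round(kl_divergence(p, q_global), 4),
          round(combined_distance(p, q_global), 4))
```

In this example both groups are at the same EMD from the global distribution, yet their KL divergences differ, so a measure that includes a KL term can tell apart groups that EMD alone treats as equally distant.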