Abstract:
Privacy-preserving data publication has attracted sustained attention in recent years. It seeks a trade-off between preserving data privacy and maintaining data utility. Clustering, which has been widely studied in data mining, is a crucial step in advanced data analysis. There is an inherent tension between clustering and data obfuscation. Clustering depends heavily on the characteristics of individual records to segment data into different clusters. Data obfuscation, by contrast, usually suppresses individual characteristics in order to avoid leaking individual privacy. It is therefore difficult to accommodate both data privacy and clustering utility in the published data. Various data distortion and limited-distribution techniques have been proposed to address this problem. This paper surveys the state of the art in data obfuscation methods for clustering applications. It discusses the mutual constraints among the granularity of the clustering characteristics to be preserved, the maintenance of clustering usability, and the security of data privacy. Further, the principles and merits of several prevalent methods, such as data anonymization, data randomization, data swapping, and synthetic data substitution, are compared from the perspective of balancing privacy preservation against clustering usability. Following a comprehensive analysis of the existing techniques, some unaddressed problems and future directions are highlighted.