ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2022, Vol. 59 ›› Issue (1): 182-196.doi: 10.7544/issn1000-1239.20200701

• 隐私保护 • 上一篇    下一篇

基于属性分割的高维二值数据差分隐私发布

洪金鑫1,吴英杰1,蔡剑平2,孙岚1   

  1. 1(福州大学数学与计算机科学学院 福州 350108);2(厦门华厦学院信息与智能机电工程学院 福建厦门 361024) (fzu_hjx@163.com)
  • 出版日期: 2022-01-01
  • 基金资助: 
    福建省自然科学基金项目(2017J01754,2018J01797)

Differentially Private High-Dimensional Binary Data Publication via Attribute Segmentation

Hong Jinxin1, Wu Yingjie1, Cai Jianping2, Sun Lan1   

  1. 1(College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108);2(College of Information and Smart Electromechanical Engineering, Xiamen Huaxia University, Xiamen, Fujian 361024)
  • Online: 2022-01-01
  • Supported by: 
    This work was supported by the Natural Science Foundation of Fujian Province of China (2017J01754, 2018J01797).

摘要: 通常随着数据集属性维度的增加,高维数据的差分隐私发布方法所需的时间成本和产生的噪声干扰也会随之增大,尤其是对于高维二值数据很容易被过大的噪声所覆盖.因此,针对高维二值数据的隐私发布问题,提出了一种高效且低噪的发布方法PrivSCBN(differentially private spectral clustering Bayesian network).首先,该方法基于Jaccard距离,使用满足差分隐私的谱聚类算法来划分属性集,然后根据划分的结果来进一步分割原始数据集,从而实现数据的降维.其次,该方法基于动态规划思想并结合指数机制,使用满足差分隐私的贝叶斯网络快速构建算法来为每个分割后的子集构建贝叶斯网络.最后,该方法利用条件概率在二值数据上的取值特点,对从贝叶斯网络中提取的条件分布进行加噪,并通过控制贝叶斯网络的最大入度数来减少其产生的噪声大小.通过在3个真实高维二值数据集上的实验,验证了PrivSCBN方法的高效性与可用性.

关键词: 差分隐私, 高维二值数据发布, 贝叶斯网络, 属性划分, 动态规划, 条件分布

Abstract: Generally, as the attribute dimension of the data set increases, the time cost and noise interference generated by the differential privacy publishing method of high-dimensional data will also increase. Especially for high-dimensional binary data, it is easy to be covered by excessive noise. Therefore, an efficient and low-noise publishing method PrivSCBN(differentially private spectral clustering Bayesian network) is proposed for the issue of privacy publishing of high-dimensional binary data. Firstly, based on Jaccard distance, this method uses a spectral clustering algorithm which satisfies differential privacy to divide the attributes set, and further segments the original data set, so as to achieve dimension reduction. Secondly, based on the idea of dynamic programming and combined with the exponential mechanism, this method uses a fast building Bayesian network algorithm which satisfies differential privacy to construct Bayesian network for each subset after segmentation. Finally, this method uses the value characteristic of conditional probability on binary data to add noise to conditional distribution extracted from Bayesian network, and reduces the noise by controlling the maximum in-degrees of Bayesian network. The efficiency and availability of the PrivSCBN method are verified by experiments on three real high-dimensional binary data sets.

Key words: differential privacy, high-dimensional binary data publication, Bayesian network, attribute division, dynamic programming, conditional distribution

中图分类号: