ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2021, Vol. 58 ›› Issue (3): 624-637.doi: 10.7544/issn1000-1239.2021.20200319

• 软件技术 • 上一篇    下一篇

基于直方图的隐私键-值数据收集算法

张啸剑1,徐雅鑫1,付楠1,孟小峰2   

  1. 1(河南财经政法大学计算机与信息工程学院 郑州 450002);2(中国人民大学信息学院 北京 100872) (xjzhang82@ruc.edu.cn)
  • 出版日期: 2021-03-01
  • 基金资助: 
    国家自然科学基金项目(61502146, 91646203, 91746115, 62072156);河南省自然科学基金项目(162300410006);河南省科技攻关项目(202102310563);河南财经政法大学青年拔尖人才资助计划项目

Towards Private Key-Value Data Collection with Histogram

Zhang Xiaojian1, Xu Yaxin1, Fu Nan1, Meng Xiaofeng2   

  1. 1(School of Computer & Information Engineering, Henan University of Economics and Law, Zhengzhou 450002);2(School of Information, Renmin University of China, Beijing 100872)
  • Online: 2021-03-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61502146, 91646203, 91746115, 62072156), the Natural Science Foundation of Henan (162300410006), the Key Technologies Research and Development Program of Henan Province (202102310563), and the Young Talents Fund of Henan University of Economics and Law.

摘要: 基于本地差分隐私的用户数据收集与分析算法已延伸到了键-值数据类型.然而, 该类数据值域大小与稀疏性以及本地扰动机制直接制约着收集与分析精度.针对现有机制难以有效应对该类数据收集的不足, 提出了一种基于直方图技术的有效收集与分析算法HISKV(histogram-based key-value data collection), 该算法首先结合用户分组策略寻找最优截断长度, 利用最优截断-抽样技术处理值域过大与稀疏性问题, 然后结合截断结果随机抽取单个键-值对进行离散化处理.针对离散化结果, 设计一种高效的本地扰动机制LRR_KV(local random response for key-value data), 该机制结合具体的键分配不同的本地扰动概率.每个用户利用LRR_KV机制扰动离散化的键-值对之后发送给收集者, 收集者结合用户的报告值对每个键的频率及其值所对应的均值进行估计.理论分析了HISKV算法的无偏性、所产生的方差以及最大偏差, 并与现有的键-值收集算法在真实与合成的数据集上进行比较, 实验结果表明HISKV算法优于同类算法.

关键词: 本地差分隐私, 随机应答机制, 键-值数据, 频率估计, 均值估计

Abstract: Recently, user data collection and analysis with local differential privacy has extended into key-value data. The trade-off between the size and sparsity of domain and perturbation method directly constrains the accuracy of the collection and analysis of such data. To remedy the deficiency caused by the domain size and perturbating method, this paper employs histogram technology to propose an efficient solution, called HISKV, to collect key-value data. HISKV firstly uses a user-grouping strategy and partial privacy budget to find the optimal length of truncation and enables each user to truncate his/her key-value data set. And then, based on the truncated set, each user samples one key-value pair and uses the discretization and perturbation method to process this pair. To perturb key-value data efficiently, a novel mechanism in HISKV, named LRR_KV is proposed, which allocates different perturbing probability for different keys. In LRR_KV, each user adopts this mechanism to add noise to his/her sampled pair, and sents the report to a collector. Based on the reports from all of the users, the collector estimates the frequency of each key and the mean of the values. To evaluate the utility of HISKV, we firstly conduct theoretical analysis on unbias, variance, and error bound of LRR_KV, and then perform experiments on real and synthetic datasets to compare different methods. The experimental results show that HISKV outperforms its competitors.

Key words: local differential privacy, randomized response mechanism, key-value data, frequency estimation, mean estimation

中图分类号: