差分隐私下的一种频繁序列模式挖掘方法
Frequent Sequential Pattern Mining under Differential Privacy
-
摘要: 频繁序列模式挖掘是数据挖掘领域的1个基本问题,然而模式本身及其支持度计数都有可能泄露用户隐私信息.差分隐私(differential privacy, DP)作为一种新出现的隐私保护技术,定义了一个相当严格的攻击模型,通过添加噪音使数据失真达到隐私保护的目的.由于序列数据内在序列性和高维度的特点,给差分隐私应用于频繁序列模式挖掘带来了挑战.对此提出了一种基于交互式差分隐私保护框架的频繁序列模式挖掘算法Diff-FSPM(differential-privacy frequent sequential pattern mining).该算法利用指数机制获取最优序列长度,并采用一种维规约策略获得原始序列数据集的规约表示,有效降低序列维度的影响;应用前缀树压缩频繁序列模式,利用拉普拉斯机制产生的噪音扰动频繁模式的真实支持度计数,同时采用闭频繁序列模式和Markov假设,有效分配隐私预算,并利用一致性约束后置处理,增强输出模式的可用性.理论角度证明算法满足ε-差分隐私,实验结果验证算法具有较好的可用性.Abstract: Frequent sequential pattern mining is an exploratory problem in the field of data mining. However, directly releasing the discovered frequent patterns and the corresponding true supports may reveal the individuals privacy. The state-of-the-art solution for this problem is differential privacy, which offers a strong degree of privacy protection by adding noise. Due to the inherent sequentiality and high-dimensionality in sequences, it is challenging to apply differential privacy to frequent sequential pattern mining. To address those problems, this paper presents a differentially private method called Diff-FSPM that uses an interactive way to find frequent patterns. To reduce the impact of dimensionality, Diff-FSPM first employs the exponential mechanism to obtain the optimal sequential length for truncating each sequence in the original dataset. After that, Diff-FSPM relies on a prefix-tree to compress all the frequent patterns, and then utilizes the Laplace mechanism to perturb the true support of the frequent patterns. To efficiently allocate the privacy budget, Diff-FSPM uses the closet frequent pattern and Markov assumption to guide the mining process. Finally, Diff-FSPM adopts a post-processing technique with consistency constraint to boost the accuracy of the returned noisy support counts. We theoretically prove that Diff-FSPM satisfies ε-differential privacy, and the experimental results show that it outperforms the existing methods in terms of utility.