Citation: | Wang Xiujun, Mo Lei, Zheng Xiao, Wei Linna, Dong Jun, Liu Zhi, Guo Longkun. Sampling Based Fast Publishing Algorithm with Differential Privacy for Data Stream[J]. Journal of Computer Research and Development, 2024, 61(10): 2433-2447. DOI: 10.7544/issn1000-1239.202440481 |
Many cloud native database applications need to handle massive data streams. To analyze group trend information in these data streams in real time without compromising individual user privacy, these applications require the capability to quickly create differentially private histograms for the most recent dataset at any given moment. However, existing histogram publishing methods lack efficient data structures, making it difficult to rapidly extract key information to ensure real-time data usability. To address this issue, we deeply analyze the relationship between data sampling and privacy protection, and propose a sampling based fast publishing algorithm with differential privacy for data stream (SPF). SPF introduces an efficient data stream sampling sketch structure (EDS) for the first time, which samples and statistically estimates data within a sliding window and filters out unreasonable data, enabling rapid extraction of key information. Then, we demonstrate that the approximations output by the EDS structure are theoretically equivalent to adding differential privacy noise to the true values. Finally, to meet the privacy protection strength provided by the user while reflecting the true situation of the original data stream, an adaptive noise addition algorithm based on efficient data stream sampling is proposed. According to the relationship between the user-provided privacy protection strength and the privacy protection strength provided by the EDS structure, the algorithm adaptively generates the final publishable histogram through privacy allocation. Experiments show that compared with existing algorithms, SPF significantly reduces time and space overhead while maintaining the same data usability.
[1] |
董昊文,张超,李国良,等. 云原生数据库综述[J]. 软件学报,2023,35(2):899−926
Dong Haowen, Zhang Chao, Li Guoliang, et al. Survey on cloudnative databases[J]. Journal of Software, 2023, 35(2): 899−926 (in Chinese)
|
[2] |
Zhao Zhanhao, Pan Hexiang, Chen Gang, et al. VeriTxn: Verifiable transactions for cloud-native databases with storage disaggregation[J]. Proceedings of the ACM on Management of Data, 2023, 1(4): 1−27
|
[3] |
Papadogiannaki E, Ioannidis S. A survey on encrypted network traffic analysis applications, techniques, and countermeasures[J]. ACM Computing Surveys, 2021, 54(6): 1−35
|
[4] |
Shahraki A, Taherkordi A, Haugen Ø. TONTA: Trend-based online network traffic analysis in ad-hoc IoT networks[J]. Computer Networks, 2021, 194: 108125 doi: 10.1016/j.comnet.2021.108125
|
[5] |
Shahid M R, Blanc G, Zhang Z, et al. IoT devices recognition through network traffic analysis[C]//Proc of 2018 IEEE Int Conf on Big Data (Big Data). Piscataway, NJ: IEEE, 2018: 5187−5192
|
[6] |
Butilă E V, Boboc R G. Urban traffic monitoring and analysis using unmanned aerial vehicles (UAVs): A systematic literature review[J]. Remote Sensing, 2022, 14(3): 620 doi: 10.3390/rs14030620
|
[7] |
Jain N K, Saini R K, Mittal P. A review on traffic monitoring system techniques. Soft Computing: Theories and Applications[C]//Proc of SOCTA 2017, 2019: 569−577
|
[8] |
Figueiras P, Herga Z, Guerreiro G, et al. Real-time monitoring of road traffic using data stream mining[C]//Proc of 2018 IEEE Int Conf on Engineering, Technology and Innovation (ICE/ITMC). Piscataway, NJ: IEEE, 2018: 1−8
|
[9] |
Fang B, Zhang P. Big data in finance[J]. Big Data Concepts, Theories, and Applications, 2016: 391−412
|
[10] |
Fikri N, Rida M, Abghour N, et al. An adaptive and real-time based architecture for financial data integration[J]. Journal of Big Data, 2019, 6: 1−25 doi: 10.1186/s40537-018-0162-3
|
[11] |
Thennakoon A, Bhagyani C, Premadasa S, et al. Real-time credit card fraud detection using machine learning[C]//Proc of 2019 9th Int Conf on Cloud Computing, Data Science & Engineering (Confluence). Piscataway, NJ: IEEE, 2019: 488−493
|
[12] |
Upadhyay J. Sublinear space private algorithms under the sliding window model[C]//Proc of Int Conf on Machine Learning. New York: PMLR, 2019: 6363−6372
|
[13] |
Bassily R, Nissim K, Stemmer U, et al. Practical locally private heavy hitters[J]. The Journal of Machine Learning Research, 2020, 21(1): 535−576
|
[14] |
Li F. Cloud-native database systems at Alibaba: Opportunities and challenges[J]. Proceedings of the VLDB Endowment, 2019, 12(12): 2263−2272 doi: 10.14778/3352063.3352141
|
[15] |
李海翔,李晓燕,刘畅,等. 数据库管理系统中数据异常体系化定义与分类. 软件学报,2022,33(3):909–930
Li Haixiang, Li Xiaoyan, Liu Chang, et al. Systematic definition and classification of data anomalies in data base management systems. Journal of Software, 2022, 33(3): 909−930 (in Chinese)
|
[16] |
赵泓尧,赵展浩,杨皖晴,等. 内存数据库并发控制算法的实验研究. 软件学报,2022,33(3):867–890
Zhao Hongyao, Zhao Zhanhao, Yang Wanqing, et al. Experimental study on concurrency control algorithms in in-memory databases. Journal of Software, 2022, 33(3): 867−890 (in Chinese)
|
[17] |
Spillner J, Toffetti G, López M R. Cloud-native databases: An application perspective[C]//Advances in Service-Oriented and Cloud Computing: Workshops of ESOCC 2017. Berlin: Springer, 2018: 102−116
|
[18] |
Dwork C. Differential privacy[C]//Proc of Int Colloquium on Automata, Languages, and Programming. Berlin: Springer, 2006: 1−12
|
[19] |
Shahin V, Zhang Xinyao, Qiu Dongyu. Analysis and optimization of big-data stream processing[C]//Proc of 2016 IEEE Global Communications Conf (GLOBECOM). Piscataway, NJ: IEEE, 2016. Doi: 10.1109/GLOCOM.2016.7841598
|
[20] |
Bar-Yossef Z, Jayram T S, Kumar R, et al. Counting distinct elements in a data stream[C]//Proc of Int Workshop on Randomization and Approximation Techniques in Computer Science. Berlin: Springer, 2002: 1−10
|
[21] |
Danassis P, Triastcyn A, Faltings B. A distributed differentially private algorithm for resource allocation in unboundedly large settings[J]. arXiv preprint, arXiv: 2011.07934, 2020
|
[22] |
Mazmudar M, Humphries T, Liu J, et al. Cache me if you can: Accuracy-aware inference engine for differentially private data exploration[J]. arXiv preprint, arXiv: 2211.15732, 2022
|
[23] |
Chen Y, Al-Rubaye S, Tsourdos A, et al. Differentially-private federated intrusion detection via knowledge distillation in third-party IoT systems of smart airports[C]//Proc of IEEE Int Conf on Communications (ICC 2023). Piscataway, NJ: IEEE, 2023: 603−608
|
[24] |
Xu Jia, Zhang Zhejie, Xiao Xiaokui, et al. Differentially private histogram publication[J]. The VLDB Journal, 2013, 22(6): 797−822 doi: 10.1007/s00778-013-0309-y
|
[25] |
Zhang X, Chen R, Xu J, et al. Towards accurate histogram publication under differential privacy[C]//Proc of the 2014 SIAM Int Conf on Data Mining. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2014: 587−595
|
[26] |
张啸剑,邵超,孟小峰. 差分隐私下一种精确直方图发布方法[J]. 计算机研究与发展,2016,53(5):1106−1117 doi: 10.7544/issn1000-1239.2016.20150304
Zhang Xiaojian, Shao Chao, Meng Xiaofeng. Accurate histogram release under differential privacy[J]. Journal of Computer Research and Development, 2016, 53(5): 1106−1117 (in Chinese) doi: 10.7544/issn1000-1239.2016.20150304
|
[27] |
唐海霞,杨庚,白云璐. 自适应差分隐私预算分配策略的直方图发布算法[J]. 计算机应用研究,2020,53(7):1952−1957, 1963
Tang Haixia, Yang Geng, Bai Yunlu. Histogram publishing algorithm based on adaptive privacy budget allocation strategy under differential privacy[J]. Application Research of Computers, 2020, 53(7): 1952−1957, 1963 (in Chinese)
|
[28] |
林富鹏,吴英杰,王一蕾,等. 差分隐私二维数据流统计发布[J]. 计算机应用,2015,35(1):88−92 doi: 10.11772/j.issn.1001-9081.2015.01.0088
Lin Fupeng, Wu Yingjie, Wang Yilei, et al. Differentially private statistical publication for two-dimensional data stream[J]. Journal of Computer Applications, 2015, 35(1): 88−92 (in Chinese) doi: 10.11772/j.issn.1001-9081.2015.01.0088
|
[29] |
张啸剑,孟小峰. 基于差分隐私的流式直方图发布方法[J]. 软件学报,2016,27(2):381−393
Zhang Xiaojian, Meng Xiaofeng. Streaming histogram publication method with differential privacy[J]. Journal of Software, 2016, 27(2): 381−393 (in Chinese)
|
[30] |
Sun L, Ge C, Huang X, et al. Differentially private real-time streaming data publication based on sliding window under exponential decay[J]. Computers, Materials and Continua, 2019, 58(1): 61−78 doi: 10.32604/cmc.2019.03744
|
[31] |
Wu X, Tong N, Ye Z, et al. Histogram publishing algorithm based on sampling sorting and greedy clustering[C]//Proc of Int Conf on Blockchain and Trustworthy Systems. Berlin: Springer, 2020: 81−91
|
[32] |
Liu X, Liu H. Data publication based on differential privacy In V2G network[J]. International Journal of Electronics Engineering and Applications, 2021, 9(2): 34−44
|
[33] |
Cardoso A R, Rogers R. Differentially private histograms under continual observation: Streaming selection into the unknown[C]//Proc of Int Conf on Artificial Intelligence and Statistics. New York: PMLR, 2022: 2397−2419
|
[34] |
Cao X, Cao Y, Pappachan P, et al. Differentially private streaming data release under temporal correlations via post-processing[C]//Proc of IFIP Annual Conf on Data and Applications Security and Privacy. Cham: Springer, 2023: 184−200
|
[35] |
Lebeda C J, Tetek J. Better differentially private approximate histograms and heavy hitters using the Misra-Gries sketch[C]//Proc of the 42nd ACM SIGMOD-SIGACT-SIGAI Symp on Principles of Database Systems. New York: ACM, 2023: 79−88
|
[36] |
Streeter M, Golovin D. An online algorithm for maximizing submodular functions[J]. Advances in Neural Information Processing Systems, 2008, 21
|
[37] |
Luo Ziyue, Wu Chuan. An online algorithm for VNF service chain scaling in datacenters[J]. IEEE/ACM Transactions on Networking, 2020, 28(3): 1061−1073 doi: 10.1109/TNET.2020.2979263
|
[38] |
Misra J, Gries D. Finding repeated elements[J]. Science of Computer Programming, 1982, 2(2): 143−152 doi: 10.1016/0167-6423(82)90012-0
|
[39] |
Chan T H H, Li M, Shi E, et al. Differentially private continual monitoring of heavy hitters from distributed streams[C]//Proc of the 12th Int Symp on Privacy Enhancing Technologies (PETS 2012), Berlin: Springer, 2012: 140−159
|
[1] | Cao Yiran, Zhu Youwen, He Xingyu, Zhang Yue. Utility-Optimized Local Differential Privacy Set-Valued Data Frequency Estimation Mechanism[J]. Journal of Computer Research and Development, 2022, 59(10): 2261-2274. DOI: 10.7544/issn1000-1239.20220504 |
[2] | Hong Jinxin, Wu Yingjie, Cai Jianping, Sun Lan. Differentially Private High-Dimensional Binary Data Publication via Attribute Segmentation[J]. Journal of Computer Research and Development, 2022, 59(1): 182-196. DOI: 10.7544/issn1000-1239.20200701 |
[3] | Wu Wanqing, Zhao Yongxin, Wang Qiao, Di Chaofan. A Safe Storage and Release Method of Trajectory Data Satisfying Differential Privacy[J]. Journal of Computer Research and Development, 2021, 58(11): 2430-2443. DOI: 10.7544/issn1000-1239.2021.20210589 |
[4] | Zhang Yuxuan, Wei Jianghong, Li Ji, Liu Wenfen, Hu Xuexian. Graph Degree Histogram Publication Method with Node-Differential Privacy[J]. Journal of Computer Research and Development, 2019, 56(3): 508-520. DOI: 10.7544/issn1000-1239.2019.20170886 |
[5] | Zhu Weijun, You Qingguang, Yang Weidong, Zhou Qinglei. Trajectory Privacy Preserving Based on Statistical Differential Privacy[J]. Journal of Computer Research and Development, 2017, 54(12): 2825-2832. DOI: 10.7544/issn1000-1239.2017.20160647 |
[6] | Wu Yingjie, Zhang Liqun, Kang Jian, Wang Yilei. An Algorithm for Differential Privacy Streaming Data Adaptive Publication[J]. Journal of Computer Research and Development, 2017, 54(12): 2805-2817. DOI: 10.7544/issn1000-1239.2017.20160555 |
[7] | Wang Liang, Wang Weiping, Meng Dan. Privacy Preserving Data Publishing via Weighted Bayesian Networks[J]. Journal of Computer Research and Development, 2016, 53(10): 2343-2353. DOI: 10.7544/issn1000-1239.2016.20160465 |
[8] | Lu Guoqing, Zhang Xiaojian, Ding Liping, Li Yanfeng, Liao Xin. Frequent Sequential Pattern Mining under Differential Privacy[J]. Journal of Computer Research and Development, 2015, 52(12): 2789-2801. DOI: 10.7544/issn1000-1239.2015.20140516 |
[9] | Ouyang Jia, Yin Jian, Liu Shaopeng, Liu Yubao. An Effective Differential Privacy Transaction Data Publication Strategy[J]. Journal of Computer Research and Development, 2014, 51(10): 2195-2205. DOI: 10.7544/issn1000-1239.2014.20130824 |
[10] | Ni Weiwei, Chen Geng, Chong Zhihong, Wu Yingjie. Privacy-Preserving Data Publication for Clustering[J]. Journal of Computer Research and Development, 2012, 49(5): 1095-1104. |