Citation: | Yue Wenjing, Qu Wenwen, Lin Kuan, Wang Xiaoling. Survey of Cardinality Estimation Techniques Based on Machine Learning[J]. Journal of Computer Research and Development, 2024, 61(2): 413-427. DOI: 10.7544/issn1000-1239.202220649 |
Cardinality estimation is the basis and core of query optimizer for the database management system (DBMS). With the development of artificial intelligence (AI) technology, AI technology has shown superior performance in data processing and extracting the relationship from the data. In recent years, the research of the cardinality estimation method based on machine learning has made significant progress and received wide attention from the academic community. Firstly, we introduce the technical background and development status of cardinality estimation methods based on machine learning. Secondly, we give the definition and the feature encoding technology of the related concepts of cardinality estimation. Then, we expound on the classification structure of cardinality estimation technology from two aspects: traditional cardinality estimation and cardinality estimation based on machine learning. Then, we further subdivide cardinality estimation based on machine learning into three types of cardinality estimation techniques: query-driven, data-driven, and hybrid models. Then, we focus on analyzing the modeling flow, typical methodologies, and characteristics of each type of model. In addition, we analyze and summarize the application of cardinality estimation in SQL and NoSQL. Finally, we discuss the challenges and future research directions on cardinality estimation methods based on machine learning.
[1] |
Lakshmi S, Zhou Shaoyu. Selectivity estimation in extensible databases−A neural network approach[C] //Proc of the 24th Int Conf on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann, 1998: 623−627
|
[2] |
Lu Hongju, Setiono R. Effective query size estimation using neural networks[J]. Applied Intelligence, 2002, 16(3): 173−183 doi: 10.1023/A:1014333932021
|
[3] |
Ortiz J, Balazinska M, Gehrke J, et al. Learning state representations for query optimization with deep reinforcement learning[C/OL] //Proc of the 2nd Workshop on Data Management for End-to-End Machine Learning. New York: ACM. 2018[2023-03-08]. https://dl.acm.org/doi/pdf/10.1145/3209889.3209890
|
[4] |
Woltmann L, Hartmann C, Thiele M, et al. Cardinality estimation with local deep learning models[C/OL] //Proc of the 2nd Int Workshop on Exploiting Artificial Intelligence Techniques for Data Management. New York: ACM, 2019 [2023-03-08]. https://dl.acm.org/doi/10.1145/3329859.3329875
|
[5] |
Kipf A, Kipf T, Radke B, et al. Learned cardinalities: Estimating correlated joins with deep learning[C/OL] //Proc of the 9th Biennial Conf on Innovative Data Systems Research. 2019[2023-03-08].https://www.cidrdb.org/cidr2019/papers/p101-kipf-cidr19.pdf
|
[6] |
Ortiz J, Balazinska M, Gehrke J, et al. An empirical analysis of deep learning for cardinality estimation[J]. arXiv preprint, arXiv: 1905. 06425, 2019
|
[7] |
Malik T, Burns R C, Chawla N V. A black-box approach to query cardinality estimation[C/OL] //Proc of the 3rd Biennial Conf on Innovative Data Systems Research. 2007[2023-03-08]. https://www.cidrdb.org/cidr2007/papers/cidr07p06.pdf
|
[8] |
Sun Ji, Li Guoliang. An end-to-end learning-based cost estimator[J]. Proceedings of the VLDB Endowment, 2019, 13(3): 307−319 doi: 10.14778/3368289.3368296
|
[9] |
Dutt A, Wang Chi, Nazi A, et al. Selectivity estimation for range predicates using lightweight models[J]. Proceedings of the VLDB Endowment, 2019, 12(9): 1044−1057 doi: 10.14778/3329772.3329780
|
[10] |
Marcus R, Papaemmanouil O. Towards a hands-free query optimizer through deep learning[J]. arXiv preprint, arXiv: 1809. 10212, 2018
|
[11] |
Liu H, Xu Mingbin, Yu Ziting, et al. Cardinality estimation using neural networks[C] //Proc of the 25th Annual Int Conf on Computer Science and Software Engineering. New York: ACM, 2019: 53−59
|
[12] |
Liu Jie, Dong Wenqian, Li Dong, et al. Fauce: Fast and accurate deep ensembles with uncertainty for cardinality estimation[J]. Proceedings of the VLDB Endowment, 2021, 14(11): 1950−1963 doi: 10.14778/3476249.3476254
|
[13] |
Marcus R, Kipf A, Van Renen A, et al. Flow-loss: Learning cardinality estimates that matter[J]. Proceedings of the VLDB Endowment, 2021, 14(11): 2019−2032 doi: 10.14778/3476249.3476259
|
[14] |
Tzoumas K, Deshpande A, Jensen C S. Lightweight graphical models for selectivity estimation without independence assumptions[J]. Proceedings of the VLDB Endowment, 2011, 4(11): 852−863 doi: 10.14778/3402707.3402724
|
[15] |
Heimel M, Kiefer M, Markl V. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation[C] //Proc of the 2015 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2015: 1477−1492
|
[16] |
Park Y, Zhong Shucheng, Mozafari B. QuickSel: Quick selectivity learning with mixture models[C] //Proc of the 2020 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2020: 1017−1033
|
[17] |
Yang Zongheng, Liang E, Kamsetty A, et al. Deep unsupervised cardinality estimation[J]. Proceedings of the VLDB Endowment, 2019, 13(3): 279−292 doi: 10.14778/3368289.3368294
|
[18] |
Hasan S, Thirumuruganathan S, Augustine J, et al. Deep learning models for selectivity estimation of multi-attribute queries[C] //Proc of the 2020 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2020: 1035−1050
|
[19] |
Yang Zongheng, Kamsetty A, Luan Sifei, et al. NeuroCard: One cardinality estimator for all tables[J]. Proceedings of the VLDB Endowment, 2021, 14(1): 61−73
|
[20] |
Hilprecht B, Schmidt A, Kulessa M, et al. DeepDB: Learn from data, not from queries![J]. Proceedings of the VLDB Endowment, 2020, 13(7): 992−1005 doi: 10.14778/3384345.3384349
|
[21] |
Zhu Rong, Wu Ziniu, Han Yuxing, et al. FLAT: Fast, lightweight and accurate method for cardinality estimation[J]. arXiv preprint, arXiv: 2011. 09022, 2020
|
[22] |
Lan Hai, Bao Zhifeng, Peng Yuwei. A survey on advancing the dbms query optimizer: Cardinality estimation, cost model, and plan enumeration[J]. Data Science and Engineering, 2021, 6(1): 86−101 doi: 10.1007/s41019-020-00149-7
|
[23] |
Sun Ji, Zhang Jintao, Sun Zhaoyan, et al. Learned cardinality estimation: A design space exploration and a comparative evaluation[J]. Proceedings of the VLDB Endowment, 2021, 15(1): 85−97 doi: 10.14778/3485450.3485459
|
[24] |
Wang Xiaoying, Qu Changbo, Wu Weiyuan, et al. Are we ready for learned cardinality estimation?[J]. Proceedings of the VLDB Endowment, 2021, 14(9): 1640−1654 doi: 10.14778/3461535.3461552
|
[25] |
李国良,周煊赫,孙佶,等. 基于机器学习的数据库技术综述[J]. 计算机学报,2020,43(11):2019−2049
Li Guoliang, Zhou Xuanhe, Sun Ji, et al. Survey on database technology based on machine learning[J]. Chinese Journal of Computers, 2020, 43(11): 2019−2049 (in Chinese)
|
[26] |
Leis V, Gubichev A, Mirchev A, et al. How good are query optimizers, really?[J]. Proceedings of the VLDB Endowment, 2015, 9(3): 204−215 doi: 10.14778/2850583.2850594
|
[27] |
Sun Ji, Li Guoliang, Tang Nan. Learned cardinality estimation for similarity queries[C] // Proc of the 2021 Int Conf on Management of Data. New York: ACM, 2021: 1745−1757
|
[28] |
Dutt A, Wang Chi, Narasayya V, et al. Efficiently approximating selectivity functions using low overhead regression models[J]. Proceedings of the VLDB Endowment, 2020, 13(11): 2215−2228
|
[29] |
Acharya J, Diakonikolas I, Hegde C, et al. Fast and near-optimal algorithms for approximating distributions by histograms[C] //Proc of the 34th ACM SIGMOD-SIGACT-SIGAI Symp on Principles of Database Systems. New York: ACM, 2015: 249−263
|
[30] |
Xu Jia, Zhang Zhenjie, Xiao Xiaokui, et al. Differentially private histogram publication[J]. The VLDB Journal, 2013, 22(6): 797−822 doi: 10.1007/s00778-013-0309-y
|
[31] |
Ioannidis Y. The history of histograms (abridged)[C] //Proc of the 29th Int Conf on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann, 2003: 19−30
|
[32] |
Tschorsch F, Scheuermann B. An algorithm for privacy-preserving distributed user statistics[J]. Computer Networks, 2013, 57(14): 2775−2787 doi: 10.1016/j.comnet.2013.05.011
|
[33] |
Flajolet P, Éric F, Gandouet O, et al. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm[J]. Analysis of Algorithms , 2007: 137−156
|
[34] |
Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments[J]. Journal of Computer and System Sciences, 1999, 58(1): 137−147 doi: 10.1006/jcss.1997.1545
|
[35] |
Durand M, Flajolet P. Loglog counting of large cardinalities[C] //Proc of the 11th Annual European Symp. Berlin: Springer, 2003: 605−617
|
[36] |
Wu Wentao, Naughton J F, Singh H. Sampling-based query re-optimization[C] //Proc of the 2016 Int Conf on Management of Data. New York: ACM, 2016: 1721−1736
|
[37] |
Chen Yun, Yi Ke. Two-level sampling for join size estimation[C] // Proc of the 2017 ACM Int Conf on Management of Data. New York: ACM, 2017: 759−774
|
[38] |
Leis V, Radke B, Gubichev A, et al. Cardinality estimation done right: Index-based join sampling[C/OL] //Proc of the 8th Biennial Conf on Innovative Data Systems Research. 2017[2023-03-08].https://www.cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf
|
[39] |
Lipton R J, Naughton J F, Schneider D A. Practical selectivity estimation through adaptive sampling[C] //Proc of the 1990 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 1990: 1−11
|
[40] |
Olken F, Rotem D. Random sampling from database files: A survey[C] //Proc of the 5th Int Conf on Scientific and Statistical Database Management. Berlin: Springer, 1990: 92−111
|
[41] |
Hasan R. Predicting sparql query performance and explaining linked data[C] //Proc of the 11th Int Conf on Semantic Web: Trends and Challenges. Berlin: Springer, 2014: 795−805
|
[42] |
Zhang W E, Sheng Q Z, Qin Yongrui, et al. Learning-based SPARQL query performance modeling and prediction[J]. World Wide Web, 2018, 21(4): 1015−1035 doi: 10.1007/s11280-017-0498-1
|
[43] |
Neumann T, Weikum G. The RDF-3X engine for scalable management of RDF data[J]. The VLDB Journal, 2010, 19(1): 91−113 doi: 10.1007/s00778-009-0165-y
|
[44] |
Stocker M, Seaborne A, Bernstein A, et al. SPARQL basic graph pattern optimization using selectivity estimation[C] //Proc of the 17th Int Conf on World Wide Web. New York: ACM, 2008: 595−604
|
[45] |
Vidal M E, Ruckhaus E, Lampo T, et al. Efficiently joining group patterns in SPARQL queries[C] //Proc of the 7th Extended Semantic Web Conf. Berlin: Springer, 2010: 228−242
|
[46] |
Huang Hai, Liu Chengfei. Estimating selectivity for joined RDF triple patterns[C] //Proc of the 20th ACM Int Conf on Information and Knowledge Management. New York: ACM, 2011: 1435−1444
|
[47] |
Davitkova A, Gjurovski D, Michel S. LMKG: Learned models for cardinality estimation in knowledge Graphs[J]. arXiv preprint, arXiv: 2102. 10588, 2021
|
[48] |
Halford M, Saint-Pierre P, Morvan F. An approach based on Bayesian networks for query selectivity estimation[C] // Proc of the 24th Int Conf on Database Systems for Advanced Applications. Berlin: Springer, 2019: 3−19
|
[49] |
Liu Qiyu, Shen Yanyan, Chen Lei. LHist: Towards learning multi-dimensional histogram for massive spatial data[C] // Proc of the 37th IEEE Int Conf on Data Engineering. Piscataway, NJ: IEEE, 2021: 1188−1199
|
[50] |
Kraska T, Beutel A, Chi E H, et al. The case for learned index structures[C] //Proc of the 2018 Int Conf on Management of Data. New York: ACM, 2018: 489−504
|
[51] |
Muralikrishna M, Dewitt D J. Equi-depth histograms for estimating selectivity factors for multi-dimensional[C] //Proc of the 1988 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 1988: 28−36
|
[52] |
Wu Peizhi, Cong Gao. A unified deep model of learning from both data and queries for cardinality estimation[C] //Proc of the 2021 Int Conf on Management of Data. New York: ACM, 2021: 2009−2022
|
[1] | Qian Zhongsheng, Huang Heng, Zhu Hui, Liu Jinping. Multi-Perspective Graph Contrastive Learning Recommendation Method with Layer Attention Mechanism[J]. Journal of Computer Research and Development, 2025, 62(1): 160-178. DOI: 10.7544/issn1000-1239.202330804 |
[2] | Zhang Jinyu, Ma Chenxi, Li Chao, Zhao Zhongying. Towards Lightweight Cross-Domain Sequential Recommendation via Tri-Branches Graph External Attention Network[J]. Journal of Computer Research and Development, 2024, 61(8): 1930-1944. DOI: 10.7544/issn1000-1239.202440197 |
[3] | Xie Jun, Wang Yuzhu, Chen Bo, Zhang Zehua, Liu Qin. Aspect-Based Sentiment Analysis Model with Bi-Guide Attention Network[J]. Journal of Computer Research and Development, 2022, 59(12): 2831-2843. DOI: 10.7544/issn1000-1239.20210708 |
[4] | Qian Zhongsheng, Yang Jiaxiu, Li Duanming, Ye Zulai. Event Recommendation Strategy Combining User Long-Short Term Interest and vent Influence[J]. Journal of Computer Research and Development, 2022, 59(12): 2803-2815. DOI: 10.7544/issn1000-1239.20210693 |
[5] | Sun Qian, Xue Leiqi, Gao Ling, Wang Hai, Wang Yuxiang. Selection of Network Defense Strategies Based on Stochastic Game and Tabu Search[J]. Journal of Computer Research and Development, 2020, 57(4): 767-777. DOI: 10.7544/issn1000-1239.2020.20190870 |
[6] | Xu Jinghang, Zuo Wanli, Liang Shining, Wang Ying. Causal Relation Extraction Based on Graph Attention Networks[J]. Journal of Computer Research and Development, 2020, 57(1): 159-174. DOI: 10.7544/issn1000-1239.2020.20190042 |
[7] | Sun Xiaowan, Wang Ying, Wang Xin, Sun Yudong. Aspect-Based Sentiment Analysis Model Based on Dual-Attention Networks[J]. Journal of Computer Research and Development, 2019, 56(11): 2384-2395. DOI: 10.7544/issn1000-1239.2019.20180823 |
[8] | Zhang Han, Guo Yuanbo, Li Tao. Domain Named Entity Recognition Combining GAN and BiLSTM-Attention-CRF[J]. Journal of Computer Research and Development, 2019, 56(9): 1851-1858. DOI: 10.7544/issn1000-1239.2019.20180733 |
[9] | Guo Chi, Wang Lina, Guan Yiping, Zhang Xiaoying. A Network Immunization Strategy Based on Dynamic Preference Scan[J]. Journal of Computer Research and Development, 2012, 49(4): 717-724. |
[10] | Wang Bailing, Fang Binxing, Yun Xiaochun, Zhang Hongli, Chen Bo, Liu Yixuan. A New Friendly Worm Propagation Strategy Based on Diffusing Balance Tree[J]. Journal of Computer Research and Development, 2006, 43(9): 1593-1602. |