• 中国精品科技期刊
  • CCF推荐A类中文期刊
  • 计算领域高质量科技期刊T1类
Advanced Search
Yue Wenjing, Qu Wenwen, Lin Kuan, Wang Xiaoling. Survey of Cardinality Estimation Techniques Based on Machine Learning[J]. Journal of Computer Research and Development, 2024, 61(2): 413-427. DOI: 10.7544/issn1000-1239.202220649
Citation: Yue Wenjing, Qu Wenwen, Lin Kuan, Wang Xiaoling. Survey of Cardinality Estimation Techniques Based on Machine Learning[J]. Journal of Computer Research and Development, 2024, 61(2): 413-427. DOI: 10.7544/issn1000-1239.202220649

Survey of Cardinality Estimation Techniques Based on Machine Learning

Funds: This work was supported by the National Key Research and Development Program of China (2021YFC3340700), the Key Program of the National Natural Science Foundation of China (62136002), the Science and Technology Commission of Shanghai Municipality (20DZ1100300), and the National Natural Science Foundation of China (61972155).
More Information
  • Author Bio:

    Yue Wenjing: born in 1996. PhD candidate. Student member of CCF. Her main research interests include database, cardinality estimation techniques, machine learning, and recommender system

    Qu Wenwen: born in 1995. PhD candidate. His main research interests include database, graph processing, and distributed graph mining system. (wenwenqu@stu.ecnu.edu.cn)

    Lin Kuan: born in 1982. Master, associate professor. Her main research interest includes signal and information processing. (linkuan@aircas.ac.cn)

    Wang Xiaoling: born in 1975. PhD, professor, PhD supervisor. Senior member of CCF. Her main research interests include knowledge graph, personalized recommendation, and privacy protection

  • Received Date: July 23, 2022
  • Revised Date: April 18, 2023
  • Available Online: November 13, 2023
  • Cardinality estimation is the basis and core of query optimizer for the database management system (DBMS). With the development of artificial intelligence (AI) technology, AI technology has shown superior performance in data processing and extracting the relationship from the data. In recent years, the research of the cardinality estimation method based on machine learning has made significant progress and received wide attention from the academic community. Firstly, we introduce the technical background and development status of cardinality estimation methods based on machine learning. Secondly, we give the definition and the feature encoding technology of the related concepts of cardinality estimation. Then, we expound on the classification structure of cardinality estimation technology from two aspects: traditional cardinality estimation and cardinality estimation based on machine learning. Then, we further subdivide cardinality estimation based on machine learning into three types of cardinality estimation techniques: query-driven, data-driven, and hybrid models. Then, we focus on analyzing the modeling flow, typical methodologies, and characteristics of each type of model. In addition, we analyze and summarize the application of cardinality estimation in SQL and NoSQL. Finally, we discuss the challenges and future research directions on cardinality estimation methods based on machine learning.

  • [1]
    Lakshmi S, Zhou Shaoyu. Selectivity estimation in extensible databases−A neural network approach[C] //Proc of the 24th Int Conf on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann, 1998: 623−627
    [2]
    Lu Hongju, Setiono R. Effective query size estimation using neural networks[J]. Applied Intelligence, 2002, 16(3): 173−183 doi: 10.1023/A:1014333932021
    [3]
    Ortiz J, Balazinska M, Gehrke J, et al. Learning state representations for query optimization with deep reinforcement learning[C/OL] //Proc of the 2nd Workshop on Data Management for End-to-End Machine Learning. New York: ACM. 2018[2023-03-08]. https://dl.acm.org/doi/pdf/10.1145/3209889.3209890
    [4]
    Woltmann L, Hartmann C, Thiele M, et al. Cardinality estimation with local deep learning models[C/OL] //Proc of the 2nd Int Workshop on Exploiting Artificial Intelligence Techniques for Data Management. New York: ACM, 2019 [2023-03-08]. https://dl.acm.org/doi/10.1145/3329859.3329875
    [5]
    Kipf A, Kipf T, Radke B, et al. Learned cardinalities: Estimating correlated joins with deep learning[C/OL] //Proc of the 9th Biennial Conf on Innovative Data Systems Research. 2019[2023-03-08].https://www.cidrdb.org/cidr2019/papers/p101-kipf-cidr19.pdf
    [6]
    Ortiz J, Balazinska M, Gehrke J, et al. An empirical analysis of deep learning for cardinality estimation[J]. arXiv preprint, arXiv: 1905. 06425, 2019
    [7]
    Malik T, Burns R C, Chawla N V. A black-box approach to query cardinality estimation[C/OL] //Proc of the 3rd Biennial Conf on Innovative Data Systems Research. 2007[2023-03-08]. https://www.cidrdb.org/cidr2007/papers/cidr07p06.pdf
    [8]
    Sun Ji, Li Guoliang. An end-to-end learning-based cost estimator[J]. Proceedings of the VLDB Endowment, 2019, 13(3): 307−319 doi: 10.14778/3368289.3368296
    [9]
    Dutt A, Wang Chi, Nazi A, et al. Selectivity estimation for range predicates using lightweight models[J]. Proceedings of the VLDB Endowment, 2019, 12(9): 1044−1057 doi: 10.14778/3329772.3329780
    [10]
    Marcus R, Papaemmanouil O. Towards a hands-free query optimizer through deep learning[J]. arXiv preprint, arXiv: 1809. 10212, 2018
    [11]
    Liu H, Xu Mingbin, Yu Ziting, et al. Cardinality estimation using neural networks[C] //Proc of the 25th Annual Int Conf on Computer Science and Software Engineering. New York: ACM, 2019: 53−59
    [12]
    Liu Jie, Dong Wenqian, Li Dong, et al. Fauce: Fast and accurate deep ensembles with uncertainty for cardinality estimation[J]. Proceedings of the VLDB Endowment, 2021, 14(11): 1950−1963 doi: 10.14778/3476249.3476254
    [13]
    Marcus R, Kipf A, Van Renen A, et al. Flow-loss: Learning cardinality estimates that matter[J]. Proceedings of the VLDB Endowment, 2021, 14(11): 2019−2032 doi: 10.14778/3476249.3476259
    [14]
    Tzoumas K, Deshpande A, Jensen C S. Lightweight graphical models for selectivity estimation without independence assumptions[J]. Proceedings of the VLDB Endowment, 2011, 4(11): 852−863 doi: 10.14778/3402707.3402724
    [15]
    Heimel M, Kiefer M, Markl V. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation[C] //Proc of the 2015 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2015: 1477−1492
    [16]
    Park Y, Zhong Shucheng, Mozafari B. QuickSel: Quick selectivity learning with mixture models[C] //Proc of the 2020 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2020: 1017−1033
    [17]
    Yang Zongheng, Liang E, Kamsetty A, et al. Deep unsupervised cardinality estimation[J]. Proceedings of the VLDB Endowment, 2019, 13(3): 279−292 doi: 10.14778/3368289.3368294
    [18]
    Hasan S, Thirumuruganathan S, Augustine J, et al. Deep learning models for selectivity estimation of multi-attribute queries[C] //Proc of the 2020 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2020: 1035−1050
    [19]
    Yang Zongheng, Kamsetty A, Luan Sifei, et al. NeuroCard: One cardinality estimator for all tables[J]. Proceedings of the VLDB Endowment, 2021, 14(1): 61−73
    [20]
    Hilprecht B, Schmidt A, Kulessa M, et al. DeepDB: Learn from data, not from queries![J]. Proceedings of the VLDB Endowment, 2020, 13(7): 992−1005 doi: 10.14778/3384345.3384349
    [21]
    Zhu Rong, Wu Ziniu, Han Yuxing, et al. FLAT: Fast, lightweight and accurate method for cardinality estimation[J]. arXiv preprint, arXiv: 2011. 09022, 2020
    [22]
    Lan Hai, Bao Zhifeng, Peng Yuwei. A survey on advancing the dbms query optimizer: Cardinality estimation, cost model, and plan enumeration[J]. Data Science and Engineering, 2021, 6(1): 86−101 doi: 10.1007/s41019-020-00149-7
    [23]
    Sun Ji, Zhang Jintao, Sun Zhaoyan, et al. Learned cardinality estimation: A design space exploration and a comparative evaluation[J]. Proceedings of the VLDB Endowment, 2021, 15(1): 85−97 doi: 10.14778/3485450.3485459
    [24]
    Wang Xiaoying, Qu Changbo, Wu Weiyuan, et al. Are we ready for learned cardinality estimation?[J]. Proceedings of the VLDB Endowment, 2021, 14(9): 1640−1654 doi: 10.14778/3461535.3461552
    [25]
    李国良,周煊赫,孙佶,等. 基于机器学习的数据库技术综述[J]. 计算机学报,2020,43(11):2019−2049

    Li Guoliang, Zhou Xuanhe, Sun Ji, et al. Survey on database technology based on machine learning[J]. Chinese Journal of Computers, 2020, 43(11): 2019−2049 (in Chinese)
    [26]
    Leis V, Gubichev A, Mirchev A, et al. How good are query optimizers, really?[J]. Proceedings of the VLDB Endowment, 2015, 9(3): 204−215 doi: 10.14778/2850583.2850594
    [27]
    Sun Ji, Li Guoliang, Tang Nan. Learned cardinality estimation for similarity queries[C] // Proc of the 2021 Int Conf on Management of Data. New York: ACM, 2021: 1745−1757
    [28]
    Dutt A, Wang Chi, Narasayya V, et al. Efficiently approximating selectivity functions using low overhead regression models[J]. Proceedings of the VLDB Endowment, 2020, 13(11): 2215−2228
    [29]
    Acharya J, Diakonikolas I, Hegde C, et al. Fast and near-optimal algorithms for approximating distributions by histograms[C] //Proc of the 34th ACM SIGMOD-SIGACT-SIGAI Symp on Principles of Database Systems. New York: ACM, 2015: 249−263
    [30]
    Xu Jia, Zhang Zhenjie, Xiao Xiaokui, et al. Differentially private histogram publication[J]. The VLDB Journal, 2013, 22(6): 797−822 doi: 10.1007/s00778-013-0309-y
    [31]
    Ioannidis Y. The history of histograms (abridged)[C] //Proc of the 29th Int Conf on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann, 2003: 19−30
    [32]
    Tschorsch F, Scheuermann B. An algorithm for privacy-preserving distributed user statistics[J]. Computer Networks, 2013, 57(14): 2775−2787 doi: 10.1016/j.comnet.2013.05.011
    [33]
    Flajolet P, Éric F, Gandouet O, et al. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm[J]. Analysis of Algorithms , 2007: 137−156
    [34]
    Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments[J]. Journal of Computer and System Sciences, 1999, 58(1): 137−147 doi: 10.1006/jcss.1997.1545
    [35]
    Durand M, Flajolet P. Loglog counting of large cardinalities[C] //Proc of the 11th Annual European Symp. Berlin: Springer, 2003: 605−617
    [36]
    Wu Wentao, Naughton J F, Singh H. Sampling-based query re-optimization[C] //Proc of the 2016 Int Conf on Management of Data. New York: ACM, 2016: 1721−1736
    [37]
    Chen Yun, Yi Ke. Two-level sampling for join size estimation[C] // Proc of the 2017 ACM Int Conf on Management of Data. New York: ACM, 2017: 759−774
    [38]
    Leis V, Radke B, Gubichev A, et al. Cardinality estimation done right: Index-based join sampling[C/OL] //Proc of the 8th Biennial Conf on Innovative Data Systems Research. 2017[2023-03-08].https://www.cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf
    [39]
    Lipton R J, Naughton J F, Schneider D A. Practical selectivity estimation through adaptive sampling[C] //Proc of the 1990 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 1990: 1−11
    [40]
    Olken F, Rotem D. Random sampling from database files: A survey[C] //Proc of the 5th Int Conf on Scientific and Statistical Database Management. Berlin: Springer, 1990: 92−111
    [41]
    Hasan R. Predicting sparql query performance and explaining linked data[C] //Proc of the 11th Int Conf on Semantic Web: Trends and Challenges. Berlin: Springer, 2014: 795−805
    [42]
    Zhang W E, Sheng Q Z, Qin Yongrui, et al. Learning-based SPARQL query performance modeling and prediction[J]. World Wide Web, 2018, 21(4): 1015−1035 doi: 10.1007/s11280-017-0498-1
    [43]
    Neumann T, Weikum G. The RDF-3X engine for scalable management of RDF data[J]. The VLDB Journal, 2010, 19(1): 91−113 doi: 10.1007/s00778-009-0165-y
    [44]
    Stocker M, Seaborne A, Bernstein A, et al. SPARQL basic graph pattern optimization using selectivity estimation[C] //Proc of the 17th Int Conf on World Wide Web. New York: ACM, 2008: 595−604
    [45]
    Vidal M E, Ruckhaus E, Lampo T, et al. Efficiently joining group patterns in SPARQL queries[C] //Proc of the 7th Extended Semantic Web Conf. Berlin: Springer, 2010: 228−242
    [46]
    Huang Hai, Liu Chengfei. Estimating selectivity for joined RDF triple patterns[C] //Proc of the 20th ACM Int Conf on Information and Knowledge Management. New York: ACM, 2011: 1435−1444
    [47]
    Davitkova A, Gjurovski D, Michel S. LMKG: Learned models for cardinality estimation in knowledge Graphs[J]. arXiv preprint, arXiv: 2102. 10588, 2021
    [48]
    Halford M, Saint-Pierre P, Morvan F. An approach based on Bayesian networks for query selectivity estimation[C] // Proc of the 24th Int Conf on Database Systems for Advanced Applications. Berlin: Springer, 2019: 3−19
    [49]
    Liu Qiyu, Shen Yanyan, Chen Lei. LHist: Towards learning multi-dimensional histogram for massive spatial data[C] // Proc of the 37th IEEE Int Conf on Data Engineering. Piscataway, NJ: IEEE, 2021: 1188−1199
    [50]
    Kraska T, Beutel A, Chi E H, et al. The case for learned index structures[C] //Proc of the 2018 Int Conf on Management of Data. New York: ACM, 2018: 489−504
    [51]
    Muralikrishna M, Dewitt D J. Equi-depth histograms for estimating selectivity factors for multi-dimensional[C] //Proc of the 1988 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 1988: 28−36
    [52]
    Wu Peizhi, Cong Gao. A unified deep model of learning from both data and queries for cardinality estimation[C] //Proc of the 2021 Int Conf on Management of Data. New York: ACM, 2021: 2009−2022
  • Related Articles

    [1]Qian Zhongsheng, Huang Heng, Zhu Hui, Liu Jinping. Multi-Perspective Graph Contrastive Learning Recommendation Method with Layer Attention Mechanism[J]. Journal of Computer Research and Development, 2025, 62(1): 160-178. DOI: 10.7544/issn1000-1239.202330804
    [2]Zhang Jinyu, Ma Chenxi, Li Chao, Zhao Zhongying. Towards Lightweight Cross-Domain Sequential Recommendation via Tri-Branches Graph External Attention Network[J]. Journal of Computer Research and Development, 2024, 61(8): 1930-1944. DOI: 10.7544/issn1000-1239.202440197
    [3]Xie Jun, Wang Yuzhu, Chen Bo, Zhang Zehua, Liu Qin. Aspect-Based Sentiment Analysis Model with Bi-Guide Attention Network[J]. Journal of Computer Research and Development, 2022, 59(12): 2831-2843. DOI: 10.7544/issn1000-1239.20210708
    [4]Qian Zhongsheng, Yang Jiaxiu, Li Duanming, Ye Zulai. Event Recommendation Strategy Combining User Long-Short Term Interest and vent Influence[J]. Journal of Computer Research and Development, 2022, 59(12): 2803-2815. DOI: 10.7544/issn1000-1239.20210693
    [5]Sun Qian, Xue Leiqi, Gao Ling, Wang Hai, Wang Yuxiang. Selection of Network Defense Strategies Based on Stochastic Game and Tabu Search[J]. Journal of Computer Research and Development, 2020, 57(4): 767-777. DOI: 10.7544/issn1000-1239.2020.20190870
    [6]Xu Jinghang, Zuo Wanli, Liang Shining, Wang Ying. Causal Relation Extraction Based on Graph Attention Networks[J]. Journal of Computer Research and Development, 2020, 57(1): 159-174. DOI: 10.7544/issn1000-1239.2020.20190042
    [7]Sun Xiaowan, Wang Ying, Wang Xin, Sun Yudong. Aspect-Based Sentiment Analysis Model Based on Dual-Attention Networks[J]. Journal of Computer Research and Development, 2019, 56(11): 2384-2395. DOI: 10.7544/issn1000-1239.2019.20180823
    [8]Zhang Han, Guo Yuanbo, Li Tao. Domain Named Entity Recognition Combining GAN and BiLSTM-Attention-CRF[J]. Journal of Computer Research and Development, 2019, 56(9): 1851-1858. DOI: 10.7544/issn1000-1239.2019.20180733
    [9]Guo Chi, Wang Lina, Guan Yiping, Zhang Xiaoying. A Network Immunization Strategy Based on Dynamic Preference Scan[J]. Journal of Computer Research and Development, 2012, 49(4): 717-724.
    [10]Wang Bailing, Fang Binxing, Yun Xiaochun, Zhang Hongli, Chen Bo, Liu Yixuan. A New Friendly Worm Propagation Strategy Based on Diffusing Balance Tree[J]. Journal of Computer Research and Development, 2006, 43(9): 1593-1602.
  • Cited by

    Periodical cited type(6)

    1. 韩宇捷,徐志杰,杨定裕,黄波,郭健美. CDES:数据驱动的云数据库效能评估方法. 计算机科学. 2024(06): 111-117 .
    2. 刘传磊,张贺,杨贺. 地铁保护区智能化巡查系统开发及应用研究. 现代城市轨道交通. 2024(09): 23-30 .
    3. 董文,张俊峰,刘俊,张雷. 国产数据库在能源数字化转型中的创新应用研究. 信息通信技术与政策. 2024(10): 68-74 .
    4. 阎开. 计算机检测维修与数据恢复技术及应用研究. 信息记录材料. 2023(08): 89-91 .
    5. 冯丽琴,冯花平. 基于人脸识别的可控化学习数据库系统设计. 数字通信世界. 2023(10): 69-71 .
    6. 张惠芹,章小卫,杜坤,李江. 基于数字孪生的高校实验室高温设备智能化监管体系的探究. 实验室研究与探索. 2023(11): 249-252+282 .

    Other cited types(11)

Catalog

    Article views (343) PDF downloads (140) Cited by(17)

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return