• 中国精品科技期刊
  • CCF推荐A类中文期刊
  • 计算领域高质量科技期刊T1类
Advanced Search
Yue Wenjing, Qu Wenwen, Lin Kuan, Wang Xiaoling. Survey of Cardinality Estimation Techniques Based on Machine Learning[J]. Journal of Computer Research and Development, 2024, 61(2): 413-427. DOI: 10.7544/issn1000-1239.202220649
Citation: Yue Wenjing, Qu Wenwen, Lin Kuan, Wang Xiaoling. Survey of Cardinality Estimation Techniques Based on Machine Learning[J]. Journal of Computer Research and Development, 2024, 61(2): 413-427. DOI: 10.7544/issn1000-1239.202220649

Survey of Cardinality Estimation Techniques Based on Machine Learning

Funds: This work was supported by the National Key Research and Development Program of China (2021YFC3340700), the Key Program of the National Natural Science Foundation of China (62136002), the Science and Technology Commission of Shanghai Municipality (20DZ1100300), and the National Natural Science Foundation of China (61972155).
More Information
  • Author Bio:

    Yue Wenjing: born in 1996. PhD candidate. Student member of CCF. Her main research interests include database, cardinality estimation techniques, machine learning, and recommender system

    Qu Wenwen: born in 1995. PhD candidate. His main research interests include database, graph processing, and distributed graph mining system. (wenwenqu@stu.ecnu.edu.cn)

    Lin Kuan: born in 1982. Master, associate professor. Her main research interest includes signal and information processing. (linkuan@aircas.ac.cn)

    Wang Xiaoling: born in 1975. PhD, professor, PhD supervisor. Senior member of CCF. Her main research interests include knowledge graph, personalized recommendation, and privacy protection

  • Received Date: July 23, 2022
  • Revised Date: April 18, 2023
  • Available Online: November 13, 2023
  • Cardinality estimation is the basis and core of query optimizer for the database management system (DBMS). With the development of artificial intelligence (AI) technology, AI technology has shown superior performance in data processing and extracting the relationship from the data. In recent years, the research of the cardinality estimation method based on machine learning has made significant progress and received wide attention from the academic community. Firstly, we introduce the technical background and development status of cardinality estimation methods based on machine learning. Secondly, we give the definition and the feature encoding technology of the related concepts of cardinality estimation. Then, we expound on the classification structure of cardinality estimation technology from two aspects: traditional cardinality estimation and cardinality estimation based on machine learning. Then, we further subdivide cardinality estimation based on machine learning into three types of cardinality estimation techniques: query-driven, data-driven, and hybrid models. Then, we focus on analyzing the modeling flow, typical methodologies, and characteristics of each type of model. In addition, we analyze and summarize the application of cardinality estimation in SQL and NoSQL. Finally, we discuss the challenges and future research directions on cardinality estimation methods based on machine learning.

  • [1]
    Lakshmi S, Zhou Shaoyu. Selectivity estimation in extensible databases−A neural network approach[C] //Proc of the 24th Int Conf on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann, 1998: 623−627
    [2]
    Lu Hongju, Setiono R. Effective query size estimation using neural networks[J]. Applied Intelligence, 2002, 16(3): 173−183 doi: 10.1023/A:1014333932021
    [3]
    Ortiz J, Balazinska M, Gehrke J, et al. Learning state representations for query optimization with deep reinforcement learning[C/OL] //Proc of the 2nd Workshop on Data Management for End-to-End Machine Learning. New York: ACM. 2018[2023-03-08]. https://dl.acm.org/doi/pdf/10.1145/3209889.3209890
    [4]
    Woltmann L, Hartmann C, Thiele M, et al. Cardinality estimation with local deep learning models[C/OL] //Proc of the 2nd Int Workshop on Exploiting Artificial Intelligence Techniques for Data Management. New York: ACM, 2019 [2023-03-08]. https://dl.acm.org/doi/10.1145/3329859.3329875
    [5]
    Kipf A, Kipf T, Radke B, et al. Learned cardinalities: Estimating correlated joins with deep learning[C/OL] //Proc of the 9th Biennial Conf on Innovative Data Systems Research. 2019[2023-03-08].https://www.cidrdb.org/cidr2019/papers/p101-kipf-cidr19.pdf
    [6]
    Ortiz J, Balazinska M, Gehrke J, et al. An empirical analysis of deep learning for cardinality estimation[J]. arXiv preprint, arXiv: 1905. 06425, 2019
    [7]
    Malik T, Burns R C, Chawla N V. A black-box approach to query cardinality estimation[C/OL] //Proc of the 3rd Biennial Conf on Innovative Data Systems Research. 2007[2023-03-08]. https://www.cidrdb.org/cidr2007/papers/cidr07p06.pdf
    [8]
    Sun Ji, Li Guoliang. An end-to-end learning-based cost estimator[J]. Proceedings of the VLDB Endowment, 2019, 13(3): 307−319 doi: 10.14778/3368289.3368296
    [9]
    Dutt A, Wang Chi, Nazi A, et al. Selectivity estimation for range predicates using lightweight models[J]. Proceedings of the VLDB Endowment, 2019, 12(9): 1044−1057 doi: 10.14778/3329772.3329780
    [10]
    Marcus R, Papaemmanouil O. Towards a hands-free query optimizer through deep learning[J]. arXiv preprint, arXiv: 1809. 10212, 2018
    [11]
    Liu H, Xu Mingbin, Yu Ziting, et al. Cardinality estimation using neural networks[C] //Proc of the 25th Annual Int Conf on Computer Science and Software Engineering. New York: ACM, 2019: 53−59
    [12]
    Liu Jie, Dong Wenqian, Li Dong, et al. Fauce: Fast and accurate deep ensembles with uncertainty for cardinality estimation[J]. Proceedings of the VLDB Endowment, 2021, 14(11): 1950−1963 doi: 10.14778/3476249.3476254
    [13]
    Marcus R, Kipf A, Van Renen A, et al. Flow-loss: Learning cardinality estimates that matter[J]. Proceedings of the VLDB Endowment, 2021, 14(11): 2019−2032 doi: 10.14778/3476249.3476259
    [14]
    Tzoumas K, Deshpande A, Jensen C S. Lightweight graphical models for selectivity estimation without independence assumptions[J]. Proceedings of the VLDB Endowment, 2011, 4(11): 852−863 doi: 10.14778/3402707.3402724
    [15]
    Heimel M, Kiefer M, Markl V. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation[C] //Proc of the 2015 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2015: 1477−1492
    [16]
    Park Y, Zhong Shucheng, Mozafari B. QuickSel: Quick selectivity learning with mixture models[C] //Proc of the 2020 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2020: 1017−1033
    [17]
    Yang Zongheng, Liang E, Kamsetty A, et al. Deep unsupervised cardinality estimation[J]. Proceedings of the VLDB Endowment, 2019, 13(3): 279−292 doi: 10.14778/3368289.3368294
    [18]
    Hasan S, Thirumuruganathan S, Augustine J, et al. Deep learning models for selectivity estimation of multi-attribute queries[C] //Proc of the 2020 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2020: 1035−1050
    [19]
    Yang Zongheng, Kamsetty A, Luan Sifei, et al. NeuroCard: One cardinality estimator for all tables[J]. Proceedings of the VLDB Endowment, 2021, 14(1): 61−73
    [20]
    Hilprecht B, Schmidt A, Kulessa M, et al. DeepDB: Learn from data, not from queries![J]. Proceedings of the VLDB Endowment, 2020, 13(7): 992−1005 doi: 10.14778/3384345.3384349
    [21]
    Zhu Rong, Wu Ziniu, Han Yuxing, et al. FLAT: Fast, lightweight and accurate method for cardinality estimation[J]. arXiv preprint, arXiv: 2011. 09022, 2020
    [22]
    Lan Hai, Bao Zhifeng, Peng Yuwei. A survey on advancing the dbms query optimizer: Cardinality estimation, cost model, and plan enumeration[J]. Data Science and Engineering, 2021, 6(1): 86−101 doi: 10.1007/s41019-020-00149-7
    [23]
    Sun Ji, Zhang Jintao, Sun Zhaoyan, et al. Learned cardinality estimation: A design space exploration and a comparative evaluation[J]. Proceedings of the VLDB Endowment, 2021, 15(1): 85−97 doi: 10.14778/3485450.3485459
    [24]
    Wang Xiaoying, Qu Changbo, Wu Weiyuan, et al. Are we ready for learned cardinality estimation?[J]. Proceedings of the VLDB Endowment, 2021, 14(9): 1640−1654 doi: 10.14778/3461535.3461552
    [25]
    李国良,周煊赫,孙佶,等. 基于机器学习的数据库技术综述[J]. 计算机学报,2020,43(11):2019−2049

    Li Guoliang, Zhou Xuanhe, Sun Ji, et al. Survey on database technology based on machine learning[J]. Chinese Journal of Computers, 2020, 43(11): 2019−2049 (in Chinese)
    [26]
    Leis V, Gubichev A, Mirchev A, et al. How good are query optimizers, really?[J]. Proceedings of the VLDB Endowment, 2015, 9(3): 204−215 doi: 10.14778/2850583.2850594
    [27]
    Sun Ji, Li Guoliang, Tang Nan. Learned cardinality estimation for similarity queries[C] // Proc of the 2021 Int Conf on Management of Data. New York: ACM, 2021: 1745−1757
    [28]
    Dutt A, Wang Chi, Narasayya V, et al. Efficiently approximating selectivity functions using low overhead regression models[J]. Proceedings of the VLDB Endowment, 2020, 13(11): 2215−2228
    [29]
    Acharya J, Diakonikolas I, Hegde C, et al. Fast and near-optimal algorithms for approximating distributions by histograms[C] //Proc of the 34th ACM SIGMOD-SIGACT-SIGAI Symp on Principles of Database Systems. New York: ACM, 2015: 249−263
    [30]
    Xu Jia, Zhang Zhenjie, Xiao Xiaokui, et al. Differentially private histogram publication[J]. The VLDB Journal, 2013, 22(6): 797−822 doi: 10.1007/s00778-013-0309-y
    [31]
    Ioannidis Y. The history of histograms (abridged)[C] //Proc of the 29th Int Conf on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann, 2003: 19−30
    [32]
    Tschorsch F, Scheuermann B. An algorithm for privacy-preserving distributed user statistics[J]. Computer Networks, 2013, 57(14): 2775−2787 doi: 10.1016/j.comnet.2013.05.011
    [33]
    Flajolet P, Éric F, Gandouet O, et al. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm[J]. Analysis of Algorithms , 2007: 137−156
    [34]
    Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments[J]. Journal of Computer and System Sciences, 1999, 58(1): 137−147 doi: 10.1006/jcss.1997.1545
    [35]
    Durand M, Flajolet P. Loglog counting of large cardinalities[C] //Proc of the 11th Annual European Symp. Berlin: Springer, 2003: 605−617
    [36]
    Wu Wentao, Naughton J F, Singh H. Sampling-based query re-optimization[C] //Proc of the 2016 Int Conf on Management of Data. New York: ACM, 2016: 1721−1736
    [37]
    Chen Yun, Yi Ke. Two-level sampling for join size estimation[C] // Proc of the 2017 ACM Int Conf on Management of Data. New York: ACM, 2017: 759−774
    [38]
    Leis V, Radke B, Gubichev A, et al. Cardinality estimation done right: Index-based join sampling[C/OL] //Proc of the 8th Biennial Conf on Innovative Data Systems Research. 2017[2023-03-08].https://www.cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf
    [39]
    Lipton R J, Naughton J F, Schneider D A. Practical selectivity estimation through adaptive sampling[C] //Proc of the 1990 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 1990: 1−11
    [40]
    Olken F, Rotem D. Random sampling from database files: A survey[C] //Proc of the 5th Int Conf on Scientific and Statistical Database Management. Berlin: Springer, 1990: 92−111
    [41]
    Hasan R. Predicting sparql query performance and explaining linked data[C] //Proc of the 11th Int Conf on Semantic Web: Trends and Challenges. Berlin: Springer, 2014: 795−805
    [42]
    Zhang W E, Sheng Q Z, Qin Yongrui, et al. Learning-based SPARQL query performance modeling and prediction[J]. World Wide Web, 2018, 21(4): 1015−1035 doi: 10.1007/s11280-017-0498-1
    [43]
    Neumann T, Weikum G. The RDF-3X engine for scalable management of RDF data[J]. The VLDB Journal, 2010, 19(1): 91−113 doi: 10.1007/s00778-009-0165-y
    [44]
    Stocker M, Seaborne A, Bernstein A, et al. SPARQL basic graph pattern optimization using selectivity estimation[C] //Proc of the 17th Int Conf on World Wide Web. New York: ACM, 2008: 595−604
    [45]
    Vidal M E, Ruckhaus E, Lampo T, et al. Efficiently joining group patterns in SPARQL queries[C] //Proc of the 7th Extended Semantic Web Conf. Berlin: Springer, 2010: 228−242
    [46]
    Huang Hai, Liu Chengfei. Estimating selectivity for joined RDF triple patterns[C] //Proc of the 20th ACM Int Conf on Information and Knowledge Management. New York: ACM, 2011: 1435−1444
    [47]
    Davitkova A, Gjurovski D, Michel S. LMKG: Learned models for cardinality estimation in knowledge Graphs[J]. arXiv preprint, arXiv: 2102. 10588, 2021
    [48]
    Halford M, Saint-Pierre P, Morvan F. An approach based on Bayesian networks for query selectivity estimation[C] // Proc of the 24th Int Conf on Database Systems for Advanced Applications. Berlin: Springer, 2019: 3−19
    [49]
    Liu Qiyu, Shen Yanyan, Chen Lei. LHist: Towards learning multi-dimensional histogram for massive spatial data[C] // Proc of the 37th IEEE Int Conf on Data Engineering. Piscataway, NJ: IEEE, 2021: 1188−1199
    [50]
    Kraska T, Beutel A, Chi E H, et al. The case for learned index structures[C] //Proc of the 2018 Int Conf on Management of Data. New York: ACM, 2018: 489−504
    [51]
    Muralikrishna M, Dewitt D J. Equi-depth histograms for estimating selectivity factors for multi-dimensional[C] //Proc of the 1988 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 1988: 28−36
    [52]
    Wu Peizhi, Cong Gao. A unified deep model of learning from both data and queries for cardinality estimation[C] //Proc of the 2021 Int Conf on Management of Data. New York: ACM, 2021: 2009−2022
  • Related Articles

    [1]Fu Maozhong, Hu Haiyang, Li Zhongjin. Dynamic Resource Scheduling Method for GPU Cluster[J]. Journal of Computer Research and Development, 2023, 60(6): 1308-1321. DOI: 10.7544/issn1000-1239.202220149
    [2]Wang Ziyi, Hu Xiaoyu, Wang Xin, Zhang Xinggong, Cao Zhen, Zheng Kai, Cui Yong. Fairness Measurement and Algorithm Design of Network Transmission: A Case Study of Video Applications[J]. Journal of Computer Research and Development, 2023, 60(4): 810-827. DOI: 10.7544/issn1000-1239.202330022
    [3]Zhong Lujie, Wang Mu. Blockchain-Enpowered Cooperative Resource Allocation Scheme for Computing First Network[J]. Journal of Computer Research and Development, 2023, 60(4): 750-762. DOI: 10.7544/issn1000-1239.202330002
    [4]Fang Rongqiang, Wang Jing, Yao Zhicheng, Liu Chang, Zhang Weigong. Modeling Computational Feature of Multi-Layer Neural Network[J]. Journal of Computer Research and Development, 2019, 56(6): 1170-1181. DOI: 10.7544/issn1000-1239.2019.20190111
    [5]Xu Hongzhi, Li Renfa, Zeng Lining. Parallel Task Scheduling for Resource Consumption Minimization with Reliability Constraint[J]. Journal of Computer Research and Development, 2018, 55(11): 2569-2583. DOI: 10.7544/issn1000-1239.2018.20170893
    [6]Xu Ran, Wang Wendong, Gong Xiangyang, Que Xirong. Delay-Aware Resource Scheduling Optimization in Network Function Virtualization[J]. Journal of Computer Research and Development, 2018, 55(4): 738-747. DOI: 10.7544/issn1000-1239.2018.20170926
    [7]WeiWei, LiuYang, YangWeidong. A Fast Approximation Algorithm for the General Resource Placement Problem in Cloud Computing Platform[J]. Journal of Computer Research and Development, 2016, 53(3): 697-703. DOI: 10.7544/issn1000-1239.2016.20148323
    [8]Fan Pengyi, Wang Hui, Jiang Zhihong, and Li Pei. Measurement of Microblogging Network[J]. Journal of Computer Research and Development, 2012, 49(4): 691-699.
    [9]Xie Yingke, Wang Jiandong, Zhu Chao, Zhao Zili, Han Chengde. High Precision Timestamps in Network Measurement[J]. Journal of Computer Research and Development, 2010, 47(12).
    [10]Jiao Jian, Yao Shan, Li Xiaojian. Research on Network Bidirectional Topology Discovery Based on Measurer by Spreading[J]. Journal of Computer Research and Development, 2010, 47(5): 903-910.

Catalog

    Article views (366) PDF downloads (143) Cited by()

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return