A Benefit Model Based Data Reuse Mechanism for Spark SQL
Abstract: Analyzing massive data sets to discover their latent value can bring great benefits. Spark is widely used for large-scale data analytics because of its good scalability and high performance, and Spark SQL is its most commonly used programming interface. Data analytics applications contain a large amount of redundant computation, which wastes system resources and prolongs query execution time. However, the current implementation of Spark SQL is not aware of redundant computation across queries and therefore cannot eliminate it. To address this issue, we present Criss, a benefit-model-based, fine-grained, automatic data reuse mechanism. Criss automatically identifies redundant computation among queries. It then uses an I/O-performance-aware benefit model to select the operator results with the greatest reuse benefit, and caches these results in hybrid storage consisting of both memory and HDD. Moreover, cache management and data reuse in Criss operate at partition granularity rather than on the whole result of an operator. This fine-grained mechanism improves both query performance and cache space utilization. We implement Criss on top of Spark SQL, using a modified TachyonFS for data caching. Our experimental results show that Criss outperforms the original Spark SQL by 40% to 68% (the Chinese-language abstract gives the range as 46% to 68%).
Keywords: data analytics; big data; Spark SQL; redundant computation; data reuse; benefit model
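The abstract does not spell out the benefit model itself, so the following is only a minimal sketch of how an I/O-aware, partition-level cache selection could look. Everything in it is an assumption made for illustration: the `Partition` fields, the bandwidth figures in `READ_MBPS`, the formula in `benefit` (expected reuses times the recomputation time saved after paying the read cost of the chosen tier), and the greedy `plan_cache` policy. None of it is taken from Criss.

```python
from dataclasses import dataclass

# Hypothetical read bandwidth per storage tier, in MB/s (illustrative values only).
READ_MBPS = {"memory": 5000.0, "hdd": 150.0}


@dataclass(frozen=True)
class Partition:
    name: str
    size_mb: float        # size of the cached partition (assumed > 0)
    recompute_s: float    # time to recompute the partition from its inputs
    expected_reuses: int  # how many later reads are expected


def benefit(p: Partition, tier: str) -> float:
    """Estimated seconds saved by caching p on the given tier (assumed formula)."""
    read_s = p.size_mb / READ_MBPS[tier]
    return p.expected_reuses * (p.recompute_s - read_s)


def plan_cache(partitions, capacity_mb):
    """Greedy placement: fill faster tiers first, highest benefit per MB first."""
    placement = {}
    remaining = list(partitions)
    for tier in sorted(capacity_mb, key=lambda t: READ_MBPS[t], reverse=True):
        free = capacity_mb[tier]
        remaining.sort(key=lambda p: benefit(p, tier) / p.size_mb, reverse=True)
        skipped = []
        for p in remaining:
            # Cache only if reading from this tier beats recomputing and it fits.
            if benefit(p, tier) > 0 and p.size_mb <= free:
                placement[p.name] = tier
                free -= p.size_mb
            else:
                skipped.append(p)
        remaining = skipped
    return placement


parts = [
    Partition("a", size_mb=100, recompute_s=10.0, expected_reuses=3),
    Partition("b", size_mb=100, recompute_s=2.0, expected_reuses=1),
    Partition("c", size_mb=300, recompute_s=1.0, expected_reuses=5),
]
plan = plan_cache(parts, {"memory": 100, "hdd": 500})
```

In this toy run, partition `c` is left uncached: it does not fit in memory, and reading 300 MB from the slower tier would take longer than recomputing it. That is the kind of decision a tier-aware benefit estimate is meant to capture, and working per partition rather than per operator result lets a small cache hold only the pieces that pay off.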