A Benefit Model Based Data Reuse Mechanism for Spark SQL

申毅杰; 曾丹; 熊劲

doi:10.7544/issn1000-1239.2020.20190563


Abstract

Analyzing massive data to discover its latent value can bring great benefits. Spark is widely used for large-scale data analytics because of its good scalability and high performance, and Spark SQL is its most commonly used programming interface. Data analytics applications contain a large amount of redundant computation, which wastes system resources and prolongs query execution time. However, Spark SQL is not aware of redundant computation across queries and therefore cannot eliminate it. To address this issue, we present Criss, a benefit-model-based, fine-grained, automatic data reuse mechanism. Criss automatically identifies redundant computation among queries. For hybrid storage consisting of memory and HDD, it uses a benefit model that is aware of heterogeneous I/O performance to select the operator results with the greatest reuse benefit, and caches those results. Cache management and data reuse operate at partition granularity rather than on the whole result of an operator, which improves both query efficiency and cache space utilization. We implement Criss on Spark SQL, using a modified TachyonFS for data caching. Experimental results show that Criss improves query performance over the original Spark SQL by 46% to 68% (the source's English abstract gives the range as 40% to 68%).
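The abstract does not give the benefit model itself, so the following is only an illustrative sketch of what an I/O-aware benefit estimate and partition-level placement could look like. Every name here (`Medium`, `Candidate`, `benefit`, `place`), the cost formula, and the greedy placement policy are assumptions made for illustration, not the formulation used in Criss. The assumed benefit of caching a partition on a medium is the recomputation time saved over its expected reuses, minus the time to read it back from that medium and the one-time cost of writing it there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Medium:
    """A cache tier (hypothetical parameters, e.g. memory or HDD)."""
    name: str
    read_mb_s: float
    write_mb_s: float
    capacity_mb: float


@dataclass(frozen=True)
class Candidate:
    """One partition of an operator result that could be cached."""
    name: str
    size_mb: float
    recompute_s: float      # time to recompute this partition from scratch
    expected_reuses: int    # assumed number of future queries that reuse it


def benefit(c: Candidate, m: Medium) -> float:
    """Assumed benefit in seconds: time saved per reuse minus write cost."""
    read_s = c.size_mb / m.read_mb_s
    write_s = c.size_mb / m.write_mb_s
    return c.expected_reuses * (c.recompute_s - read_s) - write_s


def place(candidates, media):
    """Greedy placement: highest benefit per MB first, best medium that fits."""
    free = {m.name: m.capacity_mb for m in media}
    placement = {}

    def density(c):
        return max(benefit(c, m) for m in media) / c.size_mb

    for c in sorted(candidates, key=density, reverse=True):
        # Only media with enough free space and a positive benefit qualify.
        options = [m for m in media
                   if free[m.name] >= c.size_mb and benefit(c, m) > 0]
        if not options:
            continue  # caching this partition would not pay off anywhere
        chosen = max(options, key=lambda m: benefit(c, m))
        free[chosen.name] -= c.size_mb
        placement[c.name] = chosen.name
    return placement


media = [
    Medium("memory", read_mb_s=4000, write_mb_s=4000, capacity_mb=100),
    Medium("hdd", read_mb_s=100, write_mb_s=100, capacity_mb=1000),
]
candidates = [
    Candidate("join_part0", size_mb=80, recompute_s=10, expected_reuses=3),
    Candidate("agg_part1", size_mb=50, recompute_s=4, expected_reuses=2),
    Candidate("scan_part2", size_mb=200, recompute_s=1, expected_reuses=1),
]
placement = place(candidates, media)
print(placement)
```

In this toy setup the cheap-to-recompute scan partition is left uncached because reading it back from HDD would cost more than recomputing it, which is the kind of decision a medium-aware benefit estimate is meant to capture.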
