ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (2): 318-332. doi: 10.7544/issn1000-1239.2020.20190563

Special topic: 2020 Special Issue on Frontier Technologies for Big Data and Intelligent Storage Systems

• System Architecture •

A Benefit Model Based Data Reuse Mechanism for Spark SQL

申毅杰, 曾 丹, 熊 劲   

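  1. (State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190) (University of Chinese Academy of Sciences, Beijing 100049)
  • Published: 2020-02-01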

A Benefit Model Based Data Reuse Mechanism for Spark SQL

Shen Yijie, Zeng Dan, and Xiong Jin   

  1. (State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190) (University of Chinese Academy of Sciences, Beijing 100049)
  • Online: 2020-02-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program (2016YFB1000202) and the National Natural Science Foundation of China (61379042).

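Abstract (translated from Chinese): Discovering the latent value in massive data through data analysis can bring great benefits. Spark offers good scalability and processing performance and is therefore widely used for big data analytics. Spark SQL is Spark's most commonly used programming interface. Data analytics applications contain a large amount of redundant computation, which not only wastes system resources but also lowers query efficiency. However, Spark SQL is unaware of redundant computation across queries. To address this, we propose Criss, a benefit-model-based, fine-grained, automatic data reuse mechanism that reduces redundant computation. For hybrid storage media, we propose a benefit model that is aware of heterogeneous I/O performance and automatically identifies the operator results with the greatest reuse benefit. Criss performs data reuse and cache management at Partition granularity to improve query efficiency and cache space utilization and to take full advantage of data reuse. We implemented Criss on top of Spark SQL and TachyonFS. Experimental results show that Criss improves query performance over the original Spark SQL by 46% to 68%.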

Key words (translated from Chinese): data analysis, big data, Spark SQL, redundant computation, data reuse, benefit model

Abstract: Analyzing massive data to discover its potential value can bring great benefits. Spark is a widely used engine for large-scale data analytics because of its good scalability and high performance, and Spark SQL is its most commonly used programming interface. Data analytics applications contain a large amount of redundant computation, which not only wastes system resources but also prolongs query execution time. However, the current implementation of Spark SQL is not aware of redundant computation across queries and hence cannot remove it. To address this issue, we present Criss, a benefit-model-based, fine-grained, automatic data reuse mechanism. Criss automatically identifies redundant computations among queries. It then uses an I/O-performance-aware benefit model to choose the operator results with the greatest benefit and caches them in hybrid storage consisting of memory and HDD. Moreover, cache management and data reuse in Criss operate on individual partitions rather than on the whole result of an operator. This fine-grained mechanism greatly improves query performance and storage utilization. We implement Criss in Spark SQL, using a modified TachyonFS for data caching. Our experimental results show that Criss outperforms Spark SQL by 40% to 68%.

Key words: data analytics, big data, Spark SQL, redundant computation, data reuse, benefit model