ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (2): 318-332.doi: 10.7544/issn1000-1239.2020.20190563

Special Issue: 2020大数据与智能存储系统前沿技术专题

Previous Articles     Next Articles

A Benefit Model Based Data Reuse Mechanism for Spark SQL

Shen Yijie, Zeng Dan, and Xiong Jin   

  1. (State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190) (University of Chinese Academy of Sciences, Beijing 100049)
  • Online:2020-02-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program (2016YFB1000202) and the National Natural Science Foundation of China (61379042).

Abstract: Analyzing massive data to discover the potential values in them can bring great benefits. Spark is a widely used data analytics engine for large-scale data processing due to its good scalability and high performance. Spark SQL is the most commonly used programming interface for Spark. There are a lot of redundant computations in data analytic applications. Such redundancies not only waste system resources but also prolong the execution time of queries. However, current implementation of Spark SQL is not aware of redundant computations among data analytic queries, and hence cannot remove them. To address this issue, we present a benefit model based, fine-grained, automatic data reuse mechanism called Criss in this paper. Criss automatically identifies redundant computations among queries. Then it uses an I/O performance aware benefit model to automatically choose the operator results with the biggest benefit and cache these results using a hybrid storage consisting of both memory and HDD. Moreover, cache management and data reuse in Criss are partition-based instead of the whole result of an operator. Such fine-grained mechanism greatly improves query performance and storage utilization. We implement Criss in Spark SQL using modified TachyonFS for data caching. Our experiment results show that Criss outperforms Spark SQL by 40% to 68%.

Key words: data analytics, big data, Spark SQL, redundant computation, data reuse, benefit model

CLC Number: