ISSN 1000-1239 CN 11-1777/TP

• Software Technology •

### Feature Extraction and Reuse Strategy for Computation Results Based on Hive

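1. (College of Computer Science and Technology, Donghua University, Shanghai 201620) (xie_heng@foxmail.com)
• Published: 2015-09-01
• Funding:
Supported by the National Natural Science Foundation of China (61103046) and by the Fundamental Research Funds for the Central Universities and the Donghua University "Lizhi Plan" (B201312)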

### Calculation Results Characteristics Extract and Reuse Strategy Based on Hive

Xie Heng, Wang Mei, Le Jiajin, Sun Li

1. (College of Computer Science and Technology, Donghua University, Shanghai 201620)
• Online: 2015-09-01

Abstract: Jobs in a MapReduce workflow must materialize intermediate data to HDFS (Hadoop Distributed File System), which incurs substantial I/O overhead and lowers efficiency. Building on Hive, a representative existing system, this paper proposes a strategy for matching and reusing MapReduce calculation results by extracting and storing the characteristics of those results. First, we define Join-Graph, Join-Object, and other structures according to the query conditions; these structures are used to find reusable results. Based on the abstract syntax tree generated by the HiveQL (Hive Query Language) parser, we propose an algorithm that generates the Join-Object of a query. Then, by traversing the candidate Join-Object list, a second algorithm generates the best reuse solution, covering both single and multiple Join-Object reuse. In addition, we provide three methods to increase the probability of reuse: multi-key selection, arithmetic delay, and semantic understanding. Finally, we conduct experiments on the TPC-H and SSB benchmarks. On TPC-H, efficiency improves by 28%-52% when a single Join-Object is reused and by up to 75% when multiple Join-Objects are reused, and across all 22 queries efficiency improves by 15.7% on average. On SSB, efficiency improves by 40% to 76%, with an average of 55%.
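The matching step described in the abstract can be pictured with a minimal sketch. The abstract does not give the paper's actual Join-Object or Join-Graph definitions, so everything below is a hypothetical simplification: a `JoinObject` here is only a set of table names plus a set of equi-join conditions, and a stored result is treated as a reuse candidate when all of its tables and conditions also appear in the query. The names `JoinObject`, `make_join_object`, `is_reusable`, and `best_single_reuse` are illustrative and are not taken from the paper or from Hive.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple

# Assumption: a join condition is an unordered pair of "table.column" strings.
JoinCondition = FrozenSet[str]


@dataclass(frozen=True)
class JoinObject:
    """Hypothetical stand-in for a Join-Object: tables plus equi-join conditions."""
    tables: FrozenSet[str]
    conditions: FrozenSet[JoinCondition]


def make_join_object(tables: Iterable[str],
                     conditions: Iterable[Tuple[str, str]]) -> JoinObject:
    """Normalize inputs so that (a, b) and (b, a) describe the same condition."""
    return JoinObject(
        tables=frozenset(tables),
        conditions=frozenset(frozenset(pair) for pair in conditions),
    )


def is_reusable(stored: JoinObject, query: JoinObject) -> bool:
    """A stored result is a candidate if the query contains all of its
    tables and all of its join conditions (the query may contain more)."""
    return stored.tables <= query.tables and stored.conditions <= query.conditions


def best_single_reuse(candidates: Iterable[JoinObject],
                      query: JoinObject) -> Optional[JoinObject]:
    """Assumed heuristic: prefer the candidate covering the most join conditions."""
    usable = [c for c in candidates if is_reusable(c, query)]
    return max(usable, key=lambda c: len(c.conditions), default=None)


# Usage with TPC-H-style table and column names.
query = make_join_object(
    ["customer", "orders", "lineitem"],
    [("customer.c_custkey", "orders.o_custkey"),
     ("orders.o_orderkey", "lineitem.l_orderkey")],
)
stored_a = make_join_object(
    ["customer", "orders"],
    [("orders.o_custkey", "customer.c_custkey")],
)
stored_b = make_join_object(
    ["orders", "part"],
    [("orders.o_orderkey", "part.p_partkey")],
)
chosen = best_single_reuse([stored_b, stored_a], query)
```

In this sketch `stored_a` is selected because its join is fully contained in the query, while `stored_b` is rejected because it involves a table the query does not use. Multiple Join-Object reuse, multi-key selection, arithmetic delay, and semantic understanding are not modeled here.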