Calculation Results Characteristics Extract and Reuse Strategy Based on Hive

Xie Heng; Wang Mei; Le Jiajin; Sun Li

doi:10.7544/issn1000-1239.2015.20140548

Abstract

Jobs in a MapReduce workflow must materialize their intermediate results to HDFS (Hadoop distributed file system) before the next job can run, and the resulting disk I/O makes the workflow inefficient. Building on Hive, a representative existing system, this paper extracts and stores the data characteristics of the results produced by MapReduce workflows and proposes a strategy for matching and reusing those results. First, structures such as the Join-Graph and the Join-Object are defined from the query conditions and used to match reusable results. Based on these structures and on the abstract syntax tree produced by the HiveQL (Hive query language) parser, an algorithm is proposed to generate the Join-Object of a query. A second method then traverses the list of candidate Join-Objects to generate the best reuse plan, covering both single Join-Object reuse and multiple Join-Object reuse. To further increase the probability that a result can be reused, three methods are proposed: multi-key selection, deferred arithmetic operations, and semantic understanding. Finally, experiments are conducted on the data warehouse benchmarks TPC-H and SSB. On TPC-H, reusing a single Join-Object improves efficiency by 28% to 52%, reusing multiple Join-Objects improves it by up to 75%, and the average improvement over all 22 queries is 15.7%. On SSB, efficiency improves by 40% to 76%, with an average of 55%.
