
Calculation Results Characteristics Extract and Reuse Strategy Based on Hive

Xie Heng, Wang Mei, Le Jiajin, Sun Li

Citation: Xie Heng, Wang Mei, Le Jiajin, Sun Li. Calculation Results Characteristics Extract and Reuse Strategy Based on Hive[J]. Journal of Computer Research and Development, 2015, 52(9): 2014-2024. DOI: 10.7544/issn1000-1239.2015.20140548. CSTR: 32373.14.issn1000-1239.2015.20140548

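Funding: National Natural Science Foundation of China (61103046); Fundamental Research Funds for the Central Universities and the Donghua University 励志计划 (Lizhi Program, B201312)
Details
  • Chinese Library Classification (CLC) number: TP311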

  • Abstract: In existing MapReduce workflows, each job must materialize its results to HDFS (Hadoop distributed file system) before the next job can read them, and the resulting disk I/O lowers efficiency. Building on Hive as a representative system, this paper extracts and stores the data characteristics of the results produced by MapReduce workflows and proposes a strategy for matching and reusing those results. First, structures such as the Join-Graph and Join-Object are defined from the query conditions and used to match reusable results. Based on these structures and on the abstract syntax tree produced by the HiveQL (Hive query language) parser, an algorithm generates the Join-Object of a query; a second method then traverses the candidate Join-Object list to produce the best reuse plan, covering both single Join-Object and multiple Join-Object reuse. To raise the probability of reuse, three further methods are proposed: multi-key selection, deferred arithmetic operations, and semantic understanding. Finally, experiments on the TPC-H and SSB data warehouse benchmarks confirm that reusing calculation results speeds up data processing. On TPC-H, efficiency improves by 28% to 52% when a single Join-Object is reused and by up to 75% when multiple Join-Objects are reused, and by 15.7% on average across all 22 queries. On SSB, efficiency improves by 40% to 76%, 55% on average.
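The matching idea in the abstract can be pictured with a small sketch. The abstract does not give the definitions of Join-Graph or Join-Object, so everything below is an assumption made for illustration: the `JoinObject` class, its fields, the `edge`, `is_reusable`, and `best_single_reuse` helpers, and the rule that a stored result is a candidate when its tables and equi-join conditions are both subsets of the query's. Selection predicates and projected columns are deliberately ignored, and the "best" candidate is simply the one covering the most join conditions, which may differ from the paper's cost criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JoinObject:
    """Hypothetical feature record for a query or a stored result."""
    tables: frozenset       # names of the joined tables
    join_edges: frozenset   # each edge is a frozenset of two "table.column" strings
    location: str = ""      # hypothetical path of the materialized result


def edge(left, right):
    """Build an order-independent equi-join condition."""
    return frozenset((left, right))


def is_reusable(stored, query):
    """Assumed rule: stored tables and join conditions are subsets of the query's."""
    return stored.tables <= query.tables and stored.join_edges <= query.join_edges


def best_single_reuse(candidates, query):
    """Return the reusable candidate covering the most join conditions, or None."""
    usable = [c for c in candidates if is_reusable(c, query)]
    return max(usable, key=lambda c: len(c.join_edges), default=None)


# A query joining three TPC-H tables.
query = JoinObject(
    tables=frozenset({"lineitem", "orders", "customer"}),
    join_edges=frozenset({
        edge("lineitem.l_orderkey", "orders.o_orderkey"),
        edge("orders.o_custkey", "customer.c_custkey"),
    }),
)

# A stored result that joined two of those tables on a condition the query also uses.
stored_ok = JoinObject(
    tables=frozenset({"lineitem", "orders"}),
    join_edges=frozenset({edge("lineitem.l_orderkey", "orders.o_orderkey")}),
    location="/tmp/results/r1",  # hypothetical path
)

# A stored result involving a table the query never touches.
stored_other = JoinObject(
    tables=frozenset({"lineitem", "part"}),
    join_edges=frozenset({edge("lineitem.l_partkey", "part.p_partkey")}),
    location="/tmp/results/r2",  # hypothetical path
)

chosen = best_single_reuse([stored_ok, stored_other], query)
print(chosen.location if chosen else "no reusable result")
```

Using frozen sets makes the comparison independent of the order in which join conditions appear in the query text, which is one plausible reason to match on a graph-like structure rather than on the raw HiveQL string.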
Publication history
  • Published: 2015-08-31
