ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2015, Vol. 52 ›› Issue (6): 1452-1462. doi: 10.7544/issn1000-1239.2015.20140403

• Software Technology •

Implementation of Query Optimization for the Hive Data Warehouse Based on Statistical Methods

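Wang Youwei1, Wang Weiping2, Meng Dan2

  1. 1(Integration Application Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190);2(Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093) (wangyouwei@ncic.ac.cn)
  • Published: 2015-06-01
  • Supported by:
    the National High Technology Research and Development Program of China (863 Program) (2013AA013204); the National Science and Technology Major Project for Core Electronic Devices, High-end Generic Chips and Basic Software (2013ZX01039-002-001-001); the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200)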

Query Optimization by Statistical Approach for Hive Data Warehouse

Wang Youwei1, Wang Weiping2, Meng Dan2   

  1. 1(Integration Application Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190);2(Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093)
  • Online: 2015-06-01

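Abstract: Map/Reduce is a parallel programming model widely used for offline analysis of massive data. The Hive data warehouse implements its query processing engine on top of Map/Reduce; however, the Map/Reduce framework distributes workload unevenly when it processes skewed data. The core idea of the computation balanced model (CBM) is to use data distribution characteristics to guide query plan optimization. The research contributions fall into two parts. First, runtime cost estimation models are established for the widely used GroupBy and Join queries, determining which query plan branch to select in different scenarios. Second, a statistics collection method is designed on top of the Hive ETL mechanism, which solves the problem of gathering distribution characteristics of massive data. Experimental data show that CBM optimization reduces GroupBy query time by 8%~45% and Join query time by 12%~46%; the cluster CPU load-balance metric improves by 60%~80% and the I/O load-balance metric by 60%~90%. The experimental results confirm that a query plan generator optimized with the CBM model significantly balances cluster load while Hive queries run and improves query processing efficiency.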

Key words: offline processing of massive data, distributed data warehouse, load balancing, statistics collection, query optimization

Abstract: Map/Reduce is an efficient parallel programming model that is now widely used to analyze massive data. Hive is an open source data warehouse that uses Map/Reduce to implement its query processing engine. However, workload becomes unevenly distributed across the cluster when Map/Reduce processes skewed data, and the consequences range from low runtime efficiency to task failures. To solve this problem, we propose an approach named the computation balanced model (CBM), which optimizes queries by using data distribution statistics. The main contributions of this paper fall into two parts. First, a runtime cost evaluation model is established for two widely used types of queries, GroupBy and Join, to select the query plan under different situations. Second, an efficient statistics collection approach for massive data is designed and implemented on top of the data access mechanism of Hive. Experimental results show that CBM reduces the processing time of GroupBy queries by about 8%-45% and the processing time of Join queries by about 12%-46%. The balance of cluster load improves by about 60%-80% for CPU and 60%-90% for I/O. These results indicate that the query plan generator optimized by CBM significantly balances the load distribution during the execution of Map/Reduce tasks and improves query efficiency.
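The skew problem described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's CBM cost model: the function names, the max-to-mean imbalance ratio, the threshold of 2.0, and the "two_stage" plan (spreading each key's rows across all reducers for partial aggregation before a final merge) are all assumptions chosen for illustration. It only shows how per-key frequency statistics could, in principle, steer the choice between GroupBy plans.

```python
def hashed_loads(key_counts, num_reducers):
    """Rows per reducer when every row of a key goes to reducer key % num_reducers."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[key % num_reducers] += count
    return loads


def spread_loads(key_counts, num_reducers):
    """Rows per reducer when each key's rows are spread evenly over all reducers
    (a first, partial-aggregation stage; a final merge stage is not modeled)."""
    loads = [0] * num_reducers
    for count in key_counts.values():
        base, extra = divmod(count, num_reducers)
        for i in range(num_reducers):
            loads[i] += base + (1 if i < extra else 0)
    return loads


def imbalance(loads):
    """Busiest reducer's load divided by the mean load (1.0 means perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


def choose_groupby_plan(key_counts, num_reducers, threshold=2.0):
    """Hypothetical rule: pick the two-stage plan when plain hashing is too uneven."""
    if imbalance(hashed_loads(key_counts, num_reducers)) > threshold:
        return "two_stage"
    return "single_stage"


# Hypothetical statistics: key 0 holds 9000 of 10000 rows, keys 1..10 hold 100 each.
skewed = {0: 9000, **{k: 100 for k in range(1, 11)}}
uniform = {k: 100 for k in range(8)}

print(hashed_loads(skewed, 4))          # [9200, 300, 300, 200]
print(spread_loads(skewed, 4))          # [2500, 2500, 2500, 2500]
print(choose_groupby_plan(skewed, 4))   # two_stage
print(choose_groupby_plan(uniform, 4))  # single_stage
```

With the skewed statistics, one reducer receives 92% of the rows under plain hashing, so the rule selects the two-stage plan. With the uniform statistics, the extra stage would only add overhead, so the single-stage plan is kept.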

Key words: offline processing of massive data, distributed data warehouse, load balancing, statistics collection, query optimization

CLC number: