ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2015, Vol. 52, Issue (6): 1452-1462. doi: 10.7544/issn1000-1239.2015.20140403


Query Optimization by Statistical Approach for Hive Data Warehouse

Wang Youwei (1), Wang Weiping (2), Meng Dan (2)

  (1) Integration Application Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
  (2) Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093
  • Online: 2015-06-01

Abstract: Map/Reduce is an efficient parallel programming model that is widely used to analyze massive data. Hive is an open-source data warehouse that uses Map/Reduce as its query processing engine. However, when Map/Reduce processes skewed data, the workload becomes unevenly distributed across the cluster, with consequences ranging from low runtime efficiency to task failures. To solve this problem, we propose an approach named the computation balanced model (CBM), which optimizes queries by using data distribution statistics. The paper makes two main contributions. First, a runtime cost evaluation model is established for two widely used types of queries, GroupBy and Join, under different situations. Second, a highly efficient statistics collection approach for massive data is designed and implemented, adapted to the data access mechanism of Hive. Experimental results show that CBM reduces the processing time of GroupBy queries by about 8%-45% and that of Join queries by 12%-46%. The balance of the cluster workload distribution improves by about 60%-80% for CPU and 60%-90% for I/O. We conclude that the query plans generated with CBM significantly balance the workload distribution during the execution of Map/Reduce tasks and greatly improve query efficiency.
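The general idea of using key distribution statistics to detect skew before choosing a query plan can be illustrated with a minimal Python sketch. This is not the CBM cost model from the paper, whose details the abstract does not give. The function names, the modulo partitioning of integer keys, and the imbalance threshold of 2.0 are all assumptions made for illustration.

```python
from collections import Counter


def reducer_loads(key_counts, num_reducers):
    """Rows per reducer when integer keys are assigned by key % num_reducers.

    Assumption: simple modulo partitioning stands in for the real partitioner.
    """
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[key % num_reducers] += count
    return loads


def imbalance(loads):
    """Ratio of the heaviest reducer load to the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


def choose_plan(key_counts, num_reducers, threshold=2.0):
    """Pick a hypothetical plan label from the estimated imbalance.

    Assumption: the threshold is illustrative, not a value from the paper.
    """
    ratio = imbalance(reducer_loads(key_counts, num_reducers))
    return "skew-aware" if ratio > threshold else "default"


# Statistics for a GroupBy key: one evenly spread column, one skewed column.
uniform = Counter({k: 100 for k in range(8)})
skewed = Counter({0: 10000, 1: 100, 2: 100, 3: 100})

print(choose_plan(uniform, 4))
print(choose_plan(skewed, 4))
```

In this sketch the skewed column sends almost all rows to a single reducer, so the estimated imbalance exceeds the threshold and the skew-aware label is chosen, while the uniform column keeps the default plan.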

Key words: offline processing of massive data, distributed data warehouse, payload balance, statistics information collection, query optimization
