Abstract:
Map/Reduce is an efficient parallel programming model that is now widely used to analyze massive data. Hive is an open-source data warehouse whose query processing engine is built on Map/Reduce. However, processing skewed data with Map/Reduce leads to an unbalanced workload distribution across the cluster, with consequences ranging from low runtime efficiency to task failures. To solve this problem, we propose an approach named the computation balanced model (CBM), which optimizes queries by using distribution statistics. The paper makes two main contributions. First, a runtime cost evaluation model is established for two widely used types of queries, GroupBy and Join, under different situations. Second, a highly efficient statistics approach for massive data is designed and implemented, adapted to the data access mechanism of Hive. Experimental results show that the processing time of GroupBy queries optimized by CBM is reduced by about 8%-45%, while the processing time of Join queries is reduced by 12%-46%. The balance of the cluster load distribution is improved by about 60%-80% for CPU and 60%-90% for I/O. We believe that the optimized query plans generated by CBM significantly balance the load distribution during the execution of Map/Reduce tasks and greatly improve query efficiency.
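
To make the skew problem concrete, the following minimal sketch shows how one dominant key can overload a single reducer, and how simple key-frequency statistics can flag that key in advance. It is an illustration only, not the CBM cost model or Hive's implementation: the function names (`reducer_loads`, `imbalance`, `skewed_keys`), the modulo partitioner standing in for hash partitioning, and the synthetic key sets are all assumptions made for this example.

```python
from collections import Counter


def reducer_loads(keys, num_reducers):
    """Count the records each reducer receives.

    Assumption: integer keys are assigned by `key % num_reducers`,
    a stand-in for a hash partitioner.
    """
    loads = [0] * num_reducers
    for key in keys:
        loads[key % num_reducers] += 1
    return loads


def imbalance(loads):
    """Heaviest reducer load divided by the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


def skewed_keys(keys, threshold):
    """Keys whose share of all records exceeds `threshold`.

    A hypothetical, much simplified form of distribution statistics.
    """
    counts = Counter(keys)
    total = len(keys)
    return sorted(k for k, c in counts.items() if c / total > threshold)


# Synthetic data: 8 distinct keys, 800 records in each data set.
uniform = [k for k in range(8) for _ in range(100)]
skewed = [0] * 730 + [k for k in range(1, 8) for _ in range(10)]

if __name__ == "__main__":
    for name, keys in (("uniform", uniform), ("skewed", skewed)):
        loads = reducer_loads(keys, 4)
        print(name, loads, round(imbalance(loads), 2), skewed_keys(keys, 0.5))
```

With four reducers, the uniform keys give every reducer the same number of records, while in the skewed data the reducer that receives key 0 handles most of the input. A statistics-driven optimizer could use the flagged keys to choose a different plan for them; how CBM does this is described in the paper itself, not in this sketch.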