ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (4): 798-810.doi: 10.7544/issn1000-1239.2016.20151163

Previous Articles     Next Articles

A Cost-Based Splitting Policy Search Algorithm for Hive Multi-Dimensional Index

Liu Yue1,2, Li Jintao1, Hu Songlin3   

  1. 1Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190); 2University of Chinese Academy of Sciences, Beijing 100049); 3Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093)
  • Online:2016-04-01

Abstract: In the domain of energy Internet, smart city, etc, the massive smart devices collect large amount of data every day, and traditional enterprises need to perform lots of multi-dimensional analysis on these data to support decision-making. Recently, these enterprises try to solve the big data problem with technologies from Internet companies, for example, Hadoop and Hive etc. However, Hive has limited multi-dimensional index ability, and cannot satisfy the requirements of high-performance analysis in traditional enterprises. In this paper, we propose a distributed grid file based multi-dimensional index—DGFIndex to improve the multi-dimensional query performance of Hive. However, DGFIndex needs user to specify the splitting policy when creating index, which is not trivial for user when they are not familiar with data and query pattern. To solve it, we propose a novel MapReduce cost model to measure the DGFIndex-based query performance on specific splitting policy, a two-phase simulated annealing algorithm to search for the suitable splitting policy for DGFIndex, and finally decrease the total cost time of query set. The experimental results show that, DGFIndex improves 50%~114% query performance than original Compact Index in Hive. For static query set, compared with manual-specifying partition policy, our algorithm can choose suitable interval size for each index dimension, and decrease the cost time of query set at most 30%.

Key words: Hive, MapReduce, multi-dimensional index, cost model, simulated annealing

CLC Number: