ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2016, Vol. 53 ›› Issue (4): 798-810.doi: 10.7544/issn1000-1239.2016.20151163

• 软件技术 • 上一篇    下一篇



  1. 1(中国科学院计算技术研究所 北京 100190);2(中国科学院大学 北京 100049);3(中国科学院信息工程研究所 北京 100093) (
  • 出版日期: 2016-04-01
  • 基金资助: 

A Cost-Based Splitting Policy Search Algorithm for Hive Multi-Dimensional Index

Liu Yue1,2, Li Jintao1, Hu Songlin3   

  1. 1Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190); 2University of Chinese Academy of Sciences, Beijing 100049); 3Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093)
  • Online: 2016-04-01

摘要: 在能源互联网、智慧城市等新兴领域,智能终端采集的庞大数据往往需要多维分析,传统企业寻求借助互联网技术(如Hadoop和Hive)应对大数据问题.但是Hive当前的多维索引能力较弱,无法满足传统企业的需求.针对这一问题,提出了一种基于分布式网格文件的多维索引技术——DGFIndex,来提升Hive的多维查询处理能力.但是在创建DGFIndex时,需要用户指定各个索引维度的分割粒度,而分割粒度的大小与查询性能息息相关.在用户对数据与查询特征不熟悉时,很难选择较优的分割策略.为了解决这一问题,通过建立新的MapReduce代价模型,并使用两阶段模拟退火算法为DGFIndex搜索较优的分割策略,从而提升查询性能,减少查询集合的总耗时.实验结果表明:DGFIndex可以提升Hive多维查询性能50%~114%,对于固定的查询集合,与人工选定分割策略比较,基于代价估计的分割策略选择算法可以为DGFIndex快速选定较优的分割策略,并可以使整个查询集合的处理时间比人工方法最多减少30%.

关键词: Hive, MapReduce, 多维索引, 代价模型, 模拟退火

Abstract: In the domain of energy Internet, smart city, etc, the massive smart devices collect large amount of data every day, and traditional enterprises need to perform lots of multi-dimensional analysis on these data to support decision-making. Recently, these enterprises try to solve the big data problem with technologies from Internet companies, for example, Hadoop and Hive etc. However, Hive has limited multi-dimensional index ability, and cannot satisfy the requirements of high-performance analysis in traditional enterprises. In this paper, we propose a distributed grid file based multi-dimensional index—DGFIndex to improve the multi-dimensional query performance of Hive. However, DGFIndex needs user to specify the splitting policy when creating index, which is not trivial for user when they are not familiar with data and query pattern. To solve it, we propose a novel MapReduce cost model to measure the DGFIndex-based query performance on specific splitting policy, a two-phase simulated annealing algorithm to search for the suitable splitting policy for DGFIndex, and finally decrease the total cost time of query set. The experimental results show that, DGFIndex improves 50%~114% query performance than original Compact Index in Hive. For static query set, compared with manual-specifying partition policy, our algorithm can choose suitable interval size for each index dimension, and decrease the cost time of query set at most 30%.

Key words: Hive, MapReduce, multi-dimensional index, cost model, simulated annealing