ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2016, Vol. 53 ›› Issue (2): 467-478.doi: 10.7544/issn1000-1239.2016.20148281

• 软件技术 • 上一篇    下一篇

基于MapReduce的多元连接优化方法

李甜甜1,于戈1,郭朝鹏2,宋杰2   

  1. 1(东北大学计算机科学与工程学院 沈阳 110819); 2(东北大学软件学院 沈阳 110819) (litiantian_neu@163.com)
  • 出版日期: 2016-02-01
  • 基金资助: 
    国家自然科学基金重大项目(61433008);国家自然科学基金青年基金项目(61202088);国家博士后科学基金面上项目(2013M540232);中央高校基本科研业务费专项基金项目(N120817001);教育部高等学校博士学科点博导基金项目(20120042110028)

Multi-Way Join Optimization Approach Based on MapReduce

Li Tiantian1, Yu Ge1, Guo Chaopeng2, Song Jie2   

  1. 1(College of Computer Science and Engineering, Northeastern University, Shenyang 110819); 2(Software College, Northeastern University, Shenyang 110819)
  • Online: 2016-02-01

摘要: 多元连接是数据分析最常用的操作之一,MapReduce是广泛用于大规模数据分析处理的编程模型,它给多元连接优化带来新的挑战:传统的优化方法不能简单地适用到MapReduce中;MapReduce连接执行算法尚存优化空间.针对前者,考虑到I/O代价是连接运算的主要代价,首先以降低I/O代价为目标提出一种启发式算法确定多元连接执行顺序,并在此基础上进一步优化,最后针对MapReduce设计一种并行执行策略提高多元连接的整体性能.针对后者,考虑到负载均衡能够有效减少MapReduce的“木桶效应”,通过任务公平分配算法提高连接内部的并行度,并在此基础上给出Reduce任务个数的确定方法.最后,通过实验验证本文提出的执行计划确定方法以及负载均衡算法的优化效果.该研究对大数据环境下MapReduce多元连接的应用具有指导意义,可以优化如OLAP分析中的星型连接、社交网络中社团发现的链式连接等应用的性能.

关键词: 多元连接, 执行计划, I/O代价, 性能优化, MapReduce编程模型, 负载均衡

Abstract: Multi-way join is one of the most common data analysis operations, and MapReduce programming model that has been widely used to process large scale data sets has brought new challenges to multi-way join optimization. Traditional optimization approaches cannot be simply adapted to fit MapReduce feature, so there is still optimization room for MapReduce join algorithm. As to the former, we think I/O is the main cost of join. This paper first proposes an I/O cost based heuristic algorithm to initially determine a join sequence, and conducts further optimization. After the optimization, we also design a parallel execution algorithm to improve the whole performance of multi-way join. As to the latter, we think load balancing can effectively decrease the “buckets effect” of MapReduce. This paper proposes a fair task load allocation algorithm to improve the intra-join parallelism, and also analyzes the method to decide the appropriate number of Reduce tasks. Experiments verify the effectiveness of the proposed optimization approaches. This study contributes to multi-way join applications in big data environment, such as the star-join in OLAP and the chain-join in social network.

Key words: multi-way join, execution plan, I/O cost, performance optimization, MapReduce programming model, load balancing

中图分类号: