基于MapReduce的多元连接优化方法

李甜甜; 于戈; 郭朝鹏; 宋杰

doi:10.7544/issn1000-1239.2016.20148281

基于MapReduce的多元连接优化方法

Multi-Way Join Optimization Approach Based on MapReduce

摘要

摘要: 多元连接是数据分析最常用的操作之一，MapReduce是广泛用于大规模数据分析处理的编程模型，它给多元连接优化带来新的挑战：传统的优化方法不能简单地适用到MapReduce中；MapReduce连接执行算法尚存优化空间.针对前者，考虑到I/O代价是连接运算的主要代价，首先以降低I/O代价为目标提出一种启发式算法确定多元连接执行顺序，并在此基础上进一步优化，最后针对MapReduce设计一种并行执行策略提高多元连接的整体性能.针对后者，考虑到负载均衡能够有效减少MapReduce的“木桶效应”，通过任务公平分配算法提高连接内部的并行度，并在此基础上给出Reduce任务个数的确定方法.最后，通过实验验证本文提出的执行计划确定方法以及负载均衡算法的优化效果.该研究对大数据环境下MapReduce多元连接的应用具有指导意义，可以优化如OLAP分析中的星型连接、社交网络中社团发现的链式连接等应用的性能.

Abstract: Multi-way join is one of the most common data analysis operations, and MapReduce programming model that has been widely used to process large scale data sets has brought new challenges to multi-way join optimization. Traditional optimization approaches cannot be simply adapted to fit MapReduce feature, so there is still optimization room for MapReduce join algorithm. As to the former, we think I/O is the main cost of join. This paper first proposes an I/O cost based heuristic algorithm to initially determine a join sequence, and conducts further optimization. After the optimization, we also design a parallel execution algorithm to improve the whole performance of multi-way join. As to the latter, we think load balancing can effectively decrease the “buckets effect” of MapReduce. This paper proposes a fair task load allocation algorithm to improve the intra-join parallelism, and also analyzes the method to decide the appropriate number of Reduce tasks. Experiments verify the effectiveness of the proposed optimization approaches. This study contributes to multi-way join applications in big data environment, such as the star-join in OLAP and the chain-join in social network.

HTML全文

参考文献(0)

施引文献

资源附件(0)