ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (2): 467-478.doi: 10.7544/issn1000-1239.2016.20148281

Previous Articles     Next Articles

Multi-Way Join Optimization Approach Based on MapReduce

Li Tiantian1, Yu Ge1, Guo Chaopeng2, Song Jie2   

  1. 1(College of Computer Science and Engineering, Northeastern University, Shenyang 110819); 2(Software College, Northeastern University, Shenyang 110819)
  • Online:2016-02-01

Abstract: Multi-way join is one of the most common data analysis operations, and MapReduce programming model that has been widely used to process large scale data sets has brought new challenges to multi-way join optimization. Traditional optimization approaches cannot be simply adapted to fit MapReduce feature, so there is still optimization room for MapReduce join algorithm. As to the former, we think I/O is the main cost of join. This paper first proposes an I/O cost based heuristic algorithm to initially determine a join sequence, and conducts further optimization. After the optimization, we also design a parallel execution algorithm to improve the whole performance of multi-way join. As to the latter, we think load balancing can effectively decrease the “buckets effect” of MapReduce. This paper proposes a fair task load allocation algorithm to improve the intra-join parallelism, and also analyzes the method to decide the appropriate number of Reduce tasks. Experiments verify the effectiveness of the proposed optimization approaches. This study contributes to multi-way join applications in big data environment, such as the star-join in OLAP and the chain-join in social network.

Key words: multi-way join, execution plan, I/O cost, performance optimization, MapReduce programming model, load balancing

CLC Number: