• 中国精品科技期刊
  • CCF推荐A类中文期刊
  • 计算领域高质量科技期刊T1类
高级检索

基于MapReduce的多元连接优化方法

李甜甜, 于戈, 郭朝鹏, 宋杰

李甜甜, 于戈, 郭朝鹏, 宋杰. 基于MapReduce的多元连接优化方法[J]. 计算机研究与发展, 2016, 53(2): 467-478. DOI: 10.7544/issn1000-1239.2016.20148281
引用本文: 李甜甜, 于戈, 郭朝鹏, 宋杰. 基于MapReduce的多元连接优化方法[J]. 计算机研究与发展, 2016, 53(2): 467-478. DOI: 10.7544/issn1000-1239.2016.20148281
Li Tiantian, Yu Ge, Guo Chaopeng, Song Jie. Multi-Way Join Optimization Approach Based on MapReduce[J]. Journal of Computer Research and Development, 2016, 53(2): 467-478. DOI: 10.7544/issn1000-1239.2016.20148281
Citation: Li Tiantian, Yu Ge, Guo Chaopeng, Song Jie. Multi-Way Join Optimization Approach Based on MapReduce[J]. Journal of Computer Research and Development, 2016, 53(2): 467-478. DOI: 10.7544/issn1000-1239.2016.20148281
李甜甜, 于戈, 郭朝鹏, 宋杰. 基于MapReduce的多元连接优化方法[J]. 计算机研究与发展, 2016, 53(2): 467-478. CSTR: 32373.14.issn1000-1239.2016.20148281
引用本文: 李甜甜, 于戈, 郭朝鹏, 宋杰. 基于MapReduce的多元连接优化方法[J]. 计算机研究与发展, 2016, 53(2): 467-478. CSTR: 32373.14.issn1000-1239.2016.20148281
Li Tiantian, Yu Ge, Guo Chaopeng, Song Jie. Multi-Way Join Optimization Approach Based on MapReduce[J]. Journal of Computer Research and Development, 2016, 53(2): 467-478. CSTR: 32373.14.issn1000-1239.2016.20148281
Citation: Li Tiantian, Yu Ge, Guo Chaopeng, Song Jie. Multi-Way Join Optimization Approach Based on MapReduce[J]. Journal of Computer Research and Development, 2016, 53(2): 467-478. CSTR: 32373.14.issn1000-1239.2016.20148281

基于MapReduce的多元连接优化方法

基金项目: 国家自然科学基金重大项目(61433008);国家自然科学基金青年基金项目(61202088);国家博士后科学基金面上项目(2013M540232);中央高校基本科研业务费专项基金项目(N120817001);教育部高等学校博士学科点博导基金项目(20120042110028)
详细信息
  • 中图分类号: TP393

Multi-Way Join Optimization Approach Based on MapReduce

  • 摘要: 多元连接是数据分析最常用的操作之一,MapReduce是广泛用于大规模数据分析处理的编程模型,它给多元连接优化带来新的挑战:传统的优化方法不能简单地适用到MapReduce中;MapReduce连接执行算法尚存优化空间.针对前者,考虑到I/O代价是连接运算的主要代价,首先以降低I/O代价为目标提出一种启发式算法确定多元连接执行顺序,并在此基础上进一步优化,最后针对MapReduce设计一种并行执行策略提高多元连接的整体性能.针对后者,考虑到负载均衡能够有效减少MapReduce的“木桶效应”,通过任务公平分配算法提高连接内部的并行度,并在此基础上给出Reduce任务个数的确定方法.最后,通过实验验证本文提出的执行计划确定方法以及负载均衡算法的优化效果.该研究对大数据环境下MapReduce多元连接的应用具有指导意义,可以优化如OLAP分析中的星型连接、社交网络中社团发现的链式连接等应用的性能.
    Abstract: Multi-way join is one of the most common data analysis operations, and MapReduce programming model that has been widely used to process large scale data sets has brought new challenges to multi-way join optimization. Traditional optimization approaches cannot be simply adapted to fit MapReduce feature, so there is still optimization room for MapReduce join algorithm. As to the former, we think I/O is the main cost of join. This paper first proposes an I/O cost based heuristic algorithm to initially determine a join sequence, and conducts further optimization. After the optimization, we also design a parallel execution algorithm to improve the whole performance of multi-way join. As to the latter, we think load balancing can effectively decrease the “buckets effect” of MapReduce. This paper proposes a fair task load allocation algorithm to improve the intra-join parallelism, and also analyzes the method to decide the appropriate number of Reduce tasks. Experiments verify the effectiveness of the proposed optimization approaches. This study contributes to multi-way join applications in big data environment, such as the star-join in OLAP and the chain-join in social network.
计量
  • 文章访问数:  1315
  • HTML全文浏览量:  0
  • PDF下载量:  613
  • 被引次数: 0
出版历程
  • 发布日期:  2016-01-31

目录

    /

    返回文章
    返回