高级检索

    Hadoop MapReduce短作业执行性能优化

    Performance Optimization for Short Job Execution in Hadoop MapReduce

    • 摘要: Hadoop MapReduce并行计算框架被广泛应用于大规模数据并行处理.近年来,由于其能较好地处理大规模数据,Hadoop MapReduce也被越来越多地使用在查询应用中.为了能够处理大规模数据集,Hadoop的基本设计更多地强调了数据的高吞吐率.然而在处理对短作业响应性能有较高要求的查询应用时,Hadoop MapReduce并行计算框架存在明显不足.为了提升Hadoop对于短作业的执行效率,对原有的Hadoop MapReduce作出以下3点优化:1)通过优化原有的setup和cleanup任务的执行方式,成功地缩短了作业初始化环境准备和作业结束环境清理的时间;2)将首次任务分配从“拉”模式转变为“推”模式;3)将作业执行过程中JobTracker和TaskTrackers之间的控制消息通信从现有的周期性心跳机制中分离出来,采用即时传递机制.最后,采用一种典型的基于MapReduce并行化的查询应用BLAST,对优化工作进行了评估.各种不同类型BLAST作业的测试实验表明,与现有的标准Hadoop相比,优化后的Hadoop平均执行性能提升约23%.

       

      Abstract: Hadoop MapReduce is a widely used parallel computing framework for solving data-intensive problems. Now days, for its good capability for processing large scale data, Hadoop MapReduce has also been adopted in many query applications. To be able to process large scale datasets, the fundamental design of the standard Hadoop places more emphasis on the high-throughput of data than on the job execution performance. This causes performance limitation when we use Hadoop MapReduce to execute short jobs. This paper proposes several optimization methods to improve the execution performance of MapReduce jobs, especially for short jobs. We make three major optimizations: 1) reduce the time cost during the initialization and termination stages of a job by optimizing its setup and cleanup tasks; 2) change the assignment model of the first batch of tasks from the pull model to the push model; 3) replace the heartbeat-base communication mechanism with an instant message communication mechanism for event notifications between the JobTracker and TaskTrackers. We also adopt a typical MapReduce-based parallel query application, BLAST, to evaluate the effects of our optimizations. Experimental results show that the job execution performance of our improved version of Hadoop is about 23% faster on average than the standard Hadoop for different types of BLAST MapReduce jobs.

       

    /

    返回文章
    返回