ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2014, Vol. 51 ›› Issue (12): 2711-2723.doi: 10.7544/issn1000-1239.2014.20131333

• 软件技术 • 上一篇    下一篇

MALK:一种高效处理大规模键值的MapReduce框架

郑亚松1,2,王达1,叶笑春1,崔慧敏1,徐远超1,3,范东睿1   

  1. 1(计算机体系结构国家重点实验室(中国科学院计算技术研究所) 北京 100190);2(中国科学院大学 北京 100049);3(首都师范大学信息工程学院 北京 100048) (zhengyasong@ict.ac.cn)
  • 出版日期: 2014-12-01
  • 基金资助: 
    基金项目:国家“九七三”重点基础研究发展计划基金项目(2011CB302501);国家杰出青年科学基金项目(60925009);国家自然科学基金项目(60921002,61173007,61100013,61100015,61202059,61202055);国家“八六三”高技术研究发展计划基金项目(2012AA012301,2012AA010303);北京市科技新星计划基金项目(2010B058);计算机体系结构国家重点实验室开放课题(CARCH201203)

MALK: A MapReduce Framework for High-Efficiently Processing Large Amount of Keys

Zheng Yasong1,2, Wang Da1,2, Ye Xiaochun1, Cui Huimin1, Xu Yuanchao1,3, Fan Dongrui1   

  1. 1(State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190); 2(University of Chinese Academy of Sciences, Beijing 100049); 3(College of Information Engineering, Capital Normal University, Beijing 100048)
  • Online: 2014-12-01

摘要: 内存申请是引发共享存储系统上MapReduce性能下降的主要瓶颈之一,特别是对于需要处理大量键值的应用尤为严重.为了解决此问题,提出了一种内存开销低、能高效处理大规模键值的MapReduce并行计算框架——MALK(high-efficient MapReduce for applications having large amount of keys).MALK对于离散的大规模键值采用连续的存储管理方法,避免了大量小块内存的申请;通过更细粒度地处理Map阶段的任务和流水化Reduce阶段的任务,来减少系统运行过程中同时活跃的数据量,从而将应用程序对内存的需求控制在一个较小的范围内;并提出一种Hash表的复用机制,通过复用Hash表的存储空间来避免流水过程中Hash表内存的重复申请;MALK还综合考虑了任务的粒度和数量对任务管理开销和整体性能的影响,把Reduce阶段的任务数量设成对系统性能最优的值.实验结果表明:相对于Phoenix++,MALK的性能最高可提升3.8倍(平均2.8倍);在Map和Reduce阶段,MALK最多可节省95.2%和87.8%的存储空间;MALK在Reduce阶段还取得了更好的负载均衡,降低了L2和LLC Cache的缺失率.

关键词: MapReduce, 面向具有大规模键值应用的MapReduce, 大规模键值, 共享存储多核系统, 内存申请

Abstract: The overhead of memory allocation is one of the major bottlenecks for shared-memory MapReduce, especially for the applications that have large amount of keys. In order to solve this problem, this paper presents a less memory consumption MapReduce, namely MALK, which can high-efficiently process applications with a large number of keys. Firstly, MALK succeeds in avoiding the constant allocations of massive small memory blocks by managing the discrete keys using contiguous area of storage. Secondly, MALK pipelines the process of Map-tasks and Reduce-tasks to decrease the active data in the system at the same time, and proposes a reusable mechanism of Hash table to reuse the memory space so as to avoid the memory reallocation of Hash table. What is more, MALK determines the suitable number of Reduce tasks, by evaluating the effect of task quantity and granularity on performance, to get optimal performance. The experiments show that, compared with Phoenix++, MALK achieves up to 3.8X higher speedup (average of 2.8X), and saves up to 95.2% memory in Map phase and 87.8% memory in Reduce phase. In addition, MALK reduces 30% waiting time with better load balance in Reduce phase, and cuts down more than 35% cache miss rate on average.

Key words: MapReduce, high-efficient MapReduce for applications having large amount of keys (MALK), large amount of keys, shared-memory multicore system, memory allocation

中图分类号: