ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2014, Vol. 51, Issue (12): 2711-2723. doi: 10.7544/issn1000-1239.2014.20131333


MALK: A MapReduce Framework for High-Efficiently Processing Large Amount of Keys

Zheng Yasong1,2, Wang Da1,2, Ye Xiaochun1, Cui Huimin1, Xu Yuanchao1,3, Fan Dongrui1   

  1. State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 100049; 3. College of Information Engineering, Capital Normal University, Beijing 100048
  • Online:2014-12-01

Abstract: Memory allocation overhead is one of the major bottlenecks of shared-memory MapReduce, especially for applications with a large number of keys. To address this problem, this paper presents MALK, a MapReduce framework with low memory consumption that efficiently processes applications with a large number of keys. First, MALK avoids the repeated allocation of massive numbers of small memory blocks by storing the discrete keys in a contiguous storage area. Second, MALK pipelines Map tasks and Reduce tasks to reduce the amount of data that is active in the system at any one time, and it proposes a reusable hash table mechanism that reuses memory space and thus avoids reallocating the hash table. Furthermore, MALK determines a suitable number of Reduce tasks by evaluating how task quantity and granularity affect performance. Experiments show that, compared with Phoenix++, MALK achieves up to 3.8X higher speedup (2.8X on average) and saves up to 95.2% of memory in the Map phase and 87.8% in the Reduce phase. In addition, MALK reduces waiting time in the Reduce phase by 30% through better load balancing and lowers the cache miss rate by more than 35% on average.
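The first two ideas in the abstract (keys packed into a contiguous storage area, and a hash table that is reset and reused instead of reallocated) can be illustrated with a minimal sketch. This is not MALK's implementation: the class names `KeyArena` and `ReusableCountTable`, the fixed capacity, the linear probing scheme, and the word-count style aggregation are all assumptions chosen for illustration. Python also hides the underlying allocator, so the sketch shows only the data layout and control flow, not the memory savings reported in the paper.

```python
from array import array


class KeyArena:
    """Hypothetical append-only store: all keys live in one contiguous buffer."""

    def __init__(self):
        self.buf = bytearray()      # every key's bytes, back to back
        self.offsets = array("q")   # start position of each key in buf
        self.lengths = array("q")   # length of each key

    def add(self, key: bytes) -> int:
        """Append a key to the buffer and return its integer id."""
        self.offsets.append(len(self.buf))
        self.lengths.append(len(key))
        self.buf += key
        return len(self.offsets) - 1

    def get(self, key_id: int) -> bytes:
        start = self.offsets[key_id]
        return bytes(self.buf[start:start + self.lengths[key_id]])

    def matches(self, key_id: int, key: bytes) -> bool:
        """Compare a stored key with `key` without copying it out of the buffer."""
        return (self.lengths[key_id] == len(key)
                and self.buf.startswith(key, self.offsets[key_id]))

    def reset(self):
        """Drop all keys while keeping the same container objects."""
        del self.buf[:]
        del self.offsets[:]
        del self.lengths[:]


class ReusableCountTable:
    """Hypothetical fixed-capacity, open-addressing table that sums values per key."""

    EMPTY = -1

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.slots = array("q", [self.EMPTY]) * capacity  # key ids, or EMPTY
        self.counts = array("q", [0]) * capacity          # aggregated values
        self.arena = KeyArena()

    def add(self, key: bytes, value: int = 1):
        i = hash(key) % self.capacity
        for _ in range(self.capacity):                    # linear probing
            key_id = self.slots[i]
            if key_id == self.EMPTY:                      # first time this key is seen
                self.slots[i] = self.arena.add(key)
                self.counts[i] = value
                return
            if self.arena.matches(key_id, key):           # key already present
                self.counts[i] += value
                return
            i = (i + 1) % self.capacity
        raise MemoryError("table is full")

    def items(self):
        for i in range(self.capacity):
            if self.slots[i] != self.EMPTY:
                yield self.arena.get(self.slots[i]), self.counts[i]

    def reset(self):
        """Clear the table in place so the next task reuses the same arrays."""
        for i in range(self.capacity):
            self.slots[i] = self.EMPTY
            self.counts[i] = 0
        self.arena.reset()
```

In this sketch a worker would call `add` for every emitted key, hand the contents of `items()` to the next stage, and then call `reset()` before starting its next task, so the slot and count arrays are allocated once per worker rather than once per task.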

Key words: MapReduce, highly efficient MapReduce for applications with a large number of keys (MALK), large number of keys, shared-memory multicore system, memory allocation
