MapReduce Intermediate Result Cache for Concurrent Data Stream Processing
Abstract
With the development of Internet of Things applications, real-time processing of sensor data streams over large-scale historical data poses a new challenge. The traditional MapReduce programming model is designed for batch-oriented large-scale data processing and cannot satisfy real-time requirements. To extend the real-time processing capability of MapReduce through preprocessing, pipelining, and localization, an intermediate result cache for key-value data is needed: one that avoids repeated remote I/O overhead and computation cost by making full use of local memory and storage, localizes stream processing by distributing data across the cluster, and supports the frequent reads and writes of data stream processing. This paper proposes a scalable, extensible, and efficient key-value intermediate result cache that consists of hash B-tree structures and SSTable files. Furthermore, to optimize performance under high concurrency, this paper also devises a probability-based B-tree structure and a multiplexing search algorithm based on the B-tree balance property, and improves the file read/write strategy and replacement algorithm by exploiting overhead estimation and buffered information. Theoretical analysis and benchmark experiments show that the proposed structures and algorithms further improve the concurrency performance of MapReduce intermediate results, and that the intermediate result cache effectively supports data stream processing over large-scale data.
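To make the cache architecture summarized above more concrete, the sketch below (in Java, not taken from the paper) shows one minimal way an intermediate result cache can pair an in-memory sorted index, standing in here for the hash/B-tree structures, with SSTable-style flushed files. All names in the sketch (IntermediateResultCache, put, get, flush, the tab-separated file layout) are illustrative assumptions rather than the authors' actual design; concurrency control is reduced to coarse synchronization for brevity.

import java.io.*;
import java.util.*;

/**
 * Minimal sketch of an intermediate-result cache: writes land in an
 * in-memory sorted map (a stand-in for the B-tree index) and, once a
 * size threshold is reached, are flushed to an immutable, sorted
 * SSTable-style file that reads fall back to.
 */
public class IntermediateResultCache {
    private final NavigableMap<String, String> memTable = new TreeMap<>(); // in-memory index
    private final List<File> ssTables = new ArrayList<>();                 // flushed sorted files
    private final int flushThreshold;
    private final File dir;

    public IntermediateResultCache(File dir, int flushThreshold) {
        this.dir = dir;
        this.flushThreshold = flushThreshold;
        dir.mkdirs();
    }

    public synchronized void put(String key, String value) throws IOException {
        memTable.put(key, value);
        if (memTable.size() >= flushThreshold) {
            flush();
        }
    }

    public synchronized String get(String key) throws IOException {
        String v = memTable.get(key);
        if (v != null) return v;
        // Fall back to flushed files, newest first.
        for (int i = ssTables.size() - 1; i >= 0; i--) {
            v = searchFile(ssTables.get(i), key);
            if (v != null) return v;
        }
        return null;
    }

    /** Write the in-memory index to disk as a sorted key<TAB>value file. */
    private void flush() throws IOException {
        File f = new File(dir, "sstable-" + ssTables.size() + ".txt");
        try (BufferedWriter w = new BufferedWriter(new FileWriter(f))) {
            for (Map.Entry<String, String> e : memTable.entrySet()) {
                w.write(e.getKey() + "\t" + e.getValue());
                w.newLine();
            }
        }
        ssTables.add(f);
        memTable.clear();
    }

    /** Linear scan for brevity; a real SSTable would use a sparse index or binary search. */
    private String searchFile(File f, String key) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(f))) {
            String line;
            while ((line = r.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab > 0 && line.substring(0, tab).equals(key)) {
                    return line.substring(tab + 1);
                }
            }
        }
        return null;
    }
}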