MapReduce Intermediate Result Cache for Concurrent Data Stream Processing

Qi Kaiyuan (亓开元); Han Yanbo (韩燕波); Zhao Zhuofeng (赵卓峰); Fang Jun (房俊)

Abstract

With the growth of Internet of Things applications, real-time processing of sensor data streams over large-scale historical data poses a new challenge. The traditional MapReduce programming model is designed for batch processing of large-scale data and cannot meet real-time requirements. Extending MapReduce with real-time capability through preprocessing, pipelining, and localization requires an intermediate result cache for key-value data. Such a cache should avoid repeated remote I/O and recomputation by making full use of local memory and storage, localize stream processing by distributing data across the cluster, and support the frequent reads and writes of data stream processing. This paper proposes a scalable, extensible, and efficient key-value intermediate result cache that combines in-memory Hash B-tree structures with SSTable files on external storage. To further optimize performance under high concurrency, the paper exploits the balance property of the B-tree to devise a probability-based B-tree construction algorithm and a multi-way search algorithm, and uses read/write overhead estimation and buffer information to improve the external file read/write strategy and the replacement algorithm between memory and external storage. Theoretical analysis and benchmark experiments show that the proposed structures and algorithms further improve the concurrent read/write performance of MapReduce intermediate results, and that the cache effectively supports data stream processing over large-scale data.
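The two-tier layout described in the abstract (an in-memory index backed by sorted files on external storage) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `TwoTierCache`, the `mem_limit` threshold, the flush-everything policy, and the JSON file format are hypothetical, a plain `dict` stands in for the Hash B-tree, and the paper's probability-based construction, multi-way search, and cost-based replacement are not reproduced.

```python
import bisect
import json
import os
import tempfile


class TwoTierCache:
    """Toy key-value cache: a bounded in-memory table plus sorted on-disk runs."""

    def __init__(self, directory, mem_limit=4):
        self.directory = directory
        self.mem_limit = mem_limit  # hypothetical cap on in-memory entries
        self.mem = {}               # stand-in for the in-memory Hash B-tree
        self.runs = []              # paths of flushed runs, oldest first

    def put(self, key, value):
        self.mem[key] = value
        if len(self.mem) > self.mem_limit:
            self._flush()

    def _flush(self):
        # Write every in-memory entry as one key-sorted, immutable file
        # (an SSTable-like run), then empty the in-memory table.
        path = os.path.join(self.directory, f"run-{len(self.runs):06d}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(sorted(self.mem.items()), f)
        self.runs.append(path)
        self.mem.clear()

    def get(self, key, default=None):
        # Memory first, then runs from newest to oldest so that the most
        # recent write for a key wins.
        if key in self.mem:
            return self.mem[key]
        for path in reversed(self.runs):
            with open(path, encoding="utf-8") as f:
                entries = json.load(f)
            keys = [k for k, _ in entries]
            i = bisect.bisect_left(keys, key)  # binary search in the sorted run
            if i < len(keys) and keys[i] == key:
                return entries[i][1]
        return default


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        cache = TwoTierCache(d, mem_limit=2)
        for word, count in [("map", 3), ("reduce", 5), ("cache", 1)]:
            cache.put(word, count)
        print(cache.get("reduce"), len(cache.runs))
```

The sketch reloads a whole run on every miss, which keeps it short but is exactly the kind of read cost that a real design would reduce with buffering and overhead estimation.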
