Optimization of Range Queries and Analysis for MapReduce Systems

Zhao Hui, Yang Shuqiang, Chen Zhikun, Yin Hong, and Jin Songchang

Abstract

In recent years, the MapReduce parallel computing model has attracted wide attention from industry and academia, and systems built on it have been deployed successfully inside large companies such as Google, Yahoo!, and Facebook. However, MapReduce-based systems were originally designed for batch processing of massive unstructured and semi-structured data, for example building inverted indexes, computing the PageRank of web pages, and analyzing logs. Their design lacks optimizations for interactive analysis of massive structured data: they always process data by brute-force scanning of the entire dataset, which runs counter to the selective query and analysis pattern that is common in structured data management. To address this problem, we introduce the global indexing technique widely used in traditional database management and apply it to Hadoop, an open-source project based on the MapReduce model. We build a global index at block granularity over structured data stored in the Hadoop Distributed File System, and present a job compilation and scheduling optimization algorithm for range query analysis. The main goal is to use application semantics and the auxiliary index structure to reduce the number of unnecessary map tasks, and thereby to lower the scheduling and execution costs of a job. In the experiments, we report the optimization effect for four data selectivities (80%, 50%, 30%, and 10%) under three cluster sizes. Job response time improves by up to 5x, I/O cost by up to 10x, and task scheduling cost by up to 11x.
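The idea of pruning map tasks with a block-level index can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the index layout, so the sketch assumes that each index entry records the minimum and maximum key stored in one block, and that one map task is scheduled per selected block. The names `BlockEntry` and `select_blocks` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockEntry:
    """Hypothetical index entry: the key range covered by one file block."""
    block_id: int
    key_min: int
    key_max: int


def select_blocks(index: List[BlockEntry], lo: int, hi: int) -> List[int]:
    """Return ids of blocks whose key range overlaps the closed range [lo, hi]."""
    return [e.block_id for e in index if e.key_min <= hi and e.key_max >= lo]


# Toy index: 10 blocks, each holding 100 consecutive keys (an assumption;
# real data need not be partitioned this cleanly).
index = [BlockEntry(b, b * 100, b * 100 + 99) for b in range(10)]

# A range query over keys 250..449 overlaps blocks 2, 3, and 4 only,
# so 3 map tasks would be scheduled instead of 10.
selected = select_blocks(index, 250, 449)
print(selected)
```

In this toy setting the query touches 3 of 10 blocks, so both the number of scheduled tasks and the bytes read shrink with the selectivity of the query, which is the effect the abstract attributes to the index.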
