ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2015, Vol. 52 ›› Issue (5): 1061-1070. doi: 10.7544/issn1000-1239.2015.20140693

• Software Technology •

Research on Materialization Strategies for Column-Store-Based Big Data Analysis Systems

张滨1,2,乐嘉锦1,孙莉1,夏小玲1,王梅1,李晔锋1   

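  1. 1(College of Computer Science and Technology, Donghua University, Shanghai 201620); 2(Zhejiang University of Finance & Economics, Hangzhou 310018) (dzhangbin@gmail.com)
  • Published: 2015-05-01
  • Funding: 
    Supported by: National Natural Science Foundation of China (61103046); Fundamental Research Funds for the Central Universities (Donghua University "励志计划" project (B201312)); Scientific Research Fund of the Zhejiang Provincial Department of Education (Y201225326, Y201432374)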

Materialization Strategies in Big Data Analysis System Based on Column-Store

Zhang Bin1,2, Le Jiajin1, Sun Li1, Xia Xiaoling1, Wang Mei1, Li Yefeng1   

  1. 1(College of Computer Science and Technology, Donghua University, Shanghai 201620); 2(Zhejiang University of Finance & Economics, Hangzhou 310018)
  • Online: 2015-05-01

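Abstract (Chinese version): Big data is characterized by large scale, great depth, great width, short processing time, commodity hardware, and open-source software. To address the severe performance degradation and limited gains in computational efficiency that traditional databases show when analyzing big data, this paper proposes materialization strategies in MapReduce based on column-store (MSMC) for big data analysis systems. First, a MapReduce materialization cost estimation model is introduced to analyze in depth the factors that affect materialization efficiency. On this basis, a column-store file format for the MapReduce distributed environment (MapReduce column-store file, MCF) is designed, and a co-location strategy is applied during data loading to optimize the storage of materialized data. Second, for the different materialization times, three strategies are constructed: the MapReduce early materialization strategy (MEMS), the MapReduce late materialization strategy (MLMS), and the hybrid MapReduce early-late materialization strategy (MELMS). These are further optimized with an adaptive materialization adjustment strategy. The experimental results demonstrate the effectiveness of the algorithms and also show that they perform well in both storage space and load capacity.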

Key words (Chinese version): big data, column-store, materialization strategy, MapReduce, analysis system

Abstract: Big data is characterized by volume, variety, velocity, commodity hardware, and open-source software. In traditional relational databases, materialization can greatly speed up query processing. However, modern big data analysis faces a confluence of growing challenges, as systems become increasingly inefficient and hard to scale. This paper therefore presents materialization strategies based on column-store to provide an effective environment for big data analysis. Firstly, it analyzes the factors that affect materialization efficiency using a MapReduce cost model. Secondly, it designs the MapReduce column-store file (MCF) and optimizes storage with a cooperative localization (co-location) strategy. Thirdly, according to the different materialization time windows, it proposes materialization strategies in MapReduce based on column-store (MSMC), composed of three strategies: the MapReduce early materialization strategy (MEMS), the MapReduce late materialization strategy (MLMS), and the MapReduce early-late materialization strategy (MELMS). Fourthly, to avoid the uncontrolled expansion of materialization sets, it designs the adaptive materialization set adjustment strategy (AMSAS), which effectively optimizes MSMC. Finally, experiments are conducted to evaluate execution time and load capacity. The results show that MSMC and AMSAS can effectively reduce MapReduce intermediate data processing, network bandwidth consumption, and unnecessary I/O, which verifies the effectiveness of the proposed method for big data analysis.

Key words: big data, column-store, materialization strategy (MS), MapReduce, analysis system
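
The abstract distinguishes early from late materialization in a column store. The difference can be illustrated with a minimal single-machine sketch. This is a generic textbook-style illustration and not the paper's MEMS or MLMS algorithms, which run on MapReduce. The toy table, the function names, and the predicate are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy column store: one list per column, rows share a position.
Columns = Dict[str, List[int]]


def early_materialization(cols: Columns, pred_col: str,
                          pred: Callable[[int], bool],
                          out_cols: List[str]) -> List[Tuple[int, ...]]:
    """Stitch every column into full tuples first, then filter the tuples."""
    names = list(cols)
    rows = zip(*(cols[n] for n in names))          # build complete rows up front
    pred_idx = names.index(pred_col)
    out_idx = [names.index(c) for c in out_cols]
    return [tuple(row[i] for i in out_idx) for row in rows if pred(row[pred_idx])]


def late_materialization(cols: Columns, pred_col: str,
                         pred: Callable[[int], bool],
                         out_cols: List[str]) -> List[Tuple[int, ...]]:
    """Scan only the predicate column, keep matching positions,
    and fetch output values for those positions at the very end."""
    positions = [i for i, v in enumerate(cols[pred_col]) if pred(v)]
    return [tuple(cols[c][i] for c in out_cols) for i in positions]


# Assumed sample data for illustration only.
table: Columns = {
    "id":    [1, 2, 3, 4],
    "price": [10, 55, 70, 20],
    "qty":   [5, 1, 2, 9],
}

result_early = early_materialization(table, "price", lambda p: p > 50, ["id", "qty"])
result_late = late_materialization(table, "price", lambda p: p > 50, ["id", "qty"])
```

Both functions return the same rows. The late variant touches the `id` and `qty` columns only at the positions that pass the filter, which is the kind of intermediate-data and I/O saving the abstract attributes to choosing the materialization time carefully.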

CLC Number: