ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2017, Vol. 54, Issue (6): 1381-1390. doi: 10.7544/issn1000-1239.2017.20170108

Special topic: 2017 Special Issue on Frontier Technologies in Computer Architecture (I)

• System Architecture •

Design of RDD Persistence Method in Spark for SSDs

Lu Kezhong1, Zhu Jinbin2,4, Li Zhengmin3, Sui Xiufeng4,5

  1. College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060
  2. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 511400
  3. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029
  4. State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190
  5. Strategic Studies Centre, Chinese Academy of Engineering, Beijing 100088
  (kzlu@szu.edu.cn)
  • Published online: 2017-06-01
  • Supported by:
    National High Technology Research and Development Program of China (863 Program) (2015AA015305); Natural Science Foundation of Guangdong Province (2014A030313553); Guangdong Provincial and Ministerial Industry-University-Research Project (2013B090500055); Shenzhen Basic Research Discipline Layout Project (JCYJ20150529164656096)

Abstract: Datacenters built on hybrid storage that combines solid-state drives (SSDs) and hard disk drives (HDDs) have become a high-performance platform for big data computing. Workloads in such datacenters should be able to persist data with different characteristics to SSD or HDD on demand in order to improve overall system performance. Spark is an efficient big data computing framework that is widely used in industry and is especially well suited to applications with many iterations, because it can persist intermediate data in memory or on disk; persisting to disk also removes the limit that insufficient memory capacity places on data set size. However, the current Spark implementation provides no explicit SSD-oriented persistence interface. Although data can be distributed proportionally across different storage media according to configuration, users cannot specify the persistence medium of an RDD according to the characteristics of its data, so the mechanism lacks specificity and flexibility. This not only becomes a bottleneck for further improving Spark performance, but also seriously limits the performance that the hybrid storage system can deliver. To address this, we propose what is, to the best of our knowledge, the first SSD-oriented data persistence strategy. We study the data persistence mechanism in Spark, optimize its persistence architecture for hybrid storage systems, and provide a dedicated persistence API through which users can explicitly and flexibly specify the persistence medium of an RDD. Experimental results on SparkBench show that Spark optimized with this scheme improves performance by 14.02% on average compared with the native version.
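The contrast drawn in the abstract, between placement decided by configuration alone and placement named explicitly by the user, can be illustrated with a toy model. The sketch below is not Spark code and not the API proposed in the paper: `Medium`, `LocalDirs`, `default_dir`, and `explicit_dir`, along with the directory paths and block id, are hypothetical names chosen for illustration, and hashing a block id over a directory list is an assumed placement rule.

```python
import zlib
from dataclasses import dataclass
from enum import Enum


class Medium(Enum):
    """Hypothetical label for the storage medium a caller may request."""
    SSD = "ssd"
    HDD = "hdd"


@dataclass(frozen=True)
class LocalDirs:
    """Hypothetical configuration: local directories grouped by medium."""
    ssd: tuple
    hdd: tuple


def _pick(block_id: str, candidates: tuple) -> str:
    # Deterministic hash of the block id over the candidate directories.
    return candidates[zlib.crc32(block_id.encode()) % len(candidates)]


def default_dir(block_id: str, dirs: LocalDirs) -> str:
    """Configuration-only placement: the caller has no say in the medium."""
    return _pick(block_id, dirs.ssd + dirs.hdd)


def explicit_dir(block_id: str, medium: Medium, dirs: LocalDirs) -> str:
    """Explicit placement: hash only within the medium the caller named."""
    candidates = dirs.ssd if medium is Medium.SSD else dirs.hdd
    if not candidates:
        raise ValueError(f"no directories configured for {medium.value}")
    return _pick(block_id, candidates)


dirs = LocalDirs(ssd=("/mnt/ssd0",), hdd=("/mnt/hdd0", "/mnt/hdd1"))
print(explicit_dir("rdd_3_0", Medium.SSD, dirs))  # -> /mnt/ssd0
print(default_dir("rdd_3_0", dirs))               # any of the three directories
```

In this model a frequently reused block can be pinned to SSD with `Medium.SSD`, whereas `default_dir` may place the same block on either medium depending only on its hash.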

Key words: big data, hybrid storage, solid-state drive (SSD), Spark, persistence

CLC number: