ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2017, Vol. 54 ›› Issue (6): 1381-1390. doi: 10.7544/issn1000-1239.2017.20170108

Special Issue: 2017 Special Topic on Frontier Technologies in Computer Architecture (I)

Design of RDD Persistence Method in Spark for SSDs

Lu Kezhong1, Zhu Jinbin2,4, Li Zhengmin3, Sui Xiufeng4,5   

1(College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060); 2(School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 511400); 3(National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029); 4(State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190); 5(Strategic Studies Centre, Chinese Academy of Engineering, Beijing 100088)
Online: 2017-06-01

Abstract: Hybrid storage systems that combine SSDs (solid-state drives) and HDDs (hard disk drives) are widely used in big data computing datacenters. Workloads should be able to persist data with different characteristics to SSD or HDD on demand in order to improve overall system performance. Spark is an efficient data computing framework widely used in industry, especially for applications with multiple iterations, because it can persist data in memory or on disk; persisting data to disk removes the limit that insufficient memory places on the size of the data set. However, the current Spark implementation provides no explicit SSD-oriented persistence interface. Although data can be distributed proportionally across different storage media based on configuration information, users cannot specify an RDD's persistence location according to the characteristics of its data, so persistence lacks both targeting and flexibility. This has become a bottleneck to further improving Spark's performance and also seriously limits the performance that a hybrid storage system can deliver. To the best of our knowledge, this paper presents the first data persistence strategy for SSDs in Spark. We examine how data persistence works in Spark and optimize its architecture for hybrid storage systems. As a result, users can explicitly and flexibly specify the storage medium of an RDD through the persistence API we provide. Experimental results based on SparkBench show that performance improves by 14.02% on average.

Key words: big data, hybrid storage, solid-state drive (SSD), Spark, persistence
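
The difference between configuration-driven placement and explicit, per-RDD placement described in the abstract can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions and is not the paper's implementation or Spark's API: the `Medium` enum, the `LOCAL_DIRS` paths, and both functions are hypothetical names introduced here. The baseline function loosely mirrors how a block name may be hashed over all configured local directories, which is a simplification.

```python
import zlib
from enum import Enum


class Medium(Enum):
    """Hypothetical storage media a caller could request for a persisted RDD."""
    SSD = "ssd"
    HDD = "hdd"


# Hypothetical local directories, grouped by medium (assumed paths).
# With one SSD directory and two HDD directories, hashing over all of them
# sends roughly one third of the blocks to SSD, regardless of their content.
LOCAL_DIRS = {
    Medium.SSD: ["/mnt/ssd0/spark"],
    Medium.HDD: ["/mnt/hdd0/spark", "/mnt/hdd1/spark"],
}


def _pick(dirs, block_id):
    """Deterministically choose one directory by hashing the block name."""
    return dirs[zlib.crc32(block_id.encode("utf-8")) % len(dirs)]


def dir_by_config(block_id):
    """Baseline: spread blocks over every configured directory.

    The caller cannot influence which medium a given block lands on.
    """
    all_dirs = LOCAL_DIRS[Medium.SSD] + LOCAL_DIRS[Medium.HDD]
    return _pick(all_dirs, block_id)


def dir_by_medium(block_id, medium):
    """Explicit placement: restrict the choice to the requested medium."""
    return _pick(LOCAL_DIRS[medium], block_id)


if __name__ == "__main__":
    # A frequently re-read RDD partition is pinned to SSD, a cold one to HDD.
    print(dir_by_medium("rdd_3_0", Medium.SSD))
    print(dir_by_medium("rdd_7_0", Medium.HDD))
    print(dir_by_config("rdd_3_0"))
```

In this model, `dir_by_medium` plays the role of an explicit persistence interface: the caller decides per RDD which medium to use, whereas `dir_by_config` fixes the SSD/HDD split by how many directories of each kind are configured.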
