    Citation: Lu Kezhong, Zhu Jinbin, Li Zhengmin, Sui Xiufeng. Design of RDD Persistence Method in Spark for SSDs[J]. Journal of Computer Research and Development, 2017, 54(6): 1381-1390. DOI: 10.7544/issn1000-1239.2017.20170108

    Design of RDD Persistence Method in Spark for SSDs

      Abstract: Datacenters built on hybrid storage that combines solid-state drives (SSDs) and hard disk drives (HDDs) have become a high-performance platform for big data computing. Workloads in such datacenters should be able to persist data with different characteristics to SSD or HDD on demand, improving overall system performance. Spark is an efficient big data computing framework widely used in industry, and it is especially well suited to applications with many iterations, because it can persist intermediate data in memory or on disk; persisting to disk removes the limit that insufficient memory places on data set size. However, the current Spark implementation provides no explicit SSD-oriented persistence interface. Although data can be distributed proportionally across storage media according to configuration, users cannot choose the persistence medium of an RDD according to the characteristics of its data, so the mechanism lacks specificity and flexibility. This has become a bottleneck to further improving Spark's performance and keeps hybrid storage systems from delivering their full performance. To address this, the paper presents what is, to the best of the authors' knowledge, the first SSD-oriented data persistence strategy. It examines how data persistence works in Spark, optimizes Spark's persistence architecture for hybrid storage systems, and provides a dedicated persistence API through which users can explicitly and flexibly specify the persistence medium of an RDD. Experimental results on SparkBench show that Spark optimized with this scheme performs 14.02% better on average than the native version.
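      The idea of letting the caller name the persistence medium can be illustrated with a small, self-contained sketch. This is not the paper's API and not Spark code: `Medium`, `HybridStore`, `persist`, `load`, `medium_of`, and `demo` are hypothetical names invented for illustration, and two temporary directories stand in for an SSD mount and an HDD mount.

```python
import pickle
import tempfile
from enum import Enum
from pathlib import Path


class Medium(Enum):
    """Hypothetical storage media a dataset may be persisted to."""
    MEMORY = "memory"
    SSD = "ssd"
    HDD = "hdd"


class HybridStore:
    """Toy block store: one directory per disk medium, a dict for memory."""

    def __init__(self, ssd_dir, hdd_dir):
        # Assumption: each disk medium is reachable through its own directory.
        self._dirs = {Medium.SSD: Path(ssd_dir), Medium.HDD: Path(hdd_dir)}
        self._memory = {}
        self._location = {}  # block id -> medium chosen by the caller

    def persist(self, block_id, data, medium):
        """Store `data` on exactly the medium the caller asked for."""
        if medium is Medium.MEMORY:
            self._memory[block_id] = data
        else:
            path = self._dirs[medium] / f"{block_id}.pkl"
            path.write_bytes(pickle.dumps(data))
        self._location[block_id] = medium

    def load(self, block_id):
        """Read a block back from wherever it was persisted."""
        medium = self._location[block_id]
        if medium is Medium.MEMORY:
            return self._memory[block_id]
        path = self._dirs[medium] / f"{block_id}.pkl"
        return pickle.loads(path.read_bytes())

    def medium_of(self, block_id):
        return self._location[block_id]


def demo():
    """Persist a frequently reused block to 'SSD' and a cold one to 'HDD'."""
    with tempfile.TemporaryDirectory() as ssd, tempfile.TemporaryDirectory() as hdd:
        store = HybridStore(ssd, hdd)
        store.persist("hot_iter", [1, 2, 3], Medium.SSD)
        store.persist("cold_log", ["a", "b"], Medium.HDD)
        ssd_files = sorted(p.name for p in Path(ssd).iterdir())
        return store.load("hot_iter"), store.medium_of("cold_log"), ssd_files
```

      The point of the sketch is only the shape of the interface: the medium is an explicit argument chosen per dataset, rather than a cluster-wide ratio fixed in configuration. How the paper's implementation maps this choice onto Spark's internal block management is not described in the abstract.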

       
