Heterogeneous Memory Programming Framework Based on Spark for Big Data Processing

王晨曦; 吕方; 崔慧敏; 曹婷; John Zigman; 庄良吉; 冯晓兵

doi:10.7544/issn1000-1239.2018.20170687

Abstract

As big data applications grow, the volume of data that servers must process is increasing rapidly. To process data promptly and respond quickly to customers, industry is widely deploying in-memory computing systems such as Apache Spark. However, terabyte-scale memory raises both server cost and power consumption, and DRAM alone cannot satisfy the fast-growing memory demand of these systems for two reasons: first, DRAM can account for as much as 40% of total energy consumption; second, the scaling of DRAM manufacturing technology is approaching its limit, which constrains both power and capacity density. Researchers have therefore turned to non-volatile memory (NVM). Heterogeneous memory that combines DRAM and NVM offers low cost, low power consumption, and high capacity density, but because NVM has longer latency and lower bandwidth than DRAM, placing data in the appropriate memory module is a key research problem. This paper systematically analyzes the memory access behavior of Spark applications and, taking the memory usage characteristics of OpenJDK into account, proposes a programming framework that manages data placement between DRAM and NVM. Existing Spark applications can adopt the framework through simple calls to the interfaces it provides, without being rewritten. Experiments on Spark benchmarks show that placing only 20% to 25% of the data in DRAM and the remainder in NVM achieves about 90% of the performance obtained when all data is placed in DRAM. By using heterogeneous memory effectively, the framework can support the growing scale of in-memory computing, and its performance-to-price ratio is several times that of DRAM-only servers.
