Abstract:
Driven by the boom in big data applications, the amount of data processed by servers is increasing rapidly. To improve processing and response speed, industry is deploying in-memory big data computing systems such as Apache Spark. However, traditional DRAM cannot satisfy the large memory demands of these systems, for two reasons: first, DRAM can account for as much as 40% of total system energy consumption; second, the scaling of DRAM manufacturing technology is approaching its limit. As a result, heterogeneous memory that integrates DRAM and NVM (non-volatile memory) is a promising candidate for future memory systems. Because NVM has longer latency and lower bandwidth than DRAM, data must be placed in the appropriate memory module to achieve good performance. This paper analyzes the memory access behavior of Spark applications and proposes a heterogeneous memory programming framework based on Spark. The framework can be applied to existing Spark applications without rewriting their code. Experiments on Spark benchmarks show that, with our framework, placing only 20% to 25% of the data in DRAM and the remainder in NVM achieves 90% of the performance obtained when all data is placed in DRAM. This improves the performance-per-dollar ratio relative to DRAM-only servers and offers potential support for larger-scale in-memory computing applications.