ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2018, Vol. 55 ›› Issue (2): 246-264. doi: 10.7544/issn1000-1239.2018.20170687

Topic collection: 2018 Special Topic on Data Management for Emerging Hardware

• Software Technology •

面向大数据处理的基于Spark的异质内存编程框架

王晨曦1,2, 吕方1,4, 崔慧敏1, 曹婷1, John Zigman3, 庄良吉1,2, 冯晓兵1,2

  1(计算机体系结构国家重点实验室(中国科学院计算技术研究所) 北京 100190); 2(中国科学院大学 北京 100049); 3(澳大利亚野外机器人中心(悉尼大学) 澳大利亚悉尼 2006); 4(数学工程与先进计算国家重点实验室 江苏无锡 214125) (wangchenxi@ict.ac.cn)
  • Published: 2018-02-01
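  • Funding:
    National High Technology Research and Development Program of China (863 Program) (2015AA011505, 2015AA015306); National Basic Research Program of China (973 Program) (2016YFB1000402); National Natural Science Foundation of China (61402445, 61672492, 61432016, 61521092); Open Fund of the State Key Laboratory of Mathematical Engineering and Advanced Computing (2016A03)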

Heterogeneous Memory Programming Framework Based on Spark for Big Data Processing

Wang Chenxi1,2, Lü Fang1,4, Cui Huimin1, Cao Ting1, John Zigman3, Zhuang Liangji1,2, Feng Xiaobing1,2   

  1(State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190); 2(University of Chinese Academy of Sciences, Beijing 100049); 3(Australia Centre for Field Robotics (University of Sydney), Sydney, Australia 2006); 4(State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125)
  • Online: 2018-02-01

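Abstract (Chinese version): With the growth of big data applications, the volume of data to be processed is increasing sharply. To process data promptly and respond quickly to customers, enterprises are widely deploying in-memory computing systems such as Apache Spark. However, TB-scale memory raises not only server cost but also power consumption. Because the power consumption and capacity density of DRAM are limited by manufacturing-process bottlenecks, DRAM cannot meet the rapidly growing memory demand of in-memory computing, so researchers have gradually turned to emerging non-volatile memory (NVM). Heterogeneous memory built from DRAM and NVM offers low cost, low power consumption, and high capacity density, but because NVM has poorer read/write performance, how to place data appropriately across heterogeneous memory is a key research problem. This paper systematically analyzes the memory access characteristics of Spark applications and, taking into account how OpenJDK uses memory, proposes a programming framework that manages data placement between DRAM and NVM. By making simple calls to the interfaces provided in this paper, application developers can place data appropriately in heterogeneous memory. With only 20%~25% DRAM and a large amount of NVM, the framework achieves about 90% of the performance obtained with an equal amount of DRAM. The framework can use heterogeneous memory effectively to support the growing scale of in-memory computing. Meanwhile, the performance/price ratio is several times higher than with DRAM alone.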

Keywords (Chinese version): in-memory computing, Spark, heterogeneous memory, non-volatile memory, programming framework

Abstract: With the boom in big data applications, the amount of data processed by servers is increasing rapidly. To improve processing and response speed, industry is deploying in-memory big data computing systems such as Apache Spark. However, traditional DRAM cannot satisfy the large memory demands of these systems for two reasons: first, the energy consumption of DRAM can be as high as 40% of the total; second, the scaling of DRAM manufacturing technology is reaching its limit. As a result, heterogeneous memory that integrates DRAM and NVM (non-volatile memory) is a promising candidate for future memory systems. However, because NVM has longer latency and lower bandwidth than DRAM, data must be placed in the appropriate memory module to achieve good performance. This paper analyzes the memory access behavior of Spark applications and proposes a heterogeneous memory programming framework based on Spark. The framework can be applied to existing Spark applications without rewriting their code. Experiments on Spark benchmarks show that, with our framework, placing only 20%~25% of the data in DRAM and the remainder in NVM reaches 90% of the performance obtained when all data is placed in DRAM. This yields a better performance-per-dollar ratio than DRAM-only servers and the potential to support larger-scale in-memory computing applications.

Key words: in-memory computing, Spark, heterogeneous memory, non-volatile memory (NVM), programming framework
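
The performance-per-dollar claim in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: the function name `perf_per_dollar_gain` and the NVM-to-DRAM price ratio of 0.10 per GB are hypothetical assumptions chosen only for illustration, and the cost model counts memory cost alone, ignoring the rest of the server. Only the 20%~25% DRAM share and the roughly 90% relative performance come from the abstract.

```python
def perf_per_dollar_gain(dram_fraction, relative_perf, nvm_price_ratio):
    """Performance per dollar of a DRAM+NVM memory system, relative to a
    DRAM-only system of the same total capacity.

    dram_fraction   -- share of capacity held in DRAM (0.20 to 0.25 in the abstract)
    relative_perf   -- performance relative to DRAM-only (about 0.90 in the abstract)
    nvm_price_ratio -- price per GB of NVM divided by price per GB of DRAM
                       (ASSUMPTION: not given in the source)
    """
    if not 0.0 <= dram_fraction <= 1.0:
        raise ValueError("dram_fraction must be between 0 and 1")
    if nvm_price_ratio <= 0.0:
        raise ValueError("nvm_price_ratio must be positive")
    # Memory cost relative to DRAM-only: DRAM part at full price,
    # NVM part at the discounted price.
    relative_cost = dram_fraction + (1.0 - dram_fraction) * nvm_price_ratio
    return relative_perf / relative_cost


if __name__ == "__main__":
    # Hypothetical price ratio; substitute real prices before drawing conclusions.
    assumed_nvm_price_ratio = 0.10
    for dram_fraction in (0.20, 0.25):
        gain = perf_per_dollar_gain(dram_fraction, 0.90, assumed_nvm_price_ratio)
        print(f"DRAM share {dram_fraction:.0%}: {gain:.2f}x performance per dollar")
```

Under this simplified model the gain depends strongly on the assumed price ratio, so the numbers it prints are only indicative of how the trade-off behaves.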

CLC number: