Heterogeneous Memory Programming Framework Based on Spark for Big Data Processing

Wang Chenxi (王晨曦); Lü Fang (吕方); Cui Huimin (崔慧敏); Cao Ting (曹婷); John Zigman; Zhuang Liangji (庄良吉); Feng Xiaobing (冯晓兵)

doi:10.7544/issn1000-1239.2018.20170687

Wang Chenxi¹,², Lü Fang¹,⁴, Cui Huimin¹, Cao Ting¹, John Zigman³, Zhuang Liangji¹,², Feng Xiaobing¹,²

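¹(State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190)
²(University of Chinese Academy of Sciences, Beijing 100049)
³(Australian Centre for Field Robotics (University of Sydney), Sydney, Australia 2006)
⁴(State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125)
(wangchenxi@ict.ac.cn)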

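Funding: National High Technology Research and Development Program of China (863 Program) (2015AA011505, 2015AA015306); National Basic Research Program of China (973 Program) (2016YFB1000402); National Natural Science Foundation of China (61402445, 61672492, 61432016, 61521092); Open Fund of the State Key Laboratory of Mathematical Engineering and Advanced Computing (2016A03)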

CLC number: TP312
Publication history
- Published: 2018-01-31

Heterogeneous Memory Programming Framework Based on Spark for Big Data Processing

¹(State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190)
²(University of Chinese Academy of Sciences, Beijing 100049)
³(Australian Centre for Field Robotics (University of Sydney), Sydney, Australia 2006)
⁴(State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125)

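Abstract

Abstract: As big data applications grow, the volume of data to be processed is increasing sharply. To process data promptly and respond to customers quickly, enterprises are widely deploying in-memory computing systems, represented by Apache Spark. However, TB-scale memory raises not only server cost but also power consumption. Because the power consumption and capacity density of DRAM are constrained by manufacturing-process bottlenecks, DRAM cannot meet the rapidly growing memory demand of in-memory computing, and researchers have gradually turned to emerging non-volatile memory (NVM). Heterogeneous memory composed of DRAM and NVM offers low cost, low power consumption, and high capacity density, but because NVM has poorer read/write performance, how to place data appropriately across heterogeneous memory is a key research problem. This paper systematically analyzes the memory access characteristics of Spark applications and, taking into account how OpenJDK uses memory, proposes a programming framework that manages data placement between DRAM and NVM. By making simple calls to the interfaces provided in this paper, application developers can place data appropriately in heterogeneous memory. With only 20%~25% DRAM and a large amount of NVM, the framework reaches about 90% of the performance obtained when an equal amount of DRAM is used instead. By using heterogeneous memory effectively, the framework can support the ever-growing computing scale of in-memory computing. In addition, the performance/price ratio is several times higher than with DRAM only.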
- in-memory computing /
- Spark /
- heterogeneous memory /
- non-volatile memory (NVM) /
- programming framework
Abstract: Due to the boom in big data applications, the amount of data processed by servers is increasing rapidly. To improve processing and response speed, industry is deploying in-memory big data computing systems such as Apache Spark. However, traditional DRAM cannot satisfy the large memory demand of these systems, for two reasons: first, DRAM can account for as much as 40% of total energy consumption; second, the scaling of DRAM manufacturing technology is reaching its limit. As a result, heterogeneous memory that integrates DRAM and non-volatile memory (NVM) is a promising candidate for future memory systems. However, because NVM has longer latency and lower bandwidth than DRAM, data must be placed in the appropriate memory module to achieve good performance. This paper analyzes the memory access behavior of Spark applications and proposes a heterogeneous memory programming framework based on Spark. The framework is easy to apply to existing Spark applications without rewriting their code. Experiments on Spark benchmarks show that, with our framework, placing only 20%~25% of the data in DRAM and the remainder in NVM reaches 90% of the performance achieved when all data is placed in DRAM. This yields a better performance-per-dollar ratio than DRAM-only servers and potential support for larger-scale in-memory computing applications.
- in-memory computing /
- Spark /
- heterogeneous memory /
- non-volatile memory (NVM) /
- programming framework
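
The performance-per-dollar comparison in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the function name `perf_per_price_gain` and the NVM-to-DRAM price ratios are hypothetical, and only memory cost is modeled (not the cost of the rest of the server). Only the 25% DRAM share and the roughly 90% relative performance come from the abstract.

```python
def perf_per_price_gain(dram_fraction, relative_perf, nvm_price_ratio):
    """Performance/price of a DRAM+NVM system relative to a DRAM-only system
    of the same total capacity (memory cost only).

    dram_fraction   -- share of total capacity provided by DRAM, in [0, 1]
    relative_perf   -- hybrid performance divided by DRAM-only performance
    nvm_price_ratio -- ASSUMED price per GB of NVM divided by that of DRAM
    """
    if not 0.0 <= dram_fraction <= 1.0:
        raise ValueError("dram_fraction must lie in [0, 1]")
    if nvm_price_ratio <= 0.0:
        raise ValueError("nvm_price_ratio must be positive")
    # Memory cost of the hybrid system, normalized so that DRAM-only costs 1.
    relative_cost = dram_fraction + (1.0 - dram_fraction) * nvm_price_ratio
    return relative_perf / relative_cost


# Hypothetical price ratios; the abstract does not state NVM prices.
for assumed_ratio in (0.1, 0.25, 0.5):
    gain = perf_per_price_gain(0.25, 0.9, assumed_ratio)
    print(f"assumed NVM/DRAM price ratio {assumed_ratio}: gain {gain:.2f}x")
```

Under this simple model the gain exceeds 1 only when NVM is sufficiently cheaper per GB than DRAM, so the size of the improvement depends entirely on the assumed price ratio.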
