ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2018, Vol. 55 ›› Issue (2): 229-245.doi: 10.7544/issn1000-1239.2018.20170742

所属专题: 2018面向新型硬件的数据管理专题

• 软件技术 • 上一篇    下一篇

NV-Shuffle:基于非易失内存的Shuffle机制

潘锋烽, 熊劲   

  1. (计算机体系结构国家重点实验室(中国科学院计算技术研究所) 北京 100190) (中国科学院大学 北京 100049) (panfengfeng@ict.ac.cn)
  • 出版日期: 2018-02-01
  • 基金资助: 
    国家重点研发计划项目(2016YFB1000202);国家自然科学基金项目(61379042)

NV-Shuffle: Shuffle Based on Non-Volatile Memory

Pan Fengfeng, Xiong Jin   

  1. (State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190) (University of Chinese Academy of Sciences, Beijing 100049)
  • Online: 2018-02-01

摘要: Shuffle是大数据处理过程中一个极为重要的阶段.不同类型的Task(或者Stage)之间通过Shuffle进行数据交换.在Shuffle过程中数据需要进行持久化,以达到避免重计算和容错的目的.因此Shuffle的性能是决定大数据处理性能的关键因素之一.由于传统Shuffle阶段的数据通过磁盘文件系统进行持久化,所以影响Shuffle性能的一个重要因素是I/O开销,尤其是对基于内存计算的大数据处理平台,例如Spark,Shuffle阶段的磁盘I/O可能拖延数据处理的时间.而非易失内存(NVM)具有读写速度快、非易失性以及高密度性等诸多优点,它们为改变大数据处理过程中对磁盘I/O的依赖、克服目前基于内存计算的大数据处理中的I/O性能瓶颈提供了新机会.提出一种基于NVM的Shuffle优化策略——NV-Shuffle.NV-Shuffle摒弃了传统Shuffle阶段采用文件系统的存储方式,而使用类似于Memory访问的方式进行Shuffle数据的存储与管理,避免了文件系统的开销,并充分发挥NVM的优势,从而减少Shuffle阶段的耗时.在Spark平台上实现了NV-Shuffle,实验结果显示,对于Shuffle-heavy类型的负载,NV-Shuffle可节省大约10%~40%的执行时间.

关键词: 大数据处理, Shuffle, 非易失内存, 非易失缓冲区, 容错

Abstract: In the popular big data processing platforms like Spark, it is common to collect data in a many-to-many fashion during a stage traditionally known as the Shuffle phase. Data exchange happens across different types of tasks or stages via Shuffle phase. And during this phase, the data need to be transferred via network and persisted into traditional disk-based file system. Hence, the efficiency of Shuffle phase is one of the key factors in the performance of the big data processing. In order to reducing I/O overheads, we propose an optimized Shuffle strategy based on Non-Volatile Memory (NVM)—NV-Shuffle. Next-generation non-volatile memory (NVM) technologies, such as Phase Change Memory (PCM), Spin-Transfer Torque Magnetic Memories (STTMs) introduce new opportunities for reducing I/O overhead, due to their non-volatility, high read/write performance, low energy, etc. In the big data processing platform based on memory computing such as Spark, Shuffle data access based on disks is an important factor of application performance, NV-Shuffle uses NVM as persist memory to store Shuffle data and employs direct data accesses like memory by introducing NV-Buffer to organize data instead of traditional file system.We implemented NV-Shuffle in Spark. Our performance results show, NV-shuffle reduces job execution time by 10%~40% for Shuffle-heavy workloads.

Key words: big data processing, Shuffle, NVM(non-volatile memory), non-volatile buffer, fault tolerance

中图分类号: