ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (6): 1179-1190. doi: 10.7544/issn1000-1239.2020.20200109

Special topic: 2020 Special Topic on Frontier Technologies in Computer Architecture

• System Architecture •

A Spark-Based Cross-Layer Analysis Tool for Big Data Memory Access Behavior

许丹亚1,王晶1,2,王利3,张伟功2,3   

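  1. 1(Information Engineering College, Capital Normal University, Beijing 100048); 2(Beijing Engineering Research Center of High Reliable Embedded System (Capital Normal University), Beijing 100048); 3(Beijing Advanced Innovation Center for Imaging Theory and Technology (Capital Normal University), Beijing 100048) (xudanya@cnu.edu.cn)
  • Published: 2020-06-01
  • Funding: 
    National Natural Science Foundation of China (61772350); Beijing Nova Program (Z181100006218093); Research Fund from Beijing Innovation Center for Future Chips (KYJJ2018008); Construction Plan of Beijing High-level Teacher Team (CIT&TCD201704082); Capacity Building for Sci-Tech Innovation Fundamental Scientific Research Funds (19530050173)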

A Cross-Layer Memory Tracing Toolkit for Big Data Application Based on Spark

Xu Danya1, Wang Jing1,2, Wang Li3, Zhang Weigong2,3   

  1. 1(Information Engineering College, Capital Normal University, Beijing 100048);2(Beijing Engineering Research Center of High Reliable Embedded System (Capital Normal University), Beijing 100048);3(Beijing Advanced Innovation Center for Imaging Theory and Technology (Capital Normal University), Beijing 100048)
  • Online: 2020-06-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61772350), the Beijing Nova Program (Z181100006218093), the Research Fund from Beijing Innovation Center for Future Chips (KYJJ2018008), the Construction Plan of Beijing High-level Teacher Team (CIT&TCD201704082), and the Capacity Building for Sci-Tech Innovation Fundamental Scientific Research Funds (19530050173).

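Abstract: The arrival of the big data era has brought new challenges to information processing, and Spark, with its in-memory computing model, has significantly improved data processing performance. Performance optimization and analysis of Spark can be carried out at the application layer, the system layer, and the hardware layer. Existing work, however, is confined to a single layer, which separates Spark semantics from the underlying actions. For example, without knowledge of how operating system parameters affect performance at the Spark application layer, a large number of flexible operating system configuration parameters cannot be put to use. To address this problem, we design SMTT, an analysis tool for the Spark storage system that connects the Spark layer, the JVM layer, and the OS layer and links the semantics of the upper-layer application to the underlying physical memory information. Based on the characteristics of Spark memory, SMTT provides separate tracing schemes for execution memory and storage memory. Using SMTT, we analyze memory usage during Spark iterative computation as well as execution/storage memory usage across the Spark, JVM, and OS layers. Taking RDD as an example, we also use SMTT to analyze the ratio of read to write operations in Spark in single-node and multi-node settings. The results show that this work provides strong support for performance analysis and optimization of the Spark memory system.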

Keywords: big data, Spark, memory management, cross-layer analysis, memory tracing

Abstract: Spark has been increasingly adopted in industry for big data analytics because of its efficient in-memory distributed programming model. Most existing optimization and analysis tools for Spark operate at either the application layer or the operating system layer in isolation, which separates Spark semantics from the underlying actions. For example, without knowing how operating system parameters affect performance at the Spark layer, users cannot tell how to use those parameters to tune system performance. In this paper, we propose SMTT, a new Spark memory tracing toolkit that connects the semantics of the upper-layer application with the underlying physical hardware across the Spark, JVM, and OS layers. Based on the characteristics of Spark memory, we design separate tracing schemes for execution memory and storage memory. We then use SMTT to analyze the Spark iterative computation process and execution/storage memory usage. An experiment analyzing RDD memory accesses shows that the toolkit can be used effectively for performance analysis and can guide optimization of the Spark memory system.

Key words: big data, Spark, memory management, cross-layer analysis, memory tracing
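
For readers unfamiliar with the two regions the abstract distinguishes, the following is a minimal sketch of how Spark's unified memory manager sizes the storage and execution regions at startup. It is not taken from the paper and is not part of SMTT. The function name `unified_memory_regions` is hypothetical, and `heap_bytes` is assumed to be the JVM maximum heap as Spark sees it. The 300 MB reserved amount and the defaults for `spark.memory.fraction` (0.6) and `spark.memory.storageFraction` (0.5) are the values documented for recent Spark releases.

```python
# Illustrative only: initial sizes of Spark's storage and execution regions
# under the unified memory manager. Not the SMTT implementation.

RESERVED_BYTES = 300 * 1024 * 1024  # memory Spark sets aside before splitting the heap


def unified_memory_regions(heap_bytes, memory_fraction=0.6, storage_fraction=0.5):
    """Return (storage_bytes, execution_bytes) for a given JVM heap size.

    memory_fraction mirrors spark.memory.fraction and
    storage_fraction mirrors spark.memory.storageFraction.
    """
    if heap_bytes <= RESERVED_BYTES:
        raise ValueError("heap must be larger than the reserved memory")
    # Share of the heap (after the reserve) managed by Spark.
    usable = round((heap_bytes - RESERVED_BYTES) * memory_fraction)
    # Storage region holds cached RDD blocks; the rest starts as execution
    # memory (shuffles, joins, sorts, aggregations).
    storage = round(usable * storage_fraction)
    execution = usable - storage
    return storage, execution


if __name__ == "__main__":
    heap = RESERVED_BYTES + 1000 * 1024 * 1024
    storage, execution = unified_memory_regions(heap)
    print(f"storage={storage} bytes, execution={execution} bytes")
```

The boundary between the two regions is soft in Spark itself: execution can borrow unused storage memory and the reverse, so these numbers describe only the starting split. That shifting boundary is one reason tracing the two regions separately is informative.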

CLC number: