ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2020, Vol. 57, Issue (6): 1179-1190. doi: 10.7544/issn1000-1239.2020.20200109

Special Issue: 2020 Special Topic on Frontier Technologies of Computer Architecture


A Cross-Layer Memory Tracing Toolkit for Big Data Application Based on Spark

Xu Danya (1), Wang Jing (1, 2), Wang Li (3), Zhang Weigong (2, 3)

  1. Information Engineering College, Capital Normal University, Beijing 100048
  2. Beijing Engineering Research Center of High Reliable Embedded System (Capital Normal University), Beijing 100048
  3. Beijing Advanced Innovation Center for Imaging Theory and Technology (Capital Normal University), Beijing 100048
  • Online: 2020-06-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61772350), the Beijing Nova Program (Z181100006218093), the Research Fund from Beijing Innovation Center for Future Chips (KYJJ2018008), the Construction Plan of Beijing High-level Teacher Team (CIT&TCD201704082), and the Capacity Building for Sci-Tech Innovation Fundamental Scientific Research Funds (19530050173).

Abstract: Spark is increasingly used in industry for big data analytics because of its efficient in-memory distributed programming model. Most existing optimization and analysis tools for Spark operate at either the application layer or the operating system layer in isolation, which separates Spark semantics from the underlying system behavior. For example, without knowing how operating system parameters affect performance at the Spark layer, users cannot tell how to use those parameters to tune system performance. In this paper, we propose SMTT, a new Spark memory tracing toolkit that links the semantics of the upper-layer application to the underlying physical hardware across the Spark, JVM, and OS layers. Based on the characteristics of Spark memory, we design separate tracing schemes for execution memory and storage memory. We then use SMTT to analyze Spark's iterative computation process and its execution/storage memory usage. An experiment on RDD memory assessment shows that the toolkit can be used effectively for performance analysis and can guide optimization of the Spark memory system.

Key words: big data, Spark, memory management, cross-layer analysis, memory tracing

CLC Number: