Xu Danya, Wang Jing, Wang Li, Zhang Weigong. A Cross-Layer Memory Tracing Toolkit for Big Data Application Based on Spark[J]. Journal of Computer Research and Development, 2020, 57(6): 1179-1190. DOI: 10.7544/issn1000-1239.2020.20200109
1(Information Engineering College, Capital Normal University, Beijing 100048)
2(Beijing Engineering Research Center of High Reliable Embedded System (Capital Normal University), Beijing 100048)
3(Beijing Advanced Innovation Center for Imaging Theory and Technology (Capital Normal University), Beijing 100048)
Funds: This work was supported by the National Natural Science Foundation of China (61772350), the Beijing Nova Program (Z181100006218093), the Research Fund from Beijing Innovation Center for Future Chips (KYJJ2018008), the Construction Plan of Beijing High-level Teacher Team (CIT&TCD201704082), and the Capacity Building for Sci-Tech Innovation Fundamental Scientific Research Funds (19530050173).
Spark has been increasingly adopted in industry for big data analytics because of its efficient in-memory distributed programming model. Most existing optimization and analysis tools for Spark operate at either the application layer or the operating system layer in isolation, which separates Spark semantics from the underlying system behavior. For example, without knowing how operating system parameters affect performance at the Spark layer, users cannot tell how to tune those parameters to improve system performance. In this paper, we propose SMTT, a new Spark memory tracing toolkit that links the semantics of the upper-layer application to the underlying physical hardware across the Spark layer, the JVM layer, and the OS layer. Based on the characteristics of Spark memory, we design separate tracing schemes for execution memory and storage memory. We then use SMTT to analyze the Spark iterative computation process and the usage of execution and storage memory. An experiment on RDD memory assessment shows that the toolkit can be used effectively for performance analysis and can guide optimization of the Spark memory system.