A Cross-Layer Memory Tracing Toolkit for Big Data Application Based on Spark

许丹亚; 王晶; 王利; 张伟功

doi:10.7544/issn1000-1239.2020.20200109

许丹亚¹, 王晶¹,², 王利³, 张伟功²,³

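¹(Information Engineering College, Capital Normal University, Beijing 100048)
²(Beijing Engineering Research Center of High Reliable Embedded System (Capital Normal University), Beijing 100048)
³(Beijing Advanced Innovation Center for Imaging Theory and Technology (Capital Normal University), Beijing 100048) (xudanya@cnu.edu.cn)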

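Funds: This work was supported by the National Natural Science Foundation of China (61772350), the Beijing Nova Program (Z181100006218093), the Research Fund from Beijing Innovation Center for Future Chips (KYJJ2018008), the Construction Plan of Beijing High-level Teacher Team (CIT&TCD201704082), and the Capacity Building for Sci-Tech Innovation Fundamental Scientific Research Funds (19530050173).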

Details

CLC number: TP391
Publication history
- Published: 2020-05-31

A Cross-Layer Memory Tracing Toolkit for Big Data Application Based on Spark

¹(Information Engineering College, Capital Normal University, Beijing 100048)
²(Beijing Engineering Research Center of High Reliable Embedded System (Capital Normal University), Beijing 100048)
³(Beijing Advanced Innovation Center for Imaging Theory and Technology (Capital Normal University), Beijing 100048)

Funds: This work was supported by the National Natural Science Foundation of China (61772350), the Beijing Nova Program (Z181100006218093), the Research Fund from Beijing Innovation Center for Future Chips (KYJJ2018008), the Construction Plan of Beijing High-level Teacher Team (CIT&TCD201704082), and the Capacity Building for Sci-Tech Innovation Fundamental Scientific Research Funds (19530050173).

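Abstract

Abstract: The arrival of the big data era has brought new challenges to information processing, and Spark, with its in-memory computing model, has significantly improved data processing performance. Performance optimization and analysis of Spark can be carried out at the application layer, the system layer, and the hardware layer. However, existing work is confined to a single layer, which disconnects Spark semantics from the underlying actions. For example, because the effect of operating system parameters on the performance of the Spark application layer is unknown, a large number of flexible operating system configuration parameters cannot be put to use. To address these problems, we design SMTT, an analysis tool for the Spark storage system, which connects the Spark layer, the JVM layer, and the OS layer and links the semantics of the upper-layer application to the underlying physical memory information. Based on the characteristics of Spark memory, SMTT provides separate tracing methods for execution memory and storage memory. Using SMTT, we analyze memory usage during Spark iterative computation and the use of execution and storage memory across the Spark, JVM, and OS layers. Taking RDD as an example, we also use SMTT to analyze the ratio of read to write operations in Spark in single-node and multi-node settings. The results show that this work provides strong support for performance analysis and optimization of the Spark memory system.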
Keywords: big data; Spark; memory management; cross-layer analysis; memory tracing
Abstract: Spark has been increasingly adopted by industry for big data analytics because of its efficient in-memory distributed programming model. Most existing optimization and analysis tools for Spark operate at either the application layer or the operating system layer in isolation, which separates Spark semantics from the underlying actions. For example, without knowing how operating system parameters affect performance at the Spark layer, users cannot tell how to use OS parameters to tune system performance. In this paper, we propose SMTT, a new Spark memory tracing toolkit, which links the semantics of the upper-layer application to the underlying physical hardware across the Spark, JVM, and OS layers. Based on the characteristics of Spark memory, we design separate tracing schemes for execution memory and storage memory. We then use SMTT to analyze the Spark iterative computation process and execution/storage memory usage. An experiment analyzing RDD memory accesses shows that the toolkit can be used effectively for performance analysis and can guide optimization of the Spark memory system.
Keywords: big data; Spark; memory management; cross-layer analysis; memory tracing
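
As background for the execution/storage distinction in the abstract, the sketch below estimates how an executor heap is divided under Spark's unified memory manager. It is not part of SMTT. It assumes the documented defaults of Spark 2.x and later (300 MiB reserved, `spark.memory.fraction` = 0.6, `spark.memory.storageFraction` = 0.5). The function name `unified_memory_regions` is hypothetical.

```python
# Background sketch only: approximates the region sizes implied by Spark's
# unified memory manager settings. It does not reproduce SMTT's tracing.

RESERVED_BYTES = 300 * 1024 * 1024  # heap that Spark sets aside for itself


def unified_memory_regions(heap_bytes, memory_fraction=0.6, storage_fraction=0.5):
    """Return the nominal sizes (in bytes) of the unified, storage, and
    execution regions for one executor heap.

    memory_fraction mirrors spark.memory.fraction and storage_fraction
    mirrors spark.memory.storageFraction (assumed defaults shown).
    """
    if heap_bytes <= RESERVED_BYTES:
        raise ValueError("heap must be larger than the reserved memory")
    usable = heap_bytes - RESERVED_BYTES
    unified = int(usable * memory_fraction)
    storage = int(unified * storage_fraction)
    execution = unified - storage
    return {"unified": unified, "storage": storage, "execution": execution}


if __name__ == "__main__":
    regions = unified_memory_regions(4 * 1024 ** 3)  # a 4 GiB executor heap
    for name, size in regions.items():
        print(f"{name}: {size / 1024 ** 2:.1f} MiB")
```

The boundary between the two regions is soft in Spark: execution may borrow unused storage memory and the reverse. These numbers are therefore nominal sizes, not a trace of actual usage.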
