ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2021, Vol. 58 ›› Issue (3): 445-457.doi: 10.7544/issn1000-1239.2021.20180601

• 软件技术 •    下一篇

基于PPR模型的稀疏矩阵向量乘及卷积性能优化研究

谢震1,2,3,谭光明1,2,孙凝晖1,2   

  1. 1(计算机体系结构国家重点实验室(中国科学院计算技术研究所) 北京 100190);2(中国科学院计算技术研究所 北京 100190);3(中国科学院大学计算机与控制学院 北京 100049) (xiezhen@ncic.ac.cn)
  • 出版日期: 2021-03-01
  • 基金资助: 
    国家重点研发项目(2018YFB0204400);中国科学院战略性先导科技专项(C类)(XDC05010100);国家自然科学基金项目(62032023, 61972377, 61702483)

Research on Optimal Performance of Sparse Matrix-Vector Multiplication and Convoulution Using the Probability-Process-Ram Model

Xie Zhen1,2,3, Tan Guangming1,2, Sun Ninghui1,2   

  1. 1(State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190);2(Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190);3(School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049)
  • Online: 2021-03-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2018YFB0204400), the Strategic Priority Research Program of Chinese Academy of Sciences (C)(XDC05010100), and the National Natural Science Foundation of China (62032023, 61972377, 61702483).

摘要: 稀疏矩阵向量乘和卷积作为高性能计算的两大计算核心, 是非规则和规则访存的典型代表.目前已经做了许多针对性的优化工作, 但是对于大量运行着不同指令集和拥有不同计算和访存性能的机器, 仍然无法判定在特定的体系结构下导致性能效率无法被完全释放的主要原因及性能瓶颈, 同时也很难准确预测出程序在特定机器上可达到的最佳性能.通过使用性能模型方法, 建模程序在真实机器上的运行细节, 可以得出更加精确的性能预测, 并且根据模型输出的反馈信息提出针对性的优化指导.提出了PPR(probability-process-ram)模型, 并在一个通用处理器上建模程序内指令执行和数据传输开销, 其中包括使用模型预测各种指令数量及内存层次之间的数据传输大小去分析程序各个阶段的性能瓶颈, 并且根据模型反馈的信息提出优化方案以及优化后的性能期望.最终使用PPR建模和优化2个计算核心, 同时也比较了与常用的Roofline和ECM模型的区别.

关键词: 性能模型, 反馈优化, 稀疏矩阵向量乘, 卷积, cache模拟器

Abstract: Performance models provide insightful perspectives to allow us to predict performance and propose optimization guidance. Although there has been much research, pinpointing bottlenecks of various memory access patterns and reaching high performance of both regular and irregular programs on various hardware configurations are still not trivial. In this work, we propose a novel model called probability-process-ram (PPR) to quantify the amount of compute and data transfer time on general-purpose multicore processors. The PPR model predicts the number of instruction for single-core and probability of memory access between each memory hierarchy through a newly designed cache simulator. By using the automatically extracted best optimization method and expectation, we use PPR model for analyzing and optimizing sparse matrix-vector multiplication and 1D convolution as case study for typical irregular and regular computational kernels. Then we obtain best block sizes for sparse matrices with various sparsity structures, as well as optimal optimization guidance for 1D convolution with different instruction sets support and data sizes. Comparison with Roofline model and ECM model, the proposed PPR model greatly improves prediction accuracy by the newly designed cache simulator and achieves comprehensive feedback ability.

Key words: performance model, feedback optimization, sparse matrix-vector multiplication, convolu-tion, cache simulator

中图分类号: