基于高预测性的稀疏矩阵向量乘法并行计算优化

夏天; 付格林; 曲劭儒; 罗中沛; 任鹏举

doi:10.7544/issn1000-1239.202330421

基于高预测性的稀疏矩阵向量乘法并行计算优化

Optimization of Parallel Computation on Sparse Matrix-Vector Multiplication with High Predictability

摘要

摘要: 稀疏矩阵向量乘法(sparse matrix-vector multiplication, SpMV)是广泛应用于科学计算、工业仿真和智能计算等领域的重要算法，是核心的计算行为之一. 在一些应用场景中，需要进行多次的SpMV迭代，以完成精确的数值模拟、线性代数求解和图分析收敛等计算要求. 受限于SpMV本身的高度随机性和稀疏性所导致的数据局部性极差、缓存效率极低、计算模式非常不规则等问题，导致其计算负载成为当前高性能处理器的优化难点和研究热点. 基于现代高性能超标量乱序处理器的架构特征，深入研究SpMV的各类性能瓶颈，并且提出从提升可预测性和降低程序复杂度的角度进行全面的性能优化. 其核心思想是：通过构建串行访问的数据结构，提升数据访问的规律性和局部性，大幅度优化数据预取效率和缓存利用效率；通过构建规则的分支跳转条件，提升程序的分支预测准确率，有效提升程序执行效率；通过灵活运用SIMD指令集，有效提升计算资源利用率. 通过对以上特性的优化，该方法可以显著缓解性能瓶颈，大幅度提升处理器资源、缓存资源和访存带宽的利用率，并且获得与主流商用计算库MKL相比平均2.6倍的加速比，相比于现有最先进算法获得平均1.3倍的加速比.

Abstract: Sparse matrix-vector multiplication (SpMV) has been widely applied in scientific computation, industry simulation and intelligent computation domains, which is the critical algorithm in all these applications. Usually, iterative computation of SpMV is required to fulfill precise numeric simulation, linear algebra solving and graph analytics requirements. However, due to the poor data locality, low cache usage and extreme irregular computation patterns caused by the highly sparse and random distributions, SpMV optimization has become one of the most challenging problems for modern high-performance processors. In this paper, we study the bottlenecks of SpMV on current out-of-order CPUs and propose to improve its performance by pursuing high predictability and low program complexity. Specifically, we improve the memory access regularity and locality by creating serialized access patterns so that the data prefetching efficiency and cache usage are optimized. We also improve the pipeline efficiency by creating regular branch patterns to make the branch prediction more accurate. Meanwhile, we flexibly lever the SIMD instructions to optimize the parallel execution and fully use CPU’s computation resources. Experimental results show that using the above optimization approaches, our SpMV kernel is effective to significantly alleviate the critical bottlenecks and improve the efficiency of CPU pipeline, cache and memory bandwidth usage. The resulting performance achieves average 2.6 times speedup against Intel’s commercial library of MKL, as well as average 1.3 times speedup against the existing best SpMV algorithm.

HTML全文

参考文献(24)

施引文献

资源附件(0)