Citation: Xia Tian, Fu Gelin, Qu Shaoru, Luo Zhongpei, Ren Pengju. Optimization of Parallel Computation on Sparse Matrix-Vector Multiplication with High Predictability[J]. Journal of Computer Research and Development, 2023, 60(9): 1973-1987. DOI: 10.7544/issn1000-1239.202330421
Sparse matrix-vector multiplication (SpMV) is a critical kernel widely used in scientific computing, industrial simulation, and intelligent computing. These applications typically require iterative SpMV computation to meet the demands of precise numerical simulation, linear algebra solving, and graph analytics. However, because highly sparse and random nonzero distributions cause poor data locality, low cache utilization, and extremely irregular computation patterns, SpMV optimization has become one of the most challenging problems on modern high-performance processors. In this paper, we study the bottlenecks of SpMV on current out-of-order CPUs and propose to improve its performance by pursuing high predictability and low program complexity. Specifically, we improve memory access regularity and locality by creating serialized access patterns, which optimizes data prefetching efficiency and cache utilization. We also improve pipeline efficiency by creating regular branch patterns that make branch prediction more accurate. Meanwhile, we flexibly leverage SIMD instructions to parallelize execution and fully exploit the CPU's computational resources. Experimental results show that with these optimizations, our SpMV kernel significantly alleviates the critical bottlenecks and improves the efficiency of the CPU pipeline, cache, and memory bandwidth. The resulting performance achieves an average 2.6x speedup over Intel's commercial MKL library, as well as an average 1.3x speedup over the best existing SpMV algorithm.
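To make the baseline concrete, the following minimal C sketch shows a standard CSR-format SpMV kernel of the kind these optimizations target. It is an illustrative reconstruction under common CSR conventions, not the authors' optimized implementation; the comments mark the irregularities the paper addresses.

/* Minimal baseline CSR SpMV: y = A * x.
 * A sketch of the conventional kernel, not the paper's optimized one. */
#include <stddef.h>

void spmv_csr(size_t n_rows,
              const size_t *row_ptr,  /* n_rows + 1 entries; row_ptr[i]..row_ptr[i+1] spans row i */
              const size_t *col_idx,  /* column index of each nonzero */
              const double *vals,     /* nonzero values */
              const double *x,        /* dense input vector */
              double *y)              /* dense output vector */
{
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        /* The inner loop's trip count varies per row (hard-to-predict
         * branches) and x[col_idx[j]] is an indirect, effectively random
         * read (poor locality, defeats hardware prefetching) -- precisely
         * the unpredictability the paper removes by serializing accesses
         * and regularizing branch patterns. */
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];
        y[i] = sum;
    }
}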