Citation: | Yan Zhiyuan, Xie Biwei, Bao Yungang. HVMS: A Hybrid Vectorization-Optimized Mechanism of SpMV[J]. Journal of Computer Research and Development, 2024, 61(12): 2969-2984. DOI: 10.7544/issn1000-1239.202330204 |
Sparse matrix-vector multiplication (SpMV) plays an important role in a wide variety of scientific and engineering applications. Executing SpMV efficiently on vector processors is challenging because of the irregular memory accesses of sparse matrices. We conduct a detailed analysis of the main factors that affect the efficiency of SpMV vectorization. Beyond the irregular distribution of non-zero elements within a single sparse matrix, different sparse matrices also exhibit huge variations in their non-zero distribution characteristics. It is therefore difficult to apply a universal vector optimization method to matrices with such diverse characteristics. Furthermore, vector computing capabilities and vector instruction sets differ significantly across vector processors. The primary challenge of SpMV vectorization lies in mapping irregular sparse matrices onto regular vector processors. In this paper, we propose a hybrid vectorization-optimized mechanism of SpMV (HVMS). HVMS models the characteristics of vector processors and, based on a set of abstracted basic operations, designs corresponding rules to guide the regularization conversion of sparse matrices. HVMS divides a matrix into different parts according to their characteristics, so that the non-zero distribution within each part is less irregular, and then introduces corresponding optimization mechanisms to boost the vectorization efficiency of SpMV. We implement and evaluate HVMS on an Intel Xeon processor and compare it with three state-of-the-art approaches on 30 sparse matrices. Experimental results show that HVMS achieves average speedups of 1.60x, 1.72x, and 1.93x over CVR, SELL-C-σ, and Intel MKL, respectively.
[1] |
Zilli G. Iterative methods for solving sparse linear systems with a parallel preconditioner[J]. International Journal of Computer Mathematics, 1992, 44(1/2/3/4): 111−119
|
[2] |
Diaz J, Munoz-Caro C, Nino A. A survey of parallel programming models and tools in the multi and many-core era[J]. IEEE Transactions on Parallel and Distributed Systems, 2012, 23(8): 1369−1386 doi: 10.1109/TPDS.2011.308
|
[3] |
Chrysos G. Intel® Xeon Phi™ coprocessor-the architecture[EB/OL]. 2014[2023-07-31]. http://gec.di.uminho.pt/minf/cpd1314/SCD/Intel_Xeon-PhiArch.pdf
|
[4] |
Anderson C S, Zhang Jingwei, Cornea M. Enhanced vector math support on the Intel® AVX-512 architecture[C]//Proc of the 25th Symp on Computer Arithmetic (ARITH). Piscataway, NJ: IEEE, 2018: 120−124
|
[5] |
Reddy V G. NEON technology introduction[J]. ARM Corporation, 2008, 4(1): 1−33
|
[6] |
Wang Endong, Zhang Qing, Shen Bo, et al. Intel math kernel library[EB/OL]. 2014[2023-06-13]. https://link.springer.com/chapter/10.1007/978-3-319-06486-4_7
|
[7] |
Liu Weifeng, Vinter B. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication[C]//Proc of the 29th Int Conf on Supercomputing. New York: ACM, 2015: 339−350
|
[8] |
Xie Biwei, Zhan Jianfeng, Liu Xu, et al. CVR: Efficient vectorization of SpMV on x86 processors[C]//Proc of the 16th Annual IEEE/ACM Int Symp on Code Generation and Optimization. New York: ACM, 2018: 149−162
|
[9] |
Kreutzer M, Hager G, Wellein G, et al. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units[J]. SIAM Journal on Scientific Computing, 2014, 36(5): C401−C423 doi: 10.1137/130930352
|
[10] |
Flynn M J. Some computer organizations and their effectiveness[J]. IEEE Transactions on Computers, 1972, 21(9): 948−960
|
[11] |
Lomont C. Introduction to Intel advanced vector extensions[EB/OL]. 2011[2023-06-13]. https://hpc.llnl.gov/sites/default/files/intelAVXintro.pdf
|
[12] |
Oberman S, Favor G, Weber F. AMD 3DNow! Technology: Architecture and implementations[J]. IEEE Micro, 1999, 19(2): 37−48 doi: 10.1109/40.755466
|
[13] |
Luebke D, Harris M, Govindaraju N, et al. GPGPU: General-purpose computation on graphics hardware[C/OL]//Proc of the 2006 ACM/IEEE Conf on Supercomputing. New York: ACM, 2006[2023-08-04].https://dl.acm.org/doi/10.1145/1188455.1188672
|
[14] |
Davis T A, Hu Yifan. The University of Florida sparse matrix collection[J]. ACM Transactions on Mathematical Software, 2011, 38(1): 1−25
|
[15] |
Li Chenyang, Xia Tian, Zhao Wenzhe, et al. SpV8: Pursuing optimal vectorization and regular computation pattern in SpMV[C]//Proc of the 58th ACM/IEEE Design Automation Conf (DAC). Piscataway, NJ: IEEE, 2021: 661−666
|
[16] |
Bian Haodong, Huang Jianqiang, Liu Lingbin, et al. ALBUS: A method for efficiently processing SpMV using SIMD and load balancing[J]. Future Generation Computer Systems, 2021, 116: 371−392 doi: 10.1016/j.future.2020.10.036
|
[17] |
Bian Haodong, Huang Jianqiang, Dong Runting, et al. CSR2: A new format for SIMD-accelerated SpMV[C]//Proc of the 20th IEEE/ACM Int Symp on Cluster, Cloud and Internet Computing (CCGRID). Piscataway, NJ: IEEE, 2020: 350−359
|
[18] |
Tang Waiteng, Zhao Ruizhe, Lu Mian, et al. Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on Intel Xeon Phi[C]//Proc of the 13th Annual IEEE/ACM Int Symp on Code Generation and Optimization. Piscataway, NJ: IEEE, 2015: 136−145
|
[19] |
Eberhardt R, Hoemmen M. Optimization of block sparse matrix-vector multiplication on shared-memory parallel architectures[C]//Proc of the 2016 IEEE Int Parallel and Distributed Processing Symp Workshops (IPDPSW). Piscataway, NJ: IEEE, 2016: 663−672
|
[20] |
Yan Shengen, Li Chao, Zhang Yunquan, et al. yaSpMV: Yet another SpMV framework on GPUs[J]. ACM SIGPLAN Notices, 2014, 49(8): 107−118 doi: 10.1145/2692916.2555255
|
[21] |
Merrill D, Garland M. Merge-based parallel sparse matrix-vector multiplication[C]//Proc of the 2016 Int Conf for High Performance Computing, Networking, Storage and Analysis. Piscataway, NJ: IEEE, 2016: 678−689
|
[22] |
Yesil S, Heidarshenas A, Morrison A, et al. Speeding up SpMV for power-law graph analytics by enhancing locality & vectorization[C/OL]//Proc of the 2020 Int Conf for High Performance Computing, Networking, Storage and Analysis. Piscataway, NJ: IEEE, 2020[2023-07-31]. https://ieeexplore.ieee.org/document/9355205
|
[23] |
Elafrou A, Goumas G, Koziris N. Conflict-free symmetric sparse matrix-vector multiplication on multicore architectures[C/OL]//Proc of the 2019 Int Conf for High Performance Computing, Networking, Storage and Analysis. New York: ACM, 2019[2023-07-31]. https://dl.acm.org/doi/10.1145/3295500.3356148
|
[24] |
Fei Xiang, Zhang Youhui. Regu2D: Accelerating vectorization of SpMV on Intel processors through 2D-partitioning and regular arrangement[C/OL]//Proc of the 50th Int Conf on Parallel Processing. New York: ACM, 2021[2023-07-31]. https://dl.acm.org/doi/10.1145/3472456.3472479
|
[25] |
Liu Changxi, Xie Biwei, Liu Xin, et al. Towards efficient SpMV on sunway manycore architectures[C]//Proc of the 2018 Int Conf on Supercomputing. New York: ACM, 2018: 363−373
|
[26] |
Buono D, Petrini F, Checconi F, et al. Optimizing sparse matrix-vector multiplication for large-scale data analytics[C/OL]//Proc of the 2016 Int Conf on Supercomputing. New York: ACM, 2016[2023-07-31]. https://dl.acm.org/doi/10.1145/2925426.2926278
|
[27] |
Umuroglu Y, Jahre M. An energy efficient column-major backend for FPGA SpMV accelerators[C]//Proc of the 32nd Int Conf on Computer Design (ICCD). Piscataway, NJ: IEEE, 2014: 432−439
|
[28] |
Liu Xing, Smelyanskiy M, Chow E, et al. Efficient sparse matrix-vector multiplication on x86-based many-core processors[C]//Proc of the 27th Int ACM Conf on Int Conf on Supercomputing. New York: ACM, 2013: 273−282
|
[29] |
Li Kenli, Yang Wangdong, Li Keqin. Performance analysis and optimization for SpMV on GPU using probabilistic modeling[J]. IEEE Transactions on Parallel and Distributed Systems, 2014, 26(1): 196−205
|
[30] |
Niu Yuyao, Lu Zhengyang, Dong Meichen, et al. TileSpMV: A tiled algorithm for sparse matrix-vector multiplication on GPUs[C]//Proc of the 2021 IEEE Int Parallel and Distributed Processing Symp (IPDPS). Piscataway, NJ: IEEE, 2021: 68−78
|
[31] |
Greathouse J L, Daga M. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format[C]//Proc of the 2014 Int Conf for High Performance Computing, Networking, Storage and Analysis. Piscataway, NJ: IEEE, 2014: 769−780
|
[32] |
Li Jiajia, Tan Guangming, Chen Mingyu, et al. SMAT: An input adaptive auto-tuner for sparse matrix-vector multiplication[C]//Proc of the 34th ACM SIGPLAN Conf on Programming Language Design and Implementation. New York: ACM, 2013: 117−126
|
[33] |
Zhao Yue, Li Jiajia, Liao Chunhua, et al. Bridging the gap between deep learning and sparse matrix format selection[C]//Proc of the 23rd ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2018: 94−108
|
[34] |
Feng Siying, Sun Jiawen, Pal S, et al. Cosparse: A software and hardware reconfigurable SpMV framework for graph analytics[C]//Proc of the 58th ACM/IEEE Design Automation Conf (DAC). Piscataway, NJ: IEEE, 2021: 949−954
|
[35] |
Tang Waiteng, Zhao Ruizhe, Lu Mian, et al. Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on Intel Xeon Phi[C]//Proc of the 13th Annual IEEE/ACM Int Symp on Code Generation and Optimization. New York: ACM, 2015: 136−145
|
[36] |
Zhao Yue, Zhou Weijie, Shen Xipeng, et al. Overhead-conscious format selection for SpMV-based applications[C]//Proc of the 2018 IEEE Int Parallel and Distributed Processing Symp (IPDPS). Piscataway, NJ: IEEE, 2018: 950−959
|
[37] |
Benatia A, Ji Weixing, Wang Yizhuo, et al. Sparse matrix format selection with multiclass SVM for SpMV on GPU[C]//Proc of the 45th Int Conf on Parallel Processing (ICPP). Piscataway, NJ: IEEE, 2016: 496−505
|
[38] |
谢震,谭光明,孙凝晖. 基于PPR模型的稀疏矩阵向量乘及卷积性能优化研究[J]. 计算机研究与发展,2021,58(3):445−457
Xie Zhen, Tan Guangming, Sun Ninghui. Research on optimal performance of sparse matrix-vector multiplication and convolution using the probability-process-RAM model[J]. Journal of Computer Research and Development, 2021, 58(3): 445−457 (in Chinese)
|
[39] |
Choi J W, Singh A, Vuduc R W. Model-driven autotuning of sparse matrix-vector multiply on GPUs[J]. ACM SIGPLAN Notices, 2010, 45(5): 115−126 doi: 10.1145/1837853.1693471
|
[40] |
Vuduc R, Demmel J W, Yelick K A. OSKI: A library of automatically tuned sparse matrix kernels[J]. Journal of Physics: Conference Series, 2005, 16(1): 521−530
|