Research on Efficient Sparse-Dense Matrix Multiplication Kernel for Dataflow Architecture

蒋文斌; 刘宝阳; 董雨康; 沈欣海

doi:10.7544/issn1000-1239.202550678


Abstract

In recent years, sparse matrix-dense matrix multiplication (SpMM) has become a core operator in scientific computing, graph processing, and deep learning, and also a bottleneck that limits system performance. However, existing SpMM implementations on general-purpose GPUs commonly suffer from low energy efficiency and low resource utilization, while specialized accelerators, although they show excellent energy efficiency on regular computations, lack efficient support for irregular sparse matrix operations. To address this problem, this paper proposes DFU-SpMM, an efficient sparse matrix processing kernel for dataflow architectures, built on the high-throughput many-core dataflow processor (Dataflow Unit, DFU). First, a hierarchical sparse matrix processing method based on dataflow instruction reuse is proposed: it reorganizes the sparse data to reduce the number of dataflow graph retransmissions in the SpMM kernel, and it processes sparse blocks with different characteristics using different strategies. Second, an adaptive assembly code generation method oriented to dataflow graph optimization is proposed: it introduces the inspector-executor idea to overcome the current limitations of the DFU in generating SpMM kernels, and it optimizes dataflow graph construction during code generation from two angles, task allocation and register reuse. Experiments on representative datasets show that, compared with the state-of-the-art DTC-SpMM and HP-SpMM on an RTX 4090 GPU, DFU-SpMM achieves on average 1.09× and 1.23× higher energy efficiency and 1.71× and 1.10× higher computational resource utilization, respectively. Compared with the SpMM kernel on the DFU accelerator before data reorganization, the kernel after data reorganization achieves a 1.41× performance improvement and a 1.51× energy efficiency improvement.
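The inspector-executor idea mentioned in the abstract can be illustrated with a minimal, CPU-only sketch. This is not the DFU-SpMM implementation: the CSR input layout, the fixed block height `block_rows`, the `dense_threshold` cutoff, and the two per-block strategies are all assumptions chosen for illustration. The inspector scans the sparse matrix once and records a strategy per row block; the executor then computes C = A × B block by block according to that plan.

```python
# Illustrative inspector-executor SpMM sketch (NOT the paper's DFU-SpMM kernel).
# Assumptions: A is stored in CSR form (indptr, indices, data); block height,
# density threshold, and the "dense"/"sparse" strategies are hypothetical.

def inspect(indptr, n_cols, block_rows=2, dense_threshold=0.5):
    """Inspector: split rows into fixed-height blocks and pick a strategy per block."""
    n_rows = len(indptr) - 1
    plan = []
    for start in range(0, n_rows, block_rows):
        end = min(start + block_rows, n_rows)
        nnz = indptr[end] - indptr[start]
        density = nnz / ((end - start) * n_cols)
        strategy = "dense" if density >= dense_threshold else "sparse"
        plan.append((start, end, strategy))
    return plan


def execute(plan, indptr, indices, data, B):
    """Executor: compute C = A @ B block by block, following the inspector's plan."""
    n_rows = len(indptr) - 1
    n_inner = len(B)       # number of columns of A == number of rows of B
    n_out = len(B[0])      # number of columns of B
    C = [[0.0] * n_out for _ in range(n_rows)]
    for start, end, strategy in plan:
        for i in range(start, end):
            if strategy == "dense":
                # Expand the row to a dense vector, then use a regular dot product.
                row = [0.0] * n_inner
                for p in range(indptr[i], indptr[i + 1]):
                    row[indices[p]] = data[p]
                for j in range(n_out):
                    C[i][j] = sum(row[k] * B[k][j] for k in range(n_inner))
            else:
                # Visit only the stored nonzeros of the row.
                for p in range(indptr[i], indptr[i + 1]):
                    k, a = indices[p], data[p]
                    for j in range(n_out):
                        C[i][j] += a * B[k][j]
    return C


# Example: a 4x4 sparse A (CSR) times a 4x2 dense B.
indptr = [0, 3, 5, 6, 6]
indices = [0, 1, 3, 1, 2, 3]
data = [1, 2, 3, 4, 5, 6]
B = [[1, 0], [0, 1], [1, 1], [2, 0]]

plan = inspect(indptr, n_cols=4)
C = execute(plan, indptr, indices, data, B)
```

In this toy example the first row block is dense enough to take the regular path and the second takes the nonzero-only path. Separating the one-time inspection from the repeated execution is the general point of the pattern; how DFU-SpMM actually partitions blocks and generates dataflow graphs is described in the paper, not here.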
