Abstract:
In recent years, sparse-dense matrix multiplication (SpMM) has become a core operator in fields such as scientific computing, graph processing, and deep learning, and it is often the bottleneck that limits system performance. However, existing GPU implementations of SpMM generally suffer from low energy efficiency and low resource utilization. Specialized accelerators deliver excellent energy efficiency on regular computations but lack efficient support for irregular sparse matrix operations. To this end, this paper proposes DFU-SpMM, an efficient sparse matrix processing kernel tailored for dataflow architectures and targeting the high-throughput Dataflow Unit (DFU). First, a hierarchical sparse matrix processing method is proposed that gives sparse blocks a regular computation pattern, reducing the number of dataflow graph retransmissions incurred by the SpMM kernel on the DFU; sparse blocks with different characteristics are processed with different strategies. Second, an adaptive assembly code generation method is proposed to optimize the dataflow graph. It leverages the inspector-executor framework to overcome the DFU's current limitations in generating SpMM kernels, and further optimizes the dataflow graph during code generation with respect to task allocation and register reuse. Experiments on representative datasets demonstrate that, compared with the state-of-the-art implementations DTC-SpMM and HP-SpMM on an RTX 4090 GPU, the proposed method achieves on average 1.09× and 1.23× higher energy efficiency, along with 1.71× and 1.10× better computational resource utilization, respectively. Compared with the baseline SpMM kernel on the DFU accelerator without data reorganization, the optimized kernel with data reorganization achieves a 1.41× speedup and a 1.51× energy efficiency improvement.
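For readers unfamiliar with the operator, the following is a minimal illustrative sketch of SpMM over a CSR matrix, together with a toy inspector pass that splits the rows into blocks and tags each block by density so that an executor could choose a different strategy for each. It is not the DFU-SpMM kernel: the CSR layout, the function names `spmm_csr` and `inspect_row_blocks`, the fixed block height, and the density threshold are all assumptions made here for illustration only.

```python
def spmm_csr(indptr, indices, data, B):
    """Reference SpMM: C = A @ B, with A in CSR form and B a dense list of rows."""
    n_rows = len(indptr) - 1
    n_cols = len(B[0]) if B else 0
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Walk the nonzeros of row i; each one scales a row of B.
        for p in range(indptr[i], indptr[i + 1]):
            k, a = indices[p], data[p]
            for j in range(n_cols):
                C[i][j] += a * B[k][j]
    return C


def inspect_row_blocks(indptr, n_cols_a, block_rows, dense_threshold):
    """Toy inspector (hypothetical): tag fixed-height row blocks by density."""
    n_rows = len(indptr) - 1
    plan = []
    for start in range(0, n_rows, block_rows):
        end = min(start + block_rows, n_rows)
        nnz = indptr[end] - indptr[start]
        density = nnz / ((end - start) * n_cols_a)
        tag = "dense-like" if density >= dense_threshold else "sparse"
        plan.append((start, end, tag))
    return plan


# Example 4x4 sparse matrix A in CSR form (illustrative data only):
# [1 0 0 2]
# [0 0 3 0]
# [4 5 6 7]
# [8 9 1 2]
indptr = [0, 2, 3, 7, 11]
indices = [0, 3, 2, 0, 1, 2, 3, 0, 1, 2, 3]
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2]
B = [[1, 0], [0, 1], [1, 1], [2, 0]]

C = spmm_csr(indptr, indices, data, B)
plan = inspect_row_blocks(indptr, n_cols_a=4, block_rows=2, dense_threshold=0.75)
```

In this sketch the inspector runs once per matrix and only reads the row pointer array, so its cost is small relative to the multiplication itself. An executor would then dispatch each tagged block to a kernel variant suited to its pattern.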