ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2019, Vol. 56 ›› Issue (2): 410-420.doi: 10.7544/issn1000-1239.2019.20170908

• 软件技术 • 上一篇    下一篇

双精度浮点矩阵乘协处理器研究

贾迅,邬贵明,谢向辉,吴东   

  1. (数学工程与先进计算国家重点实验室 江苏无锡 214125) (jia.xun@meac-skl.cn)
  • 出版日期: 2019-02-01
  • 基金资助: 
    国家自然科学基金项目(91430214,61732018)

A Coprocessor for Double-Precision Floating-Point Matrix Multiplication

Jia Xun, Wu Guiming, Xie Xianghui, Wu Dong   

  1. (State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125)
  • Online: 2019-02-01

摘要: 矩阵乘运算在多个应用领域特别是数值计算领域被广泛使用,但双精度浮点矩阵乘在CPU,GPGPU,FPGA等现有计算平台上的性能和效率受限,其往往成为大规模数值计算应用的性能瓶颈.针对该问题,以线性阵列计算结构为基础,研究了双精度浮点矩阵乘的定制加速.首先,对线性阵列计算结构进行了双缓冲优化并设计了针对双缓冲的存储访问调度,以提高结构的计算效率.其次,提出了矩阵乘协处理器和加速计算系统的结构,构建了协处理器的性能模型并对其结构设计空间进行了探索.最后,验证了协处理器的功能正确性并在某主流工艺下评估了其硬件开销.实验结果表明,设计的双精度浮点矩阵乘协处理器可以达到3 TFLOPS的计算性能和99%的计算效率.与NVIDIA K40 GPGPU相比,协处理器执行双精度浮点矩阵乘的性能是K40的1.95倍,而面积开销仅为K40的21.05%.探索了定制加速结构设计在高性能计算中的应用,对现有计算系统的性能提升具有一定的参考价值.

关键词: 矩阵乘, 协处理器, 加速, 浮点, 硬件定制

Abstract: Matrix multiplication has been widely used in various application fields, especially the field of numerical computation. However, double-precision floating-point matrix multiplication suffers from non-optimal performance or efficiency on contemporary computing platforms, including CPU, GPGPU and FPGA. To address this problem, acceleration of double-precision floating-point matrix multiplication with a customized coprocessor is proposed in this paper, which adopts linear array as the basic building block. Firstly, double-buffering technique and optimized memory scheduling are applied to the basic linear array for better computation efficiency. Then, architecture of the matrix multiplication coprocessor and coprocessor-based accelerated computing system are formulated. Furthermore, a performance model tailored for the coprocessor is developed and the design space of coprocessor is explored in detail. Finally, functional correctness of the coprocessor is verified and its hardware implementation cost under mainstream technology node is evaluated. Experimental results show that the proposed coprocessor can achieve the performance of 3 TFLOPS and the efficiency of 99%. Compared with NVIDIA K40 GPGPU for executing double-precision floating-point matrix multiplication, the coprocessor proposed in this paper achieves 1.95× performance with hardware overheads of only 21.05% in area. This work explores the application of customized acceleration in high-performance computing and has certain guidance for improving performance of existing computing systems.

Key words: matrix multiplication, coprocessor, acceleration, floating-point, hardware customization

中图分类号: