    Citation: Xie Kunpeng, Yi Dezhi, Liu Yiqing, Liu Hang, He Xinyu, Gong Cheng, Lu Ye. SAF-CNN: A Sparse Acceleration Framework of Convolutional Neural Network for Embedded FPGAs[J]. Journal of Computer Research and Development, 2023, 60(5): 1053-1072. DOI: 10.7544/issn1000-1239.202220735

    SAF-CNN: A Sparse Acceleration Framework of Convolutional Neural Network for Embedded FPGAs


    Abstract: When deploying models on resource-constrained FPGAs, traditional convolutional neural network accelerators and inference frameworks often face challenges such as a wide variety of device types with extremely limited resources, underutilized data bandwidth, and complex operator types that are difficult to adapt and schedule reasonably. In this paper, a sparse acceleration framework of convolutional neural network (SAF-CNN) for embedded FPGAs is proposed. Through software-hardware co-design, SAF-CNN is jointly optimized from two perspectives: the hardware accelerator and the software inference framework. First, SAF-CNN constructs a parallel computing array and designs a parallel encoding and decoding scheme to realize single-cycle multi-data transmission, effectively reducing communication costs. Second, a fine-grained structured block-partitioning pruning algorithm is designed to obtain a sparse and regular weight matrix by pruning within each block along the input-channel dimension, significantly reducing the computation scale and the usage of resources such as DSP multipliers. Then, a dynamic input-channel expansion method and a runtime scheduling strategy compatible with depthwise separable convolution are proposed to realize flexible adaptation of input-channel parameters and resource reuse between depthwise and pointwise convolution. Finally, a computational graph reconstruction method and hardware operator fusion are applied to improve hardware execution efficiency. The experiments use two resource-limited low-end heterogeneous FPGA platforms, Intel Cyclone V and Xilinx ZU3EG; the results show that the SAF-CNN accelerator achieves computational performance of 76.3 GOPS and 494.3 GOPS, respectively. Compared with a multi-core CPU, SAF-CNN achieves 3.5x and 2.2x performance improvements on the SSD_MobileNetV1 object detection model, with an inference speed of up to 26.5 fps.
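
    To make the block-partitioning pruning idea concrete, the sketch below is a minimal illustration under stated assumptions, not the paper's reference implementation: it splits a convolution weight tensor into fixed-size blocks along the input-channel dimension and, within each block, keeps only the channels with the largest magnitude for each output filter. All names and parameters (block_partition_prune, block_size, keep_per_block) are illustrative assumptions.

        import numpy as np

        def block_partition_prune(weights, block_size=8, keep_per_block=2):
            """Hypothetical sketch of fine-grained structured block pruning.

            weights: (C_out, C_in, K, K) convolution kernel. The input-channel
            axis is split into blocks of block_size; within each block, only
            the keep_per_block channels with the largest L1 magnitude are kept
            per output filter, so every block has identical sparsity -- a
            regular pattern that maps onto a fixed array of DSP multipliers.
            """
            c_out, c_in, kh, kw = weights.shape
            assert c_in % block_size == 0, "pad C_in to a multiple of block_size"
            pruned = weights.copy()
            for f in range(c_out):
                for b in range(0, c_in, block_size):
                    block = weights[f, b:b + block_size]         # (block_size, K, K)
                    scores = np.abs(block).sum(axis=(1, 2))      # L1 norm per input channel
                    drop = np.argsort(scores)[:-keep_per_block]  # all but the top-k channels
                    pruned[f, b + drop] = 0.0
            return pruned

        # Usage: prune a 3x3 layer with 16 input channels to 2-of-8 block sparsity.
        w = np.random.randn(32, 16, 3, 3).astype(np.float32)
        w_sparse = block_partition_prune(w)
        print((w_sparse != 0).sum() / w_sparse.size)  # -> 0.25 density

    Because every block retains the same number of nonzero channels, their positions can be stored as short per-block indices, so a hardware decoder can fetch matching operands with fixed latency instead of traversing an irregular sparse format.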

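    The single-cycle multi-data transmission can likewise be illustrated with a hedged sketch. Assuming 8-bit quantized operands and a 32-bit bus (both assumptions; the abstract does not fix these widths), the encoder packs four values into one word so each bus beat carries four operands; pack_int8x4 and unpack_int8x4 are hypothetical names.

        import numpy as np

        def pack_int8x4(values):
            """Pack four int8 operands into one uint32 word, so a 32-bit bus
            delivers four values per transfer (illustrative encoder)."""
            v = np.asarray(values, dtype=np.int8).astype(np.uint8).reshape(-1, 4)
            return (v[:, 0].astype(np.uint32)
                    | (v[:, 1].astype(np.uint32) << 8)
                    | (v[:, 2].astype(np.uint32) << 16)
                    | (v[:, 3].astype(np.uint32) << 24))

        def unpack_int8x4(words):
            """Matching decoder; in hardware this is a parallel byte unpacker."""
            w = np.asarray(words, dtype=np.uint32)
            parts = np.stack([(w >> s) & 0xFF for s in (0, 8, 16, 24)], axis=1)
            return parts.astype(np.uint8).view(np.int8).reshape(-1)

        x = np.array([1, -2, 3, -4, 5, 6, 7, 8], dtype=np.int8)
        assert np.array_equal(unpack_int8x4(pack_int8x4(x)), x)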
       

