Citation: Xie Kunpeng, Yi Dezhi, Liu Yiqing, Liu Hang, He Xinyu, Gong Cheng, Lu Ye. SAF-CNN: A Sparse Acceleration Framework of Convolutional Neural Network for Embedded FPGAs[J]. Journal of Computer Research and Development, 2023, 60(5): 1053-1072. DOI: 10.7544/issn1000-1239.202220735

SAF-CNN: A Sparse Acceleration Framework of Convolutional Neural Network for Embedded FPGAs

Funds: This work was supported by the National Natural Science Foundation of China (62002175), the Open Project Fund of State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences) (CARCHB202016), the Special Funding for Excellent Enterprise Technology Correspondent of Tianjin (21YDTPJC00380), the Open Project Foundation of Information Security Evaluation Center of Civil Aviation University of China (ISECCA-202102), and the CCF-Huawei Populus Grove Fund (CCF-HuaweiTC2022005).
  • Author Bio:

    Xie Kunpeng: born in 1996. PhD candidate. Member of CCF. His main research interests include heterogeneous computing, machine learning, and embedded systems

    Yi Dezhi: born in 1999. Master candidate. Member of CCF. His main research interests include deep learning and heterogeneous computing

    Liu Yiqing: born in 1994. Master candidate. His main research interests include automatic DNN model pruning, model transformation tools, and FPGA deployment

    Liu Hang: born in 2000. Master candidate. His main research interests include machine learning and FPGA deployment

    He Xinyu: born in 1996. PhD candidate. Member of CCF. His main research interests include computer vision, machine learning, and model compression and acceleration

    Gong Cheng: born in 1993. PhD, lecturer. His main research interests include heterogeneous computing, machine learning, and the Internet of Things

    Lu Ye: born in 1986. PhD, associate professor. Senior member of CCF. His main research interests include high-performance embedded systems, heterogeneous computing, and artificial intelligence

  • Received Date: August 15, 2022
  • Revised Date: March 30, 2023
  • Available Online: April 09, 2023
  • Abstract: When deploying models on resource-constrained FPGAs, traditional convolutional neural network accelerators and inference frameworks face challenges such as diverse device types, extremely limited resources, underutilized data bandwidth, and complex operator types that make operator matching and computing-task scheduling difficult. In this paper, a sparse acceleration framework of convolutional neural network (SAF-CNN) for embedded FPGAs is proposed. Through software-hardware co-design, SAF-CNN is jointly optimized from the two perspectives of hardware accelerator design and software inference framework. SAF-CNN first constructs a parallel computing array and designs a parallel encoding and decoding scheme to realize single-cycle multi-data transmission and effectively reduce communication costs. Second, a fine-grained structured block-partitioning pruning algorithm is designed to obtain a sparse and regular weight matrix by pruning along the input-channel dimension within each block, significantly reducing the computation scale and DSP multiplier usage. Third, a dynamic input-channel expansion method and a runtime scheduling strategy compatible with depthwise separable convolution are proposed to realize flexible adaptation of input-channel parameters and resource reuse between point-wise and depth-wise convolutions. Finally, computational graph reconstruction and hardware operator fusion are employed to improve hardware execution efficiency. The experiments use two resource-limited low-end FPGA heterogeneous platforms, Intel CycloneV and Xilinx ZU3EG. The results show that the SAF-CNN accelerator achieves computational performance of 76.3 GOPS and 494.3 GOPS on the two platforms, respectively. Compared with a multi-core CPU, SAF-CNN achieves 3.5x and 2.2x performance improvements on the SSD_MobileNetV1 object detection model, with a model inference speed of up to 26.5 fps.
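
To make the pruning idea in the abstract concrete, below is a minimal, illustrative Python sketch of fine-grained structured block-partitioning pruning: input channels are partitioned into fixed-size blocks, and within each block only the channels with the largest weight magnitude survive, yielding a sparse yet regular weight matrix. The function name block_prune, the block size, the keep count, and the L1-norm saliency criterion are illustrative assumptions, not values or details taken from the paper.

```python
# Illustrative sketch (assumed details, not the paper's implementation):
# prune a conv weight tensor so that every input-channel block keeps the
# same number of channels, producing a regular sparse pattern.
import numpy as np

def block_prune(weights: np.ndarray, block_size: int = 8, keep: int = 2) -> np.ndarray:
    """Prune a conv weight tensor of shape (C_out, C_in, K, K).

    Input channels are partitioned into blocks of `block_size`; inside each
    block, only the `keep` channels with the largest L1 norm (per output
    channel) are retained. The result is sparse but regular: every block has
    exactly `keep` non-zero channels, which keeps hardware indexing simple.
    """
    c_out, c_in, kh, kw = weights.shape
    assert c_in % block_size == 0, "C_in must be a multiple of the block size"
    pruned = np.zeros_like(weights)
    for o in range(c_out):
        for b in range(0, c_in, block_size):
            block = weights[o, b:b + block_size]       # (block_size, K, K)
            scores = np.abs(block).sum(axis=(1, 2))    # L1 norm per channel
            top = np.argsort(scores)[-keep:]           # indices of channels to keep
            pruned[o, b + top] = block[top]
    return pruned

# Example: prune a 3x3 conv layer with 16 input and 4 output channels.
w = np.random.randn(4, 16, 3, 3).astype(np.float32)
w_sparse = block_prune(w, block_size=8, keep=2)
print("kept fraction:", np.count_nonzero(w_sparse) / w.size)  # 2/8 = 0.25
```

Because every block retains the same number of channels, the non-zero positions can be encoded with short per-block indices rather than arbitrary coordinates, which is what makes this kind of regular sparsity amenable to a fixed parallel compute array on an FPGA.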

