FAQ-CNN: A Flexible Acceleration Framework for Quantized Convolutional Neural Networks on Embedded FPGAs

谢坤鹏; 卢冶; 靳宗明; 刘义情; 龚成; 陈新伟; 李涛

doi:10.7544/issn1000-1239.20210142


Abstract: Quantization can effectively compress convolutional neural network (CNN) models and improve their computational efficiency. However, accelerator designs for quantized CNNs typically suffer from divergent quantization algorithms, poor reusability of code modules, inefficient data exchange, and underutilized resources. To address these problems, we propose FAQ-CNN, a flexible acceleration framework for quantized CNNs on embedded FPGAs that jointly optimizes computation, communication, and storage. FAQ-CNN is provided as a software tool that supports rapid deployment of quantized CNN models. First, we design a component for quantization algorithms that separates the arithmetic operations of a quantization algorithm from its value-mapping process, and we combine operator fusion, double buffering, and pipelining to improve parallel execution efficiency within CNN inference tasks. Second, we propose hierarchical and bitwidth-independent encoding rules together with a parallel decoding method, which support efficient batch transmission and parallel computation of low-bitwidth data. Finally, we build a resource allocation optimization model, transform it into an integer nonlinear programming problem, and apply a heuristic pruning strategy during solving to shrink the design space. Experimental results show that FAQ-CNN can implement a wide variety of quantized CNN accelerators efficiently and flexibly. With 16-bit activations and weights, the FAQ-CNN accelerator delivers 1.4 times the computing performance of Caffeine; with 8-bit activations and weights, FAQ-CNN achieves up to 1.23 TOPS.
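The abstract does not spell out the encoding rules, so the following is only a minimal sketch of the general idea behind batch transmission of low-bitwidth data: several narrow values are packed into one wide transfer word, and the receiver extracts every lane with a shift and a mask. The function names `pack_words` and `unpack_words`, the 32-bit word width, and the lowest-lane-first layout are assumptions made for illustration and are not taken from FAQ-CNN.

```python
def pack_words(values, bits, word_bits=32):
    """Pack unsigned `bits`-wide integers into `word_bits`-wide words.

    Hypothetical layout: value i of each group occupies lane i,
    with lane 0 in the least significant bits.
    """
    if word_bits % bits != 0:
        raise ValueError("word_bits must be a multiple of bits")
    lanes = word_bits // bits          # values carried per transfer word
    mask = (1 << bits) - 1             # largest value that fits in one lane
    words = []
    for start in range(0, len(values), lanes):
        word = 0
        for lane, value in enumerate(values[start:start + lanes]):
            if not 0 <= value <= mask:
                raise ValueError(f"{value} does not fit in {bits} bits")
            word |= value << (lane * bits)
        words.append(word)
    return words


def unpack_words(words, bits, count, word_bits=32):
    """Recover `count` values packed by `pack_words`.

    In hardware each lane could be extracted concurrently; the inner
    generator here is a sequential stand-in for that parallel decode.
    """
    lanes = word_bits // bits
    mask = (1 << bits) - 1
    out = []
    for word in words:
        out.extend((word >> (lane * bits)) & mask for lane in range(lanes))
    return out[:count]                 # drop padding lanes in the last word


if __name__ == "__main__":
    weights = [3, 0, 1, 2, 3, 1]       # six 2-bit values fit in one 32-bit word
    packed = pack_words(weights, bits=2)
    print(packed, unpack_words(packed, bits=2, count=len(weights)))
```

Because the same two functions work for any lane width that divides the word width, the sketch also hints at why a bitwidth-independent rule is attractive: the transfer path stays fixed while the quantization bitwidth changes.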
