
面向低精度神经网络的数据流体系结构优化

范志华, 吴欣欣, 李文明, 曹华伟, 安学军, 叶笑春, 范东睿

范志华, 吴欣欣, 李文明, 曹华伟, 安学军, 叶笑春, 范东睿. 面向低精度神经网络的数据流体系结构优化[J]. 计算机研究与发展, 2023, 60(1): 43-58. DOI: 10.7544/issn1000-1239.202111275. CSTR: 32373.14.issn1000-1239.202111275
Fan Zhihua, Wu Xinxin, Li Wenming, Cao Huawei, An Xuejun, Ye Xiaochun, Fan Dongrui. Dataflow Architecture Optimization for Low-Precision Neural Networks[J]. Journal of Computer Research and Development, 2023, 60(1): 43-58. DOI: 10.7544/issn1000-1239.202111275. CSTR: 32373.14.issn1000-1239.202111275

面向低精度神经网络的数据流体系结构优化

基金项目: 中国科学院战略性先导科技专项(C类)(XDC05000000);国家自然科学基金项目(61732018,61872335);中国科学院国际伙伴计划项目(171111KYSB20200002);之江实验室开放项目(2022PB0AB01);中国科学院青年创新促进会
详细信息
    作者简介:

    范志华: 1996年生.博士研究生.CCF学生会员.主要研究方向为数据流体系结构及高通量计算架构

    吴欣欣: 1992年生.博士.CCF学生会员.主要研究方向为神经网络架构、数据流架构

    李文明: 1988年生.博士,副研究员,硕士生导师.CCF高级会员.主要研究方向为高通量计算架构及软件模拟技术

    曹华伟: 1989年生.博士,副研究员.CCF会员.主要研究方向为并行计算及高通量计算架构

    安学军: 1966年生.博士,正高级工程师,博士生导师.CCF会员.主要研究方向为计算机系统结构、高性能互联网络

    叶笑春: 1981年生.博士,研究员.CCF高级会员.主要研究方向为高通量计算架构及软件模拟技术

    范东睿: 1979年生.博士,研究员,博士生导师.CCF杰出会员.主要研究方向为高通量、高性能众核处理器微结构

    通讯作者:

    李文明(liwenming@ict.ac.cn)

  • 中图分类号: TP183

Dataflow Architecture Optimization for Low-Precision Neural Networks

Funds: This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences (XDC05000000), the National Natural Science Foundation of China (61732018, 61872335), the International Partnership Program of Chinese Academy of Sciences (171111KYSB20200002), the Open Project of Zhejiang Lab (2022PB0AB01), and the Youth Innovation Promotion Association of Chinese Academy of Sciences.
  • 摘要:

    数据流架构的执行方式与神经网络算法具有高度匹配性,能充分挖掘数据的并行性. 然而,随着神经网络向更低精度的发展,数据流架构的研究并未面向低精度神经网络展开,在传统数据流架构上部署低精度(INT8、INT4或者更低)神经网络时,会面临3个问题:1)传统数据流架构的计算部件数据通路与低精度数据不匹配,无法体现低精度神经网络的性能和能效优势;2)向量化并行计算的低精度数据在片上存储中要求顺序排列,而它在片外存储层次中是分散排列的,使得数据的加载和写回操作变得复杂,传统数据流架构的访存部件无法高效支持这种复杂的访存模式;3)传统数据流架构使用双缓冲机制掩盖数据的传输延迟,但在传输低精度数据时,传输带宽的利用率显著降低,导致计算延迟无法掩盖数据传输延迟,双缓冲机制面临失效风险,进而影响数据流架构的性能和能效. 为解决这3个问题,设计了面向低精度神经网络的数据流加速器DPU_Q. 首先,设计了灵活可重构的计算单元,根据指令的精度标志位动态重构数据通路,一方面能高效灵活地支持多种低精度数据运算,另一方面能进一步提高计算并行性和吞吐量. 另外,为应对低精度神经网络复杂的访存模式,设计了Scatter引擎,该引擎将在低层次或者片外存储中地址空间离散分布的低精度数据进行拼接、预处理,以满足高层次或者片上存储对数据排列的格式要求. 同时,Scatter引擎能有效解决传输低精度数据时带宽利用率低的问题,避免了双缓冲机制失效. 最后,从软件方面提出了基于数据流执行模式的低精度神经网络映射算法,兼顾负载均衡的同时能对权重、激活值数据进行充分复用,减少了访存和数据流图节点间的数据传输开销. 实验表明,相比于同精度的GPU(Titan Xp)、数据流架构(Eyeriss)和低精度神经网络加速器(BitFusion),DPU_Q分别获得3.18倍、6.05倍、1.52倍的性能提升和4.49倍、1.6倍、1.13倍的能效提升.
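    下面给出一个最小的软件类比示例(假设性实现,函数名与精度编码均为示意所设,并非DPU_Q的真实微结构或指令格式),用于说明"根据指令的精度标志位动态重构数据通路"的思路:同一个32 b打包操作数在INT8模式下被拆分为4个8 b子字、在INT4模式下被拆分为8个4 b子字,分别完成乘累加.

```c
/* 软件类比示例(仅作示意,非 DPU_Q 真实实现):
 * 按精度标志位把同一个 32 bit 打包操作数解释为不同数量的低精度子字并做乘累加. */
#include <stdint.h>

typedef enum { PREC_INT8, PREC_INT4 } prec_t;   /* 示意性的精度标志位 */

static int32_t packed_mac(uint32_t a, uint32_t b, prec_t prec, int32_t acc) {
    if (prec == PREC_INT8) {            /* 4 路 INT8 乘累加 */
        for (int i = 0; i < 4; i++) {
            int8_t ai = (int8_t)((a >> (8 * i)) & 0xFF);
            int8_t bi = (int8_t)((b >> (8 * i)) & 0xFF);
            acc += (int32_t)ai * (int32_t)bi;
        }
    } else {                            /* 8 路 INT4 乘累加 */
        for (int i = 0; i < 8; i++) {
            int32_t ai = (int32_t)((a >> (4 * i)) & 0xF);
            int32_t bi = (int32_t)((b >> (4 * i)) & 0xF);
            if (ai & 0x8) ai -= 16;     /* 4 bit 符号扩展 */
            if (bi & 0x8) bi -= 16;
            acc += ai * bi;
        }
    }
    return acc;
}
```

    精度越低,同一数据通路一次可完成的乘累加越多,这与摘要中"进一步提高计算并行性和吞吐量"的描述相对应.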

    Abstract:

    The execution model of dataflow architectures matches the execution of neural network algorithms well and can exploit abundant parallelism. However, as neural networks move toward lower precision, research on dataflow architectures has not yet targeted low-precision neural networks. When low-precision (INT8, INT4, or lower) neural networks are deployed on traditional dataflow architectures, three challenges arise: 1) The data path of the computing units in traditional dataflow architectures does not match low-precision data, so the performance and energy-efficiency advantages of low-precision neural networks cannot be realized. 2) Vectorized low-precision data must be arranged contiguously in the on-chip memory, but they are scattered across the off-chip memory hierarchy, which makes data loading and write-back operations more complicated; the memory access components of traditional dataflow architectures cannot support this complex access pattern efficiently. 3) Traditional dataflow architectures use a double-buffering mechanism to hide the transmission delay. When low-precision data are transmitted, however, bandwidth utilization drops significantly, so the computation delay can no longer hide the data transmission delay; the double-buffering mechanism risks failure, which degrades the performance and energy efficiency of the dataflow architecture. To solve these problems, we optimize the dataflow architecture and design a low-precision neural network accelerator named DPU_Q. First, a flexible and reconfigurable computing unit is designed, which dynamically reconfigures the data path according to the precision flag of the instruction. On the one hand, it can efficiently and flexibly support a variety of low-precision operations; on the other hand, the parallelism and throughput of the architecture are further improved. In addition, to handle the complex memory access patterns of low-precision data, we design a Scatter engine, which splices and preprocesses low-precision data that are discretely distributed in the off-chip/low-level memory hierarchy so that they meet the data layout requirements of the on-chip/high-level memory hierarchy. At the same time, the Scatter engine effectively solves the problem of reduced bandwidth utilization when transmitting low-precision data: the transmission delay does not increase significantly and can be fully hidden by the double-buffering mechanism. Finally, a low-precision neural network mapping algorithm based on the dataflow execution model is proposed, which maintains load balance while fully reusing weights and activations, reducing memory accesses and data transfers between dataflow graph nodes. Experiments show that DPU_Q achieves 3.18×, 6.05×, and 1.52× performance improvements and 4.49×, 1.6×, and 1.13× energy-efficiency improvements over a GPU (Titan Xp) at the same precision, a state-of-the-art dataflow architecture (Eyeriss), and a state-of-the-art low-precision neural network accelerator (BitFusion), respectively.
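    As an illustration of the data rearrangement performed by the Scatter engine, the following minimal software analogy (a hypothetical sketch; gather_pack_int4 is not a real DPU_Q interface) packs INT4 elements that are scattered in a lower memory level into the contiguous vector words required by the on-chip memory.

```c
/* Software analogy only (not the actual hardware): gather scattered INT4
 * elements and pack them into one contiguous 32-bit vector word. */
#include <stdint.h>
#include <stddef.h>

/* src: feature-map buffer, one element per byte (low 4 bits valid)
 * idx: indices of the 8 scattered elements needed by one vector operation */
static uint32_t gather_pack_int4(const uint8_t *src, const size_t idx[8]) {
    uint32_t word = 0;
    for (int i = 0; i < 8; i++) {
        word |= (uint32_t)(src[idx[i]] & 0xF) << (4 * i);   /* splice in order */
    }
    return word;
}
```

    Because whole packed words rather than individual INT4 elements are transferred, bandwidth utilization is preserved and the transmission delay can still be hidden by the double-buffering mechanism, as stated in the abstract.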


  • 图  1   神经网络模型准确率损失与数据位宽的关系

    Figure  1.   Relationship between the accuracy loss of the neural network model and the quantization of bit-width

    图  2   DPU架构

    Figure  2.   Overall architecture of DPU

    图  3   卷积运算的数据通路

    Figure  3.   The data path of convolution

    图  4   低精度数据的存储访问模式

    Figure  4.   Memory access patterns in low precision data

    图  5   AlexNet网络不同精度的计算和传输开销

    Figure  5.   Calculation and transmission overhead of different precisions in AlexNet

    图  6   DPU_Q总体架构

    Figure  6.   The overall architecture of DPU_Q

    图  7   DPU_Q中低精度卷积的数据通路

    Figure  7.   The data path of low-precision convolution in DPU_Q

    图  8   Scatter引擎结构图

    Figure  8.   The structure of the Scatter engine

    图  9   AlexNet和VGG16各层的数据传输和执行时间占比

    Figure  9.   Data transmission and execution time proportion of each layer of AlexNet and VGG16

    图  10   DPU_Q相对于DPU的加速比提升

    Figure  10.   Speedup promotion of DPU_Q over DPU

    图  11   DPU_Q相对于Eyeriss的加速比

    Figure  11.   Speedup of DPU_Q over Eyeriss

    图  12   DPU_Q,BitFusion,Eyeriss相对于GPU的性能对比

    Figure  12.   Performance comparison of DPU_Q,BitFusion,Eyeriss over GPU

    图  13   能效对比

    Figure  13.   Energy efficiency comparison

    图  14   DPU_Q的面积和功耗分布

    Figure  14.   Distribution of area and power consumption of DPU_Q

    表  1   代表性低精度神经网络加速器

    Table  1   Representative Low-Precision Neural Network Accelerators

    加速器 | 支持多精度灵活性 | 设计重点
    Eyeriss[7] | 较好 | 数据流
    AQSS[4] | 较差 | 计算
    OLAccel[5] | 较差 | 计算
    DRQ[3] | 较差 | 计算
    BitFusion[6] | 较差 | 计算
    DPU_Q | 较好 | 计算、访存

    表  2   代表性低精度神经网络

    Table  2   Representative Low-Precision Neural Networks

    模型 | 量化对象 | 精度/b
    Q_CNN[13] | 权值 | 4
    EIA[14] | 权值/激活值 | 8
    DFP[15] | 权值/激活值 | 4, 8
    LSQ[16] | 权值/激活值 | 4, 16
    QIL[17] | 权值/激活值 | 4
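    表2中各网络的量化方案细节不同,这里仅给出常见的均匀对称量化公式作为参考(示意性代码,并非上述文献的具体方案):q = clamp(round(w/s), −2^(b−1), 2^(b−1)−1),其中s为缩放因子,b为量化位宽.

```c
/* 常见的均匀对称量化(仅作参考,并非表 2 中各文献的具体量化方案):
 * 将 FP32 权值 w 量化为 b bit 整数. */
#include <math.h>
#include <stdint.h>

static int32_t quantize_uniform(float w, float scale, int bits) {
    int32_t qmax = (1 << (bits - 1)) - 1;    /* INT8 时为 127,INT4 时为 7 */
    int32_t qmin = -(1 << (bits - 1));       /* INT8 时为 -128,INT4 时为 -8 */
    int32_t q = (int32_t)lrintf(w / scale);  /* 四舍五入到最近整数 */
    if (q > qmax) q = qmax;
    if (q < qmin) q = qmin;
    return q;
}
```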

    表  3   DPU_Q的配置信息

    Table  3   Configuration Information of DPU_Q

    模块 | 配置信息
    微控制器 | ARM 核
    PE | 8×8,SIMD8,1 GHz,8 KB 指令缓存,32 KB 数据缓存
    片上网络 | 2维 mesh,1套访存网络,1套控制网络,1套PE间通信网络
    片外存储 | DDR3,1333 MHz
    SPM | 6 MB
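    若要在软件模拟器中描述表3的硬件配置,可以采用类似如下的结构体(仅为示意,字段名为本文假设,并非SimICT或DPU_Q模拟器的真实配置接口):

```c
/* 表 3 配置信息的一种示意性描述(字段名为假设,并非真实模拟器接口) */
typedef struct {
    int   pe_rows, pe_cols;      /* PE 阵列规模:8×8 */
    int   simd_width;            /* 每个 PE 的 SIMD 宽度:8 */
    float freq_ghz;              /* 工作频率:1 GHz */
    int   icache_kb, dcache_kb;  /* 每个 PE 的指令/数据缓存:8 KB / 32 KB */
    int   spm_mb;                /* 片上 SPM 容量:6 MB */
    int   ddr_mhz;               /* 片外 DDR3 频率:1333 MHz */
} dpu_q_config_t;

static const dpu_q_config_t DPU_Q_CFG = {
    .pe_rows = 8, .pe_cols = 8, .simd_width = 8, .freq_ghz = 1.0f,
    .icache_kb = 8, .dcache_kb = 32, .spm_mb = 6, .ddr_mhz = 1333,
};
```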

    表  4   测试程序信息

    Table  4   Benchmark Information

    CNN模型 | 卷积层 | 特征矩阵规模(H×W×C) | 卷积核规模(R×S×M)
    AlexNet | conv1 | 227×227×3 | 11×11×96
    AlexNet | conv2 | 31×31×96 | 5×5×256
    AlexNet | conv3 | 15×15×256 | 3×3×384
    AlexNet | conv4 | 15×15×384 | 3×3×384
    AlexNet | conv5 | 15×15×384 | 3×3×256
    VGG16 | conv1_1 | 224×224×3 | 3×3×64
    VGG16 | conv1_2 | 224×224×64 | 3×3×64
    VGG16 | conv2_1 | 112×112×64 | 3×3×128
    VGG16 | conv2_2 | 112×112×128 | 3×3×128
    VGG16 | conv3_1 | 56×56×128 | 3×3×256
    VGG16 | conv3_2 | 56×56×256 | 3×3×256
    VGG16 | conv3_3 | 56×56×256 | 3×3×256
    VGG16 | conv4_1 | 28×28×256 | 3×3×512
    VGG16 | conv4_2 | 28×28×512 | 3×3×512
    VGG16 | conv4_3 | 28×28×512 | 3×3×512
    VGG16 | conv5_1 | 14×14×512 | 3×3×512
    VGG16 | conv5_2 | 14×14×512 | 3×3×512
    VGG16 | conv5_3 | 14×14×512 | 3×3×512
    注:H、W、C分别表示特征矩阵的高、宽、通道;R、S、M分别表示卷积核的高、宽、通道.
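    根据表4中的H×W×C与R×S×M,可以粗略估算各卷积层的乘累加(MAC)次数. 下面的示例假设步长为1、same填充(输出空间尺寸与输入相同),以VGG16 conv1_1为例;AlexNet conv1实际步长为4,按此公式会高估其计算量,此处仅作说明.

```c
/* 按表 4 的规模参数粗略估算卷积层 MAC 次数(假设步长为 1、same 填充) */
#include <stdio.h>
#include <stdint.h>

static uint64_t conv_macs(int H, int W, int C, int R, int S, int M) {
    /* 每个输出点需要 R*S*C 次乘累加,共 H*W*M 个输出点 */
    return (uint64_t)H * W * M * R * S * C;
}

int main(void) {
    /* VGG16 conv1_1:特征矩阵 224×224×3,卷积核 3×3×64,约 8.67e7 次 MAC */
    printf("VGG16 conv1_1 MACs = %llu\n",
           (unsigned long long)conv_macs(224, 224, 3, 3, 3, 64));
    return 0;
}
```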
    [1] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks[C] //Proc of the 25th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2012: 1097−1105
    [2] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint, arXiv: 1409.1556, 2014
    [3] Song Zhuoran, Fu Bangqi, Wu Feiyang, et al. DRQ: Dynamic region-based quantization for deep neural network acceleration[C] //Proc of the 47th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2020: 1010−1021
    [4] Ueki T, Keisuke I, Matsubara T, et al. AQSS: Accelerator of quantization neural networks with stochastic approach[C] //Proc of the 6th Int Symp on Computing and Networking Workshops. Piscataway, NJ: IEEE, 2018: 138−144
    [5] Park E, Kim D, Yoo S. Energy-efficient neural network accelerator based on outlier-aware low-precision computation[C] //Proc of the 45th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018: 688−698
    [6] Sharma H, Park J, Suda N, et al. BitFusion: Bit-level dynamically composable architecture for accelerating deep neural networks[C] //Proc of the 45th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018: 764−775
    [7] Chen Yu-Hsin, Krishna T, Emer J, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2017, 52(1): 127−138 doi: 10.1109/JSSC.2016.2616357
    [8] Wu Xinxin, Fan Zhihua, Liu Tianyu, et al. LRP: Predictive output activation based on SVD approach for CNNs acceleration[C] //Proc of the 25th Design, Automation & Test in Europe. Piscataway, NJ: IEEE, 2022: 837−842
    [9] Courbariaux M, Bengio Y, David J. BinaryConnect: Training deep neural networks with binary weights during propagations[C] //Proc of the 28th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 3123−3131
    [10] Zhang Dongqing, Yang Jiaolong, Ye Dongqiangzi, et al. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks[C] //Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 365−382
    [11] Wang Naigang, Choi J, Brand D, et al. Training deep neural networks with 8-bit floating point numbers[C] //Proc of the 31st Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2018: 7685−7694
    [12] Chen Tianshi, Du Zidong, Sun Ninghui, et al. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning[C] //Proc of the 19th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2014: 269−284
    [13] Wu Jiaxiang, Leng Cong, Wang Yonghang, et al. Quantized convolutional neural networks for mobile devices[C] //Proc of the 29th Int Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4820−4828
    [14] Park E, Yoo S, Vajda P. Value-aware quantization for training and inference of neural networks[C] //Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 608−624
    [15] Deng Lei, Li Guoqi, Han Song, et al. Model compression and hardware acceleration for neural networks: A comprehensive survey[J]. Proceedings of the IEEE, 2020, 108(4): 485−532 doi: 10.1109/JPROC.2020.2976475
    [16] Jung S, Son C, Lee S, et al. Learning to quantize deep networks by optimizing quantization intervals with task loss[C] //Proc of the 31st Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 4350−4359
    [17] Szegedy C, Liu Wei, Jia Yangqing, et al. Going deeper with convolutions[C] //Proc of the 27th Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 1−9
    [18] Han Song, Mao Huizi, Dally W J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding[J]. arXiv preprint, arXiv: 1510.00149v2, 2015
    [19] Ye Xiaochun, Fan Dongrui, Sun Ninghui, et al. SimICT: A fast and flexible framework for performance and power evaluation of large-scale architecture[C] //Proc of the 18th Int Symp on Low Power Electronics and Design (ISLPED). Piscataway, NJ: IEEE, 2013: 273−278

出版历程
  • 收稿日期:  2021-12-23
  • 修回日期:  2022-06-06
  • 网络出版日期:  2023-02-10
  • 刊出日期:  2022-12-31
