
Dataflow Architecture Optimization for Low-Precision Neural Networks

Fan Zhihua, Wu Xinxin, Li Wenming, Cao Huawei, An Xuejun, Ye Xiaochun, Fan Dongrui

Citation: Fan Zhihua, Wu Xinxin, Li Wenming, Cao Huawei, An Xuejun, Ye Xiaochun, Fan Dongrui. Dataflow Architecture Optimization for Low-Precision Neural Networks[J]. Journal of Computer Research and Development, 2023, 60(1): 43-58. DOI: 10.7544/issn1000-1239.202111275. CSTR: 32373.14.issn1000-1239.202111275

Funds: This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences (Category C) (XDC05000000), the National Natural Science Foundation of China (61732018, 61872335), the International Partnership Program of Chinese Academy of Sciences (171111KYSB20200002), the Open Project of Zhejiang Lab (2022PB0AB01), and the Youth Innovation Promotion Association of Chinese Academy of Sciences.
Details
    About the authors:

    Fan Zhihua: born in 1996. PhD candidate. Student member of CCF. Main research interests include dataflow architecture and high-throughput computing architecture.

    Wu Xinxin: born in 1992. PhD. Student member of CCF. Main research interests include neural network architecture and dataflow architecture.

    Li Wenming: born in 1988. PhD, associate professor, master's supervisor. Senior member of CCF. Main research interests include high-throughput computing architecture and software simulation technology.

    Cao Huawei: born in 1989. PhD, associate professor. Member of CCF. Main research interests include parallel computing and high-throughput computing architecture.

    An Xuejun: born in 1966. PhD, professor-level senior engineer, PhD supervisor. Member of CCF. Main research interests include computer architecture and high-performance interconnection networks.

    Ye Xiaochun: born in 1981. PhD, professor. Senior member of CCF. Main research interests include high-throughput computing architecture and software simulation technology.

    Fan Dongrui: born in 1979. PhD, professor, PhD supervisor. Distinguished member of CCF. Main research interests include microarchitecture of high-throughput and high-performance many-core processors.

    Corresponding author:

    Li Wenming (liwenming@ict.ac.cn)

  • CLC number: TP183

  • Abstract:

    The execution model of dataflow architectures closely matches neural network algorithms and can fully exploit data parallelism. However, as neural networks move toward lower precision, research on dataflow architectures has not followed. Deploying low-precision (INT8, INT4, or lower) neural networks on a traditional dataflow architecture faces three problems. 1) The data path of the computing units in a traditional dataflow architecture does not match low-precision data, so the performance and energy-efficiency advantages of low-precision neural networks cannot be realized. 2) Vectorized low-precision data must be arranged contiguously in on-chip memory, but they lie scattered across the off-chip memory hierarchy, which complicates load and write-back operations; the memory-access components of a traditional dataflow architecture cannot support this complex access pattern efficiently. 3) Traditional dataflow architectures hide data-transfer latency with a double-buffering mechanism, but when low-precision data are transferred, bandwidth utilization drops sharply, computation latency can no longer cover transfer latency, and double buffering risks failure, degrading performance and energy efficiency. To solve these problems, we optimize the dataflow architecture and design DPU_Q, an accelerator for low-precision neural networks. First, we design a flexible, reconfigurable computing unit that dynamically reconfigures its data path according to the precision flag of each instruction; it supports a variety of low-precision operations efficiently and flexibly while further increasing parallelism and throughput. Second, to handle the complex memory-access patterns of low-precision data, we design a Scatter engine that splices and preprocesses low-precision data scattered across the off-chip or lower memory hierarchy to satisfy the data-layout requirements of the on-chip or higher memory hierarchy. The Scatter engine also resolves the low bandwidth utilization of low-precision transfers, so transfer latency does not grow significantly and can still be fully hidden by double buffering. Finally, on the software side, we propose a mapping algorithm for low-precision neural networks based on the dataflow execution model; it balances load while fully reusing weights and activations, reducing memory accesses and data transfers between dataflow-graph nodes. Experiments show that, compared with a GPU (Titan Xp), a dataflow architecture (Eyeriss), and a low-precision neural network accelerator (BitFusion) at the same precision, DPU_Q achieves 3.18×, 6.05×, and 1.52× performance improvements and 4.49×, 1.6×, and 1.13× energy-efficiency improvements, respectively.
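
    The abstract describes the reconfigurable computing unit only at this high level. As a minimal illustrative sketch (not the authors' design; the packing scheme and all names are our assumptions), the following C function shows how one 32-bit MAC lane could reinterpret its packed operand registers according to a per-instruction precision flag, executing four INT8, two INT16, or one INT32 multiply-accumulate per issue:

        #include <stdint.h>

        /* Hypothetical per-instruction precision flag. */
        typedef enum { PREC_INT32 = 0, PREC_INT16 = 1, PREC_INT8 = 2 } prec_t;

        /* Sketch: one 32-bit operand register holds 1/2/4 packed values
         * depending on the precision flag; the lane performs that many MACs
         * per issue, which is where the low-precision throughput gain
         * would come from. */
        static int64_t mac_lane(uint32_t a, uint32_t b, int64_t acc, prec_t p) {
            switch (p) {
            case PREC_INT8:                      /* 4 x INT8 sub-words  */
                for (int i = 0; i < 4; i++) {
                    int8_t ai = (int8_t)(a >> (8 * i));
                    int8_t bi = (int8_t)(b >> (8 * i));
                    acc += (int64_t)ai * bi;
                }
                return acc;
            case PREC_INT16:                     /* 2 x INT16 sub-words */
                for (int i = 0; i < 2; i++) {
                    int16_t ai = (int16_t)(a >> (16 * i));
                    int16_t bi = (int16_t)(b >> (16 * i));
                    acc += (int64_t)ai * bi;
                }
                return acc;
            default:                             /* 1 x INT32           */
                return acc + (int64_t)(int32_t)a * (int32_t)b;
            }
        }

    Under such a scheme, dropping from INT32 to INT8 quadruples the MACs per lane per issue, which matches the parallelism and throughput gain the abstract attributes to the reconfigurable data path.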

  • Figure 1.   Relationship between the accuracy loss of the neural network model and the quantization bit-width

    Figure 2.   Overall architecture of DPU

    Figure 3.   The data path of convolution

    Figure 4.   Memory access patterns of low-precision data

    Figure 5.   Calculation and transmission overhead of different precisions in AlexNet

    Figure 6.   The overall architecture of DPU_Q

    Figure 7.   The data path of low-precision convolution in DPU_Q

    Figure 8.   The structure of the Scatter engine
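
    The abstract describes the Scatter engine (Fig. 8) as splicing low-precision data scattered across the off-chip or lower memory hierarchy into the contiguous layout the on-chip memory requires. A minimal functional sketch of that gather-and-pack step follows; the interface is hypothetical, and a real engine would operate at DMA-burst granularity rather than in a byte loop:

        #include <stddef.h>
        #include <stdint.h>

        /* Sketch of the Scatter engine's gather/pack semantics: INT8 values
         * at scattered off-chip addresses are spliced into a contiguous
         * on-chip (SPM) buffer so the vector units can stream them.
         * 'offsets' and byte-granular packing are assumptions for clarity. */
        void scatter_engine_pack(const uint8_t *offchip, const size_t *offsets,
                                 uint8_t *spm, size_t n) {
            for (size_t i = 0; i < n; i++)
                spm[i] = offchip[offsets[i]];   /* gather scattered bytes   */
        }                                       /* into one dense SPM line  */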

    Figure 9.   Data transmission and execution time proportion of each layer of AlexNet and VGG16
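
    Figure 9's per-layer transfer/compute breakdown underlies the double-buffering argument: the mechanism hides transfer latency only while per-tile transfer time stays below compute time. As a generic sketch of the mechanism itself (dma_load, dma_wait, and compute_tile are hypothetical platform hooks, not DPU_Q's API):

        #include <stdint.h>

        #define TILE_BYTES 4096
        static uint8_t buf[2][TILE_BYTES];       /* two on-chip tile buffers */

        /* Hypothetical hooks: start an async copy of one tile, block until
         * it lands, and run the compute kernel on a resident tile. */
        void dma_load(uint8_t *dst, int tile);
        void dma_wait(void);
        void compute_tile(const uint8_t *tile);

        /* While tile t computes out of buf[cur], the DMA fills buf[1-cur]
         * with tile t+1; transfer latency stays hidden as long as per-tile
         * transfer time does not exceed compute time. */
        void run_tiles(int num_tiles) {
            int cur = 0;
            dma_load(buf[cur], 0);               /* prologue: fetch tile 0   */
            dma_wait();
            for (int t = 0; t < num_tiles; t++) {
                if (t + 1 < num_tiles)
                    dma_load(buf[1 - cur], t + 1);  /* prefetch next tile    */
                compute_tile(buf[cur]);             /* overlaps with the DMA */
                dma_wait();
                cur = 1 - cur;                      /* swap buffer roles     */
            }
        }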

    Figure 10.   Speedup improvement of DPU_Q over DPU

    Figure 11.   Speedup of DPU_Q over Eyeriss

    Figure 12.   Performance comparison of DPU_Q, BitFusion, and Eyeriss over GPU

    Figure 13.   Energy efficiency comparison

    Figure 14.   Distribution of area and power consumption of DPU_Q

    Table 1.   Representative Low-Precision Neural Network Accelerators

    Accelerator  | Flexibility of multi-precision support | Design focus
    Eyeriss[7]   | Good                                   | Dataflow
    AQSS[4]      | Poor                                   | Computation
    OLAccel[5]   | Poor                                   | Computation
    DRQ[3]       | Poor                                   | Computation
    BitFusion[6] | Poor                                   | Computation
    DPU_Q        | Good                                   | Computation and memory access

    Table 2.   Representative Low-Precision Neural Networks

    Model     | Quantization target | Precision/bits
    Q_CNN[13] | Weights             | 4
    EIA[14]   | Weights/activations | 8
    DFP[15]   | Weights/activations | 4, 8
    LSQ[16]   | Weights/activations | 4, 16
    QIL[17]   | Weights/activations | 4

    Table 3.   Configuration Information of DPU_Q

    Module           | Configuration
    Microcontroller  | ARM core
    PE array         | 8×8, SIMD8, 1 GHz, 8 KB instruction cache, 32 KB data cache
    On-chip network  | 2-D mesh; one memory-access network, one control network, one inter-PE communication network
    Off-chip memory  | DDR3, 1333 MHz
    SPM              | 6 MB
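
    For quick reference, Table 3's parameters can be captured in a single configuration record. This encoding is ours, purely illustrative, and not a structure from the authors' toolchain:

        /* Table 3 captured as a plain C struct; field names are illustrative. */
        typedef struct {
            const char *micro_controller;  /* "ARM core"                        */
            int pe_rows, pe_cols;          /* 8 x 8 PE array                    */
            int simd_width;                /* SIMD8                             */
            double freq_ghz;               /* 1 GHz                             */
            int icache_kb, dcache_kb;      /* 8 KB I-cache, 32 KB D-cache       */
            const char *noc;               /* 2-D mesh, 3 physical networks     */
            const char *dram;              /* DDR3, 1333 MHz                    */
            int spm_mb;                    /* 6 MB scratchpad                   */
        } dpu_q_config_t;

        static const dpu_q_config_t DPU_Q_CFG = {
            "ARM core", 8, 8, 8, 1.0, 8, 32,
            "2-D mesh: memory-access, control, inter-PE", "DDR3-1333", 6
        };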

    Table 4.   Benchmark Information

    CNN model | Conv layer | Feature map size (H×W×C) | Kernel size (R×S×M)
    AlexNet   | conv1      | 227×227×3                | 11×11×96
    AlexNet   | conv2      | 31×31×96                 | 5×5×256
    AlexNet   | conv3      | 15×15×256                | 3×3×384
    AlexNet   | conv4      | 15×15×384                | 3×3×384
    AlexNet   | conv5      | 15×15×384                | 3×3×256
    VGG16     | conv1_1    | 224×224×3                | 3×3×64
    VGG16     | conv1_2    | 224×224×64               | 3×3×64
    VGG16     | conv2_1    | 112×112×64               | 3×3×128
    VGG16     | conv2_2    | 112×112×128              | 3×3×128
    VGG16     | conv3_1    | 56×56×128                | 3×3×256
    VGG16     | conv3_2    | 56×56×256                | 3×3×256
    VGG16     | conv3_3    | 56×56×256                | 3×3×256
    VGG16     | conv4_1    | 28×28×256                | 3×3×512
    VGG16     | conv4_2    | 28×28×512                | 3×3×512
    VGG16     | conv4_3    | 28×28×512                | 3×3×512
    VGG16     | conv5_1    | 14×14×512                | 3×3×512
    VGG16     | conv5_2    | 14×14×512                | 3×3×512
    VGG16     | conv5_3    | 14×14×512                | 3×3×512
    Note: H, W, and C denote the height, width, and channels of the feature map; R, S, and M denote the height, width, and channels of the convolution kernels.
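
    The H×W×C and R×S×M notation above fixes each layer's multiply-accumulate count. A small sketch of that arithmetic follows; Table 4 omits stride and padding, so stride 1 with "same" padding is assumed here purely for illustration (AlexNet's conv1, for instance, actually uses a larger stride):

        #include <stdio.h>

        /* MAC count for a conv layer described as in Table 4: input
         * H x W x C, kernels R x S (spatial) with M output channels.
         * Assumes stride 1 and "same" padding for illustration only. */
        static long long conv_macs(int H, int W, int C, int R, int S, int M) {
            long long out_h = H, out_w = W;      /* same padding, stride 1 */
            return out_h * out_w * (long long)M * R * S * C;
        }

        int main(void) {
            /* VGG16 conv3_1 from Table 4: 56x56x128 input, 3x3x256 kernels */
            printf("conv3_1 MACs: %lld\n", conv_macs(56, 56, 128, 3, 3, 256));
            return 0;
        }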
  • [1] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks[C] //Proc of the 25th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2012: 1097−1105

    [2] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint, arXiv: 1409.1556, 2014

    [3] Song Zhuoran, Fu Bangqi, Wu Feiyang, et al. DRQ: Dynamic region-based quantization for deep neural network acceleration[C] //Proc of the 47th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2020: 1010−1021

    [4] Ueki T, Keisuke I, Matsubara T, et al. AQSS: Accelerator of quantization neural networks with stochastic approach[C] //Proc of the 6th Int Symp on Computing and Networking Workshops. Piscataway, NJ: IEEE, 2018: 138−144

    [5] Park E, Kim D, Yoo S. Energy-efficient neural network accelerator based on outlier-aware low-precision computation[C] //Proc of the 45th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018: 688−698

    [6] Sharma H, Park J, Suda N, et al. BitFusion: Bit-level dynamically composable architecture for accelerating deep neural network[C] //Proc of the 45th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018: 764−775

    [7] Chen Yu-Hsin, Krishna T, Emer J, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2017, 52(1): 127−138. DOI: 10.1109/JSSC.2016.2616357

    [8] Wu Xinxin, Fan Zhihua, Liu Tianyu, et al. LRP: Predictive output activation based on SVD approach for CNNs acceleration[C] //Proc of the 25th Design, Automation & Test in Europe. Piscataway, NJ: IEEE, 2022: 837−842

    [9] Courbariaux M, Bengio Y, David J. BinaryConnect: Training deep neural networks with binary weights during propagations[C] //Proc of the 28th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 3123−3131

    [10] Zhang Dongqing, Yang Jiaolong, Ye Dongqiangzi, et al. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks[C] //Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 365−382

    [11] Wang Naigang, Choi J, Brand D, et al. Training deep neural networks with 8-bit floating point numbers[C] //Proc of the 31st Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2018: 7685−7694

    [12] Chen Tianshi, Du Zidong, Sun Ninghui, et al. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning[C] //Proc of the 19th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2014: 269−284

    [13] Wu Jiaxiang, Leng Cong, Wang Yonghang, et al. Quantized convolutional neural networks for mobile devices[C] //Proc of the 29th Int Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4820−4828

    [14] Park E, Yoo S, Vajda P. Value-aware quantization for training and inference of neural networks[C] //Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 608−624

    [15] Deng Lei, Li Guoqi, Han Song, et al. Model compression and hardware acceleration for neural networks: A comprehensive survey[J]. Proceedings of the IEEE, 2020, 108(4): 485−532. DOI: 10.1109/JPROC.2020.2976475

    [16] Jung S, Son C, Lee S, et al. Learning to quantize deep networks by optimizing quantization intervals with task loss[C] //Proc of the 31st Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 4350−4359

    [17] Szegedy C, Liu Wei, Jia Yangqing, et al. Going deeper with convolutions[C] //Proc of the 27th Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 1−9

    [18] Han Song, Mao Huizi, Dally W J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding[J]. arXiv preprint, arXiv: 1510.00149v2, 2015

    [19] Ye Xiaochun, Fan Dongrui, Sun Ninghui, et al. SimICT: A fast and flexible framework for performance and power evaluation of large-scale architecture[C] //Proc of the 18th Int Symp on Low Power Electronics and Design (ISLPED). Piscataway, NJ: IEEE, 2013: 273−278


Publication history
  • Received: 2021-12-23
  • Revised: 2022-06-06
  • Available online: 2023-02-10
  • Issue date: 2022-12-31
