
面向低精度神经网络的数据流体系结构优化

范志华, 吴欣欣, 李文明, 曹华伟, 安学军, 叶笑春, 范东睿

范志华, 吴欣欣, 李文明, 曹华伟, 安学军, 叶笑春, 范东睿. 面向低精度神经网络的数据流体系结构优化[J]. 计算机研究与发展, 2023, 60(1): 43-58. DOI: 10.7544/issn1000-1239.202111275. CSTR: 32373.14.issn1000-1239.202111275
Fan Zhihua, Wu Xinxin, Li Wenming, Cao Huawei, An Xuejun, Ye Xiaochun, Fan Dongrui. Dataflow Architecture Optimization for Low-Precision Neural Networks[J]. Journal of Computer Research and Development, 2023, 60(1): 43-58. DOI: 10.7544/issn1000-1239.202111275. CSTR: 32373.14.issn1000-1239.202111275

面向低精度神经网络的数据流体系结构优化

基金项目: 中国科学院战略性先导科技专项(C类)(XDC05000000);国家自然科学基金项目(61732018,61872335);中国科学院国际伙伴计划项目(171111KYSB20200002);之江实验室开放项目(2022PB0AB01);中国科学院青年创新促进会
详细信息
    作者简介:

    范志华: 1996年生.博士研究生.CCF学生会员.主要研究方向为数据流体系结构及高通量计算架构

    吴欣欣: 1992年生.博士.CCF学生会员.主要研究方向为神经网络架构、数据流架构

    李文明: 1988年生.博士,副研究员,硕士生导师.CCF高级会员.主要研究方向为高通量计算架构及软件模拟技术

    曹华伟: 1989年生.博士,副研究员.CCF会员.主要研究方向为并行计算及高通量计算架构

    安学军: 1966年生.博士,正高级工程师,博士生导师.CCF会员.主要研究方向为计算机系统结构、高性能互联网络

    叶笑春: 1981年生.博士,研究员.CCF高级会员.主要研究方向为高通量计算架构及软件模拟技术

    范东睿: 1979年生.博士,研究员,博士生导师.CCF杰出会员.主要研究方向为高通量、高性能众核处理器微结构

    通讯作者:

    李文明(liwenming@ict.ac.cn)

  • 中图分类号: TP183

Dataflow Architecture Optimization for Low-Precision Neural Networks

Funds: This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences (XDC05000000), the National Natural Science Foundation of China (61732018, 61872335), the International Partnership Program of Chinese Academy of Sciences (171111KYSB20200002), the Open Project of Zhejiang Lab (2022PB0AB01), and the Youth Innovation Promotion Association of Chinese Academy of Sciences.
  • 摘要:

    数据流架构的执行方式与神经网络算法具有高度匹配性,能充分挖掘数据的并行性. 然而,随着神经网络向更低精度的发展,数据流架构的研究并未面向低精度神经网络展开,在传统数据流架构上部署低精度(INT8、INT4或者更低)神经网络时,会面临3个问题:1)传统数据流架构的计算部件数据通路与低精度数据不匹配,无法体现低精度神经网络的性能和能效优势;2)向量化并行计算的低精度数据在片上存储中要求顺序排列,而它在片外存储层次中是分散排列的,使得数据的加载和写回操作变得复杂,传统数据流架构的访存部件无法高效支持这种复杂的访存模式;3)传统数据流架构使用双缓冲机制掩盖数据的传输延迟,但在传输低精度数据时,传输带宽的利用率显著降低,导致计算延迟无法掩盖数据传输延迟,双缓冲机制面临失效风险,进而影响数据流架构的性能和能效. 为解决这3个问题,设计了面向低精度神经网络的数据流加速器DPU_Q. 首先,设计了灵活可重构的计算单元,根据指令的精度标志位动态重构数据通路,一方面能高效灵活地支持多种低精度数据运算,另一方面能进一步提高计算并行性和吞吐量. 另外,为应对低精度神经网络复杂的访存模式,设计了Scatter引擎,该引擎将在低层次或者片外存储中地址空间离散分布的低精度数据进行拼接、预处理,以满足高层次或者片上存储对数据排列的格式要求. 同时,Scatter引擎能有效解决传输低精度数据时带宽利用率低的问题,避免了双缓冲机制失效. 最后,从软件方面提出了基于数据流执行模式的低精度神经网络映射算法,兼顾负载均衡的同时能对权重、激活值数据进行充分复用,减少了访存和数据流图节点间的数据传输开销. 实验表明,相比于同精度的GPU(Titan Xp)、数据流架构(Eyeriss)和低精度神经网络加速器(BitFusion),DPU_Q分别获得3.18倍、6.05倍、1.52倍的性能提升和4.49倍、1.6倍、1.13倍的能效提升.
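    下面给出一个最小的软件类比示例(假设性实现,函数名与精度编码均为示意所设,并非DPU_Q的真实微结构或指令格式),用于说明"根据指令的精度标志位动态重构数据通路"的思路:同一个32 b打包操作数在INT8模式下被拆分为4个8 b子字、在INT4模式下被拆分为8个4 b子字,分别完成乘累加.

```c
/* 软件类比示例(仅作示意,非 DPU_Q 真实实现):
 * 按精度标志位把同一个 32 bit 打包操作数解释为不同数量的低精度子字并做乘累加. */
#include <stdint.h>

typedef enum { PREC_INT8, PREC_INT4 } prec_t;   /* 示意性的精度标志位 */

static int32_t packed_mac(uint32_t a, uint32_t b, prec_t prec, int32_t acc) {
    if (prec == PREC_INT8) {            /* 4 路 INT8 乘累加 */
        for (int i = 0; i < 4; i++) {
            int8_t ai = (int8_t)((a >> (8 * i)) & 0xFF);
            int8_t bi = (int8_t)((b >> (8 * i)) & 0xFF);
            acc += (int32_t)ai * (int32_t)bi;
        }
    } else {                            /* 8 路 INT4 乘累加 */
        for (int i = 0; i < 8; i++) {
            int32_t ai = (int32_t)((a >> (4 * i)) & 0xF);
            int32_t bi = (int32_t)((b >> (4 * i)) & 0xF);
            if (ai & 0x8) ai -= 16;     /* 4 bit 符号扩展 */
            if (bi & 0x8) bi -= 16;
            acc += ai * bi;
        }
    }
    return acc;
}
```

    精度越低,同一数据通路一次可完成的乘累加越多,这与摘要中"进一步提高计算并行性和吞吐量"的描述相对应.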

    Abstract:

    The execution model of dataflow architectures matches the execution of neural network algorithms well and can exploit abundant parallelism. However, as neural networks move toward lower precision, research on dataflow architectures has not yet targeted low-precision neural networks. When low-precision (INT8, INT4, or lower) neural networks are deployed on traditional dataflow architectures, three challenges arise: 1) The data path of the computing units in traditional dataflow architectures does not match low-precision data, so the performance and energy-efficiency advantages of low-precision neural networks cannot be realized. 2) Vectorized low-precision data must be arranged contiguously in the on-chip memory, but they are scattered across the off-chip memory hierarchy, which makes data loading and write-back operations more complicated; the memory access components of traditional dataflow architectures cannot support this complex access pattern efficiently. 3) Traditional dataflow architectures use a double-buffering mechanism to hide the transmission delay. When low-precision data are transmitted, however, bandwidth utilization drops significantly, so the computation delay can no longer hide the data transmission delay; the double-buffering mechanism risks failure, which degrades the performance and energy efficiency of the dataflow architecture. To solve these problems, we optimize the dataflow architecture and design a low-precision neural network accelerator named DPU_Q. First, a flexible and reconfigurable computing unit is designed, which dynamically reconfigures the data path according to the precision flag of the instruction. On the one hand, it can efficiently and flexibly support a variety of low-precision operations; on the other hand, the parallelism and throughput of the architecture are further improved. In addition, to handle the complex memory access patterns of low-precision data, we design a Scatter engine, which splices and preprocesses low-precision data that are discretely distributed in the off-chip/low-level memory hierarchy so that they meet the data layout requirements of the on-chip/high-level memory hierarchy. At the same time, the Scatter engine effectively solves the problem of reduced bandwidth utilization when transmitting low-precision data: the transmission delay does not increase significantly and can be fully hidden by the double-buffering mechanism. Finally, a low-precision neural network mapping algorithm based on the dataflow execution model is proposed, which maintains load balance while fully reusing weights and activations, reducing memory accesses and data transfers between dataflow graph nodes. Experiments show that DPU_Q achieves 3.18×, 6.05×, and 1.52× performance improvements and 4.49×, 1.6×, and 1.13× energy-efficiency improvements over a GPU (Titan Xp) at the same precision, a state-of-the-art dataflow architecture (Eyeriss), and a state-of-the-art low-precision neural network accelerator (BitFusion), respectively.
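    As an illustration of the data rearrangement performed by the Scatter engine, the following minimal software analogy (a hypothetical sketch; gather_pack_int4 is not a real DPU_Q interface) packs INT4 elements that are scattered in a lower memory level into the contiguous vector words required by the on-chip memory.

```c
/* Software analogy only (not the actual hardware): gather scattered INT4
 * elements and pack them into one contiguous 32-bit vector word. */
#include <stdint.h>
#include <stddef.h>

/* src: feature-map buffer, one element per byte (low 4 bits valid)
 * idx: indices of the 8 scattered elements needed by one vector operation */
static uint32_t gather_pack_int4(const uint8_t *src, const size_t idx[8]) {
    uint32_t word = 0;
    for (int i = 0; i < 8; i++) {
        word |= (uint32_t)(src[idx[i]] & 0xF) << (4 * i);   /* splice in order */
    }
    return word;
}
```

    Because whole packed words rather than individual INT4 elements are transferred, bandwidth utilization is preserved and the transmission delay can still be hidden by the double-buffering mechanism, as stated in the abstract.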


  • 图  1   神经网络模型准确率损失与数据位宽的关系

    Figure  1.   Relationship between the accuracy loss of the neural network model and the quantization of bit-width

    图  2   DPU架构

    Figure  2.   Overall architecture of DPU

    图  3   卷积运算的数据通路

    Figure  3.   The data path of convolution

    图  4   低精度数据的存储访问模式

    Figure  4.   Memory access patterns in low precision data

    图  5   AlexNet网络不同精度的计算和传输开销

    Figure  5.   Calculation and transmission overhead of different precisions in AlexNet

    图  6   DPU_Q总体架构

    Figure  6.   The overall architecture of DPU_Q

    图  7   DPU_Q中低精度卷积的数据通路

    Figure  7.   The data path of low-precision convolution in DPU_Q

    图  8   Scatter引擎结构图

    Figure  8.   The structure of the Scatter engine

    图  9   AlexNet和VGG16各层的数据传输和执行时间占比

    Figure  9.   Data transmission and execution time proportion of each layer of AlexNet and VGG16

    图  10   DPU_Q相对于DPU的加速比提升

    Figure  10.   Speedup promotion of DPU_Q over DPU

    图  11   DPU_Q相对于Eyeriss的加速比

    Figure  11.   Speedup of DPU_Q over Eyeriss

    图  12   DPU_Q,BitFusion,Eyeriss相对于GPU的性能对比

    Figure  12.   Performance comparison of DPU_Q,BitFusion,Eyeriss over GPU

    图  13   能效对比

    Figure  13.   Energy efficiency comparison

    图  14   DPU_Q的面积和功耗分布

    Figure  14.   Distribution of area and power consumption of DPU_Q

    表  1   代表性低精度神经网络加速器

    Table  1   Representative Low-Precision Neural Network Accelerators

    加速器 | 支持多精度灵活性 | 设计重点
    Eyeriss[7] | 较好 | 数据流
    AQSS[4] | 较差 | 计算
    OLAccel[5] | 较差 | 计算
    DRQ[3] | 较差 | 计算
    BitFusion[6] | 较差 | 计算
    DPU_Q | 较好 | 计算、访存

    表  2   代表性低精度神经网络

    Table  2   Representative Low-Precision Neural Networks

    模型 | 量化对象 | 精度/b
    Q_CNN[13] | 权值 | 4
    EIA[14] | 权值/激活值 | 8
    DFP[15] | 权值/激活值 | 4, 8
    LSQ[16] | 权值/激活值 | 4, 16
    QIL[17] | 权值/激活值 | 4
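    表2中各网络的量化方案细节不同,这里仅给出常见的均匀对称量化公式作为参考(示意性代码,并非上述文献的具体方案):q = clamp(round(w/s), −2^(b−1), 2^(b−1)−1),其中s为缩放因子,b为量化位宽.

```c
/* 常见的均匀对称量化(仅作参考,并非表 2 中各文献的具体量化方案):
 * 将 FP32 权值 w 量化为 b bit 整数. */
#include <math.h>
#include <stdint.h>

static int32_t quantize_uniform(float w, float scale, int bits) {
    int32_t qmax = (1 << (bits - 1)) - 1;    /* INT8 时为 127,INT4 时为 7 */
    int32_t qmin = -(1 << (bits - 1));       /* INT8 时为 -128,INT4 时为 -8 */
    int32_t q = (int32_t)lrintf(w / scale);  /* 四舍五入到最近整数 */
    if (q > qmax) q = qmax;
    if (q < qmin) q = qmin;
    return q;
}
```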

    表  3   DPU_Q的配置信息

    Table  3   Configuration Information of DPU_Q

    模块 | 配置信息
    微控制器 | ARM 核
    PE | 8×8,SIMD8,1 GHz,8 KB 指令缓存,32 KB 数据缓存
    片上网络 | 2维 mesh,1套访存网络,1套控制网络,1套PE间通信网络
    片外存储 | DDR3,1333 MHz
    SPM | 6 MB
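    若要在软件模拟器中描述表3的硬件配置,可以采用类似如下的结构体(仅为示意,字段名为本文假设,并非SimICT或DPU_Q模拟器的真实配置接口):

```c
/* 表 3 配置信息的一种示意性描述(字段名为假设,并非真实模拟器接口) */
typedef struct {
    int   pe_rows, pe_cols;      /* PE 阵列规模:8×8 */
    int   simd_width;            /* 每个 PE 的 SIMD 宽度:8 */
    float freq_ghz;              /* 工作频率:1 GHz */
    int   icache_kb, dcache_kb;  /* 每个 PE 的指令/数据缓存:8 KB / 32 KB */
    int   spm_mb;                /* 片上 SPM 容量:6 MB */
    int   ddr_mhz;               /* 片外 DDR3 频率:1333 MHz */
} dpu_q_config_t;

static const dpu_q_config_t DPU_Q_CFG = {
    .pe_rows = 8, .pe_cols = 8, .simd_width = 8, .freq_ghz = 1.0f,
    .icache_kb = 8, .dcache_kb = 32, .spm_mb = 6, .ddr_mhz = 1333,
};
```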

    表  4   测试程序信息

    Table  4   Benchmark Information

    CNN模型 | 卷积层 | 特征矩阵规模(H×W×C) | 卷积核规模(R×S×M)
    AlexNet | conv1 | 227×227×3 | 11×11×96
    AlexNet | conv2 | 31×31×96 | 5×5×256
    AlexNet | conv3 | 15×15×256 | 3×3×384
    AlexNet | conv4 | 15×15×384 | 3×3×384
    AlexNet | conv5 | 15×15×384 | 3×3×256
    VGG16 | conv1_1 | 224×224×3 | 3×3×64
    VGG16 | conv1_2 | 224×224×64 | 3×3×64
    VGG16 | conv2_1 | 112×112×64 | 3×3×128
    VGG16 | conv2_2 | 112×112×128 | 3×3×128
    VGG16 | conv3_1 | 56×56×128 | 3×3×256
    VGG16 | conv3_2 | 56×56×256 | 3×3×256
    VGG16 | conv3_3 | 56×56×256 | 3×3×256
    VGG16 | conv4_1 | 28×28×256 | 3×3×512
    VGG16 | conv4_2 | 28×28×512 | 3×3×512
    VGG16 | conv4_3 | 28×28×512 | 3×3×512
    VGG16 | conv5_1 | 14×14×512 | 3×3×512
    VGG16 | conv5_2 | 14×14×512 | 3×3×512
    VGG16 | conv5_3 | 14×14×512 | 3×3×512
    注:H、W、C分别表示特征矩阵的高、宽、通道;R、S、M分别表示卷积核的高、宽、通道.
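    根据表4中的H×W×C与R×S×M,可以粗略估算各卷积层的乘累加(MAC)次数. 下面的示例假设步长为1、same填充(输出空间尺寸与输入相同),以VGG16 conv1_1为例;AlexNet conv1实际步长为4,按此公式会高估其计算量,此处仅作说明.

```c
/* 按表 4 的规模参数粗略估算卷积层 MAC 次数(假设步长为 1、same 填充) */
#include <stdio.h>
#include <stdint.h>

static uint64_t conv_macs(int H, int W, int C, int R, int S, int M) {
    /* 每个输出点需要 R*S*C 次乘累加,共 H*W*M 个输出点 */
    return (uint64_t)H * W * M * R * S * C;
}

int main(void) {
    /* VGG16 conv1_1:特征矩阵 224×224×3,卷积核 3×3×64,约 8.67e7 次 MAC */
    printf("VGG16 conv1_1 MACs = %llu\n",
           (unsigned long long)conv_macs(224, 224, 3, 3, 3, 64));
    return 0;
}
```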
    [1] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks[C] //Proc of the 25th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2012: 1097−1105
    [2] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint, arXiv: 1409.1556, 2014
    [3] Song Zhuoran, Fu Bangqi, Wu Feiyang, et al. DRQ: Dynamic region-based quantization for deep neural network acceleration[C] //Proc of the 47th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2020: 1010−1021
    [4] Ueki T, Keisuke I, Matsubara T, et al. AQSS: Accelerator of quantization neural networks with stochastic approach[C] //Proc of the 6th Int Symp on Computing and Networking Workshops. Piscataway, NJ: IEEE, 2018: 138−144
    [5] Park E, Kim D, Yoo S. Energy-efficient neural network accelerator based on outlier-aware low-precision computation[C] //Proc of the 45th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018: 688−698
    [6] Sharma H, Park J, Suda N, et al. BitFusion: Bit-level dynamically composable architecture for accelerating deep neural networks[C] //Proc of the 45th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018: 764−775
    [7] Chen Yu-Hsin, Krishna T, Emer J, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2017, 52(1): 127−138 doi: 10.1109/JSSC.2016.2616357
    [8] Wu Xinxin, Fan Zhihua, Liu Tianyu, et al. LRP: Predictive output activation based on SVD approach for CNNs acceleration[C] //Proc of the 25th Design, Automation & Test in Europe. Piscataway, NJ: IEEE, 2022: 837−842
    [9] Courbariaux M, Bengio Y, David J. BinaryConnect: Training deep neural networks with binary weights during propagations[C] //Proc of the 28th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 3123−3131
    [10] Zhang Dongqing, Yang Jiaolong, Ye Dongqiangzi, et al. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks[C] //Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 365−382
    [11] Wang Naigang, Choi J, Brand D, et al. Training deep neural networks with 8-bit floating point numbers[C] //Proc of the 31st Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2018: 7685−7694
    [12] Chen Tianshi, Du Zidong, Sun Ninghui, et al. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning[C] //Proc of the 19th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2014: 269−284
    [13] Wu Jiaxiang, Leng Cong, Wang Yonghang, et al. Quantized convolutional neural networks for mobile devices[C] //Proc of the 29th Int Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4820−4828
    [14] Park E, Yoo S, Vajda P. Value-aware quantization for training and inference of neural networks[C] //Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 608−624
    [15] Deng Lei, Li Guoqi, Han Song, et al. Model compression and hardware acceleration for neural networks: A comprehensive survey[J]. Proceedings of the IEEE, 2020, 108(4): 485−532 doi: 10.1109/JPROC.2020.2976475
    [16] Jung S, Son C, Lee S, et al. Learning to quantize deep networks by optimizing quantization intervals with task loss[C] //Proc of the 31st Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 4350−4359
    [17] Szegedy C, Liu Wei, Jia Yangqing, et al. Going deeper with convolutions[C] //Proc of the 27th Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 1−9
    [18] Han Song, Mao Huizi, Dally W J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding[J]. arXiv preprint, arXiv: 1510.00149v2, 2015
    [19] Ye Xiaochun, Fan Dongrui, Sun Ninghui, et al. SimICT: A fast and flexible framework for performance and power evaluation of large-scale architecture[C] //Proc of the 18th Int Symp on Low Power Electronics and Design (ISLPED). Piscataway, NJ: IEEE, 2013: 273−278

出版历程
  • 收稿日期:  2021-12-23
  • 修回日期:  2022-06-06
  • 网络出版日期:  2023-02-10
  • 刊出日期:  2022-12-31
