Dataflow Architecture Optimization for Low-Precision Neural Networks

Fan Zhihua; Wu Xinxin; Li Wenming; Cao Huawei; An Xuejun; Ye Xiaochun; Fan Dongrui

doi:10.7544/issn1000-1239.202111275

Dataflow Architecture Optimization for Low-Precision Neural Networks

Abstract: The execution model of dataflow architectures closely matches the computation pattern of neural network algorithms and can fully exploit data parallelism. However, as neural networks move toward lower precision, research on dataflow architectures has not kept pace. Deploying low-precision (INT8, INT4, or lower) neural networks on a traditional dataflow architecture raises three problems: 1) The data path of the computing units does not match low-precision data, so the performance and energy-efficiency advantages of low-precision neural networks are not realized. 2) Low-precision data used in vectorized parallel computation must be arranged contiguously in on-chip memory, but they are scattered in the off-chip memory hierarchy, which complicates data loading and write-back; the memory access components of traditional dataflow architectures cannot support this access pattern efficiently. 3) Traditional dataflow architectures use double buffering to hide data transfer latency, but when low-precision data are transferred, the utilization of the transfer bandwidth drops significantly, so computation latency can no longer cover transfer latency; double buffering risks becoming ineffective, which degrades the performance and energy efficiency of the architecture. To address these three problems, we design DPU_Q, a dataflow accelerator for low-precision neural networks. First, we design a flexible, reconfigurable computing unit that dynamically reconfigures its data path according to the precision flag of each instruction; this supports a variety of low-precision operations efficiently and further increases computational parallelism and throughput. Second, to handle the complex memory access pattern of low-precision neural networks, we design the Scatter engine, which splices and preprocesses low-precision data that are discretely distributed in the address space of the low-level or off-chip memory so that they meet the data layout requirements of the high-level or on-chip memory. The Scatter engine also addresses the low bandwidth utilization seen when transferring low-precision data, so transfer latency does not increase significantly and can again be fully hidden by double buffering. Finally, on the software side, we propose a low-precision neural network mapping algorithm based on the dataflow execution model; it balances the load while fully reusing weights and activation values, reducing the overhead of memory accesses and of data transfers between nodes of the dataflow graph. Experiments show that, compared with a GPU at the same precision (Titan Xp), a state-of-the-art dataflow architecture (Eyeriss), and a state-of-the-art low-precision neural network accelerator (BitFusion), DPU_Q achieves performance improvements of 3.18×, 6.05×, and 1.52× and energy-efficiency improvements of 4.49×, 1.6×, and 1.13×, respectively.
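The bandwidth argument in the abstract can be illustrated with a small software sketch. It is not the DPU_Q Scatter engine: the 64-bit bus width, the function names, and the two-values-per-byte INT4 layout are all assumptions chosen for illustration. The sketch gathers INT4 values from scattered addresses, packs them contiguously, and compares how many bus transfers the packed and unpacked forms would need.

```python
import math

BUS_BITS = 64  # hypothetical bus width; not a figure from the paper


def pack_int4(values):
    """Pack signed INT4 values (-8..7) two per byte, low nibble first."""
    vals = list(values)
    if len(vals) % 2:
        vals.append(0)  # pad to an even count
    out = bytearray()
    for lo, hi in zip(vals[0::2], vals[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in INT4")
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed INT4 values."""
    vals = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            vals.append(nibble - 16 if nibble >= 8 else nibble)
    return vals[:count]


def gather_and_pack(memory, addresses):
    """Collect values at scattered addresses and pack them contiguously."""
    return pack_int4(memory[a] for a in addresses)


def transfer_beats(n_elems, elem_bits, packed):
    """Bus transfers needed: one element per beat, or a full beat of elements."""
    per_beat = BUS_BITS // elem_bits if packed else 1
    return math.ceil(n_elems / per_beat)


if __name__ == "__main__":
    memory = [0] * 32
    for addr, val in zip((3, 10, 17, 24), (1, -2, 7, -8)):
        memory[addr] = val
    packed = gather_and_pack(memory, [3, 10, 17, 24])
    print(packed.hex(), unpack_int4(packed, 4))
    print(transfer_beats(16, 4, packed=False), transfer_beats(16, 4, packed=True))
```

Under these assumptions, sixteen INT4 values sent one per transfer occupy sixteen bus beats, while the packed form fits in a single beat. This is the kind of gap that determines whether transfer latency can still be hidden behind computation in a double-buffered design.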
