Fan Zhihua, Wu Xinxin, Li Wenming, Cao Huawei, An Xuejun, Ye Xiaochun, Fan Dongrui. Dataflow Architecture Optimization for Low-Precision Neural Networks[J]. Journal of Computer Research and Development, 2023, 60(1): 43-58. DOI: 10.7544/issn1000-1239.202111275

Dataflow Architecture Optimization for Low-Precision Neural Networks

Funds: This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences (XDC05000000), the National Natural Science Foundation of China (61732018, 61872335), the International Partnership Program of Chinese Academy of Sciences (171111KYSB20200002), the Open Project of Zhejiang Lab (2022PB0AB01), and the Youth Innovation Promotion Association of Chinese Academy of Sciences.
More Information
  • Received Date: December 23, 2021
  • Revised Date: June 06, 2022
  • Available Online: February 10, 2023
  • Abstract: The execution model of dataflow architectures closely matches the execution pattern of neural network algorithms and can exploit abundant parallelism. However, as low-precision neural networks have developed, research on dataflow architectures has not kept pace with them. Deploying low-precision (INT8, INT4, or lower) neural networks on a traditional dataflow architecture raises three challenges. 1) The data path of the traditional dataflow architecture does not match low-precision data, so the performance and energy-efficiency advantages of low-precision neural networks cannot be realized. 2) Vectorized low-precision data must be arranged contiguously in on-chip memory, but they are scattered across the off-chip memory hierarchy, which makes load and write-back operations more complicated; the memory-access components of the traditional dataflow architecture cannot support this complex access pattern efficiently. 3) Traditional dataflow architectures rely on a double-buffering mechanism to hide transfer latency, but when low-precision data are transferred, bandwidth utilization drops significantly, so the computation latency can no longer cover the transfer latency and the double-buffering mechanism risks failing, degrading both performance and energy efficiency. To solve these problems, we optimize the dataflow architecture and design a low-precision neural network accelerator named DPU_Q. First, we design a flexible and reconfigurable computing unit that dynamically reconfigures its data path according to the precision flag of each instruction; it supports a variety of low-precision operations efficiently and flexibly, and further improves the performance and throughput of the architecture. In addition, to handle the complex access pattern of low-precision data, we design a Scatter engine that splices and preprocesses low-precision data scattered across the off-chip/lower memory hierarchy so that they meet the layout requirements of the on-chip/higher memory hierarchy. The Scatter engine also resolves the reduced bandwidth utilization when transferring low-precision data: the transfer latency does not increase significantly and can be fully hidden by the double-buffering mechanism. Finally, we propose a scheduling method for low-precision neural networks that fully reuses weights and activations, reducing memory-access overhead. Experiments show that, compared with a GPU of the same precision (Titan Xp), a state-of-the-art dataflow architecture (Eyeriss), and a state-of-the-art low-precision neural network accelerator (BitFusion), DPU_Q achieves 3.18×, 6.05×, and 1.52× performance improvements and 4.49×, 1.6×, and 1.13× energy-efficiency improvements, respectively.
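    The idea of a precision-flag-driven data path can be illustrated in software. The following is a minimal sketch, assuming a 32-bit multiply-accumulate lane that is reused as two INT16, four INT8, or eight INT4 sub-lanes selected by a per-instruction precision flag; the names (Precision, mac_lane) are illustrative and not taken from the DPU_Q design.

    ```cpp
    // Minimal software model of a precision-flag-driven MAC lane (hypothetical
    // illustration; not the DPU_Q RTL or instruction format).
    #include <cstdint>
    #include <cstdio>

    enum class Precision { INT16 = 16, INT8 = 8, INT4 = 4 };

    // Sign-extend the low `bits` bits of x to a 32-bit signed value.
    static int32_t sext(uint32_t x, int bits) {
        int shift = 32 - bits;
        return static_cast<int32_t>(x << shift) >> shift;
    }

    // One 32-bit operand word carries 32/bits packed sub-operands, so the same
    // physical lane produces 2, 4, or 8 products per word depending on the
    // instruction's precision flag; a wide accumulator collects the partial sums.
    int32_t mac_lane(uint32_t a_word, uint32_t b_word, Precision p, int32_t acc) {
        const int bits = static_cast<int>(p);
        const uint32_t mask = (1u << bits) - 1u;
        for (int i = 0; i < 32 / bits; ++i) {
            int32_t a = sext((a_word >> (i * bits)) & mask, bits);
            int32_t b = sext((b_word >> (i * bits)) & mask, bits);
            acc += a * b;
        }
        return acc;
    }

    int main() {
        // INT8 vectors {1,2,3,4} and {5,6,7,8} packed into single 32-bit words.
        uint32_t a = 0x04030201u, b = 0x08070605u;
        printf("INT8 dot product = %d\n", mac_lane(a, b, Precision::INT8, 0)); // prints 70
        return 0;
    }
    ```

    Narrower precisions yield more products per operand word on the same lane, which is the general mechanism behind the throughput gains claimed for low-precision operation.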

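    Similarly, the role described for the Scatter engine, gathering low-precision values that lie scattered or strided in off-chip memory and splicing them into the packed, contiguous layout expected by on-chip buffers, can be sketched as follows. This is a software illustration under assumed parameters (base address, byte stride, nibble packing), not the actual hardware interface.

    ```cpp
    // Minimal sketch of splicing scattered low-precision data into a packed,
    // contiguous on-chip layout (hypothetical helper, not the Scatter engine API).
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Gather one INT4 value every `stride` bytes starting at `base` in the
    // off-chip image, and pack eight values per 32-bit word so that each wide
    // transfer carries fully packed vectors instead of sparse nibbles.
    std::vector<uint32_t> splice_int4(const std::vector<uint8_t>& offchip,
                                      std::size_t base, std::size_t stride,
                                      std::size_t count) {
        std::vector<uint32_t> onchip((count + 7) / 8, 0u);
        for (std::size_t i = 0; i < count; ++i) {
            uint32_t nibble = offchip[base + i * stride] & 0xFu; // low nibble = INT4 value
            onchip[i / 8] |= nibble << ((i % 8) * 4);            // splice into packed word
        }
        return onchip;
    }
    ```

    Packing narrow values into full-width words before the wide transfer is what keeps bandwidth utilization high enough for the double-buffering mechanism to keep hiding the transfer latency.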
  • References

    [1] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks[C] //Proc of the 25th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2012: 1097−1105
    [2] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint, arXiv: 1409.1556, 2014
    [3] Song Zhuoran, Fu Bangqi, Wu Feiyang, et al. DRQ: Dynamic region-based quantization for deep neural network acceleration[C] //Proc of the 47th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2020: 1010−1021
    [4] Ueki T, Keisuke I, Matsubara T, et al. AQSS: Accelerator of quantization neural networks with stochastic approach[C] //Proc of the 6th Int Symp on Computing and Networking Workshops. Piscataway, NJ: IEEE, 2018: 138−144
    [5] Park E, Kim D, Yoo S. Energy-efficient neural network accelerator based on outlier-aware low-precision computation[C] //Proc of the 45th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018: 688−698
    [6] Sharma H, Park J, Suda N, et al. BitFusion: Bit-level dynamically composable architecture for accelerating deep neural networks[C] //Proc of the 45th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018: 764−775
    [7] Chen Yu-Hsin, Krishna T, Emer J, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2017, 52(1): 127−138 doi: 10.1109/JSSC.2016.2616357
    [8] Wu Xinxin, Fan Zhihua, Liu Tianyu, et al. LRP: Predictive output activation based on SVD approach for CNNs acceleration[C] //Proc of the 25th Design, Automation & Test in Europe. Piscataway, NJ: IEEE, 2022: 837−842
    [9] Courbariaux M, Bengio Y, David J P. BinaryConnect: Training deep neural networks with binary weights during propagations[C] //Proc of the 28th Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 3123−3131
    [10] Zhang Dongqing, Yang Jiaolong, Ye Dongqiangzi, et al. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks[C] //Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 365−382
    [11] Wang Naigang, Choi J, Brand D, et al. Training deep neural networks with 8-bit floating point numbers[C] //Proc of the 31st Int Conf on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2018: 7685−7694
    [12] Chen Tianshi, Du Zidong, Sun Ninghui, et al. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning[C] //Proc of the 19th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2014: 269−284
    [13] Wu Jiaxiang, Leng Cong, Wang Yuhang, et al. Quantized convolutional neural networks for mobile devices[C] //Proc of the 29th Int Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4820−4828
    [14] Park E, Yoo S, Vajda P. Value-aware quantization for training and inference of neural networks[C] //Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 608−624
    [15] Deng Lei, Li Guoqi, Han Song, et al. Model compression and hardware acceleration for neural networks: A comprehensive survey[J]. Proceedings of the IEEE, 2020, 108(4): 485−532 doi: 10.1109/JPROC.2020.2976475
    [16] Jung S, Son C, Lee S, et al. Learning to quantize deep networks by optimizing quantization intervals with task loss[C] //Proc of the 31st Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 4350−4359
    [17] Szegedy C, Liu Wei, Jia Yangqing, et al. Going deeper with convolutions[C] //Proc of the 27th Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 1−9
    [18] Han Song, Mao Huizi, Dally W J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding[J]. arXiv preprint, arXiv: 1510.00149v2, 2015
    [19] Ye Xiaochun, Fan Dongrui, Sun Ninghui, et al. SimICT: A fast and flexible framework for performance and power evaluation of large-scale architecture[C] //Proc of the 18th Int Symp on Low Power Electronics and Design (ISLPED). Piscataway, NJ: IEEE, 2013: 273−278
  • Related Articles

    [1]Wang Yanwei, Li Rengang, Xu Ran, Liu Junkai. Data Center Heterogeneous Acceleration Software-Hardware System-Level Platform Based on Reconfigurable Architecture[J]. Journal of Computer Research and Development, 2025, 62(4): 963-977. DOI: 10.7544/issn1000-1239.202440041
    [2]Li Rengang, Wang Yanwei, Hao Rui, Xiao Linge, Yang Le, Yang Guangwen, Kan Hongwei. Direct xPU: A Novel Distributed Heterogeneous Computing Architecture Optimized for Inter-node Communication Optimization[J]. Journal of Computer Research and Development, 2024, 61(6): 1388-1400. DOI: 10.7544/issn1000-1239.202440055
    [3]Xie Minhui, Lu Youyou, Feng Yangyang, Shu Jiwu. A Recommendation Model Inference System Based on GPU Direct Storage Access Architecture[J]. Journal of Computer Research and Development, 2024, 61(3): 589-599. DOI: 10.7544/issn1000-1239.202330402
    [4]Feng Xinyue, Yang Qiusong, Shi Lin, Wang Qing, Li Mingshu. Critical Memory Data Access Monitor Based on Dynamic Strategy Learning[J]. Journal of Computer Research and Development, 2019, 56(7): 1470-1487. DOI: 10.7544/issn1000-1239.2019.20180577
    [5]Mao Haiyu, Shu Jiwu. 3D Memristor Array Based Neural Network Processing in Memory Architecture[J]. Journal of Computer Research and Development, 2019, 56(6): 1149-1160. DOI: 10.7544/issn1000-1239.2019.20190099
    [6]Su Wen, Zhang Longbing, Gao Xiang, Su Menghao. A Cache Locking and Direct Cache Access Based Network Processing Optimization Method[J]. Journal of Computer Research and Development, 2014, 51(3): 681-690.
    [7]Cai Wanwei, Tai Yunfang, Liu Qi, Zhang Ge. Memory Virtulization on MIPS Architecture[J]. Journal of Computer Research and Development, 2013, 50(10): 2247-2252.
    [8]Shen Huanghui, Wang Zhensong, Zheng Weimin. An Efficient Memory Access Strategy for Transposition and Block Operation in Image Processing[J]. Journal of Computer Research and Development, 2013, 50(1): 188-196.
    [9]Liu Dan, Feng Yi, Tong Dong, Cheng Xu, and Wang Keyi. A Bus Arbitration Scheme for Memory Access Performance Optimization[J]. Journal of Computer Research and Development, 2012, 49(5): 1061-1071.
    [10]Wang Kai, Chen Fei, Li Qiang, Li Xiaomin, An Xuejun, Sun Ninghui. Research on Hyper-Node Controller for High Performance Computer[J]. Journal of Computer Research and Development, 2011, 48(1): 1-8.
  • Cited by

    Journal citations (2)

    1. Mu Yudong, Li Wenming, Fan Zhihua, Wu Meng, Wu Haibin, An Xuejun, Ye Xiaochun, Fan Dongrui. Research on dataflow architecture optimization for YOLO neural networks. Chinese Journal of Computers, 2025(01): 82-99.
    2. Feng Shihao. A multi-host communication method for group-controlled robots using 5G Internet of Things technology. Internet of Things Technologies, 2024(10): 56-60.

    Other citations (2)
