Citation: Xiao Qian, Zhao Meijia, Li Mingfan, Shen Li, Chen Junshi, Zhou Wenhao, Wang Fei, An Hong. A Dataflow Computing System for New Generation of Domestic Heterogeneous Many-Core Processors[J]. Journal of Computer Research and Development, 2023, 60(10): 2405-2417. DOI: 10.7544/issn1000-1239.202220562
Today, scientific research has moved from the era of computational science to the era of data science. Discovering laws from massive data and breaking through bottlenecks in scientific development are the main goals of the data-science paradigm. At the same time, high-performance computers are placing increasing emphasis on intelligent computing power. Integrating AI algorithms into traditional high-performance computing methods (HPC+AI) is better suited to solving practical scientific problems in the era of data science and can make full use of the intelligent computing power of high-performance computers. However, supporting HPC+AI programs on domestic HPC systems, especially systems built on the new generation of domestic heterogeneous many-core processors, poses many challenges. In this paper, we propose a dataflow computing system for domestic heterogeneous many-core processors, called swFLOWpro. The system supports building dataflow programs through the TensorFlow interface, provides many-core parallel acceleration that is transparent to users, and implements a two-level parallel strategy designed from the perspective of the whole processor. Tested on the sw26010pro processor, swFLOWpro achieves a many-core speedup of up to 545 times within a single core group (CG) for typical operators and up to 346 times for typical deep learning models. Running the ResNet50 model on all 6 CGs of one processor yields a speedup of up to 4.96 times over a single CG, corresponding to a parallel efficiency of 82.6% (4.96/6). Experiments show that swFLOWpro supports the efficient execution of dataflow programs, represented by deep learning, on domestic heterogeneous many-core processors.
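As an illustration of the programming interface the abstract describes, the sketch below shows the kind of ordinary TensorFlow program a user would write; according to the paper, swFLOWpro takes such a dataflow graph and accelerates it across the processor's core groups without user-visible changes. The model choice, input shape, batch size, and training settings here are illustrative assumptions, not the paper's configuration, and no swFLOWpro-specific setup is shown.

```python
# A minimal, hypothetical TensorFlow-style dataflow program of the kind the
# abstract describes. All hyperparameters and the synthetic data are
# illustrative assumptions; swFLOWpro-specific configuration is not shown.
import tensorflow as tf

# Build ResNet50 as a dataflow graph via the standard Keras application API.
model = tf.keras.applications.ResNet50(weights=None, classes=1000,
                                        input_shape=(224, 224, 3))
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy")

# A random synthetic batch stands in for real training data.
images = tf.random.normal([32, 224, 224, 3])
labels = tf.random.uniform([32], maxval=1000, dtype=tf.int32)

# One training step; per the abstract, the same graph would be mapped onto the
# sw26010pro core groups transparently by the runtime.
model.train_on_batch(images, labels)
```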