Citation: Xiao Qian, Zhao Meijia, Li Mingfan, Shen Li, Chen Junshi, Zhou Wenhao, Wang Fei, An Hong. A Dataflow Computing System for New Generation of Domestic Heterogeneous Many-Core Processors[J]. Journal of Computer Research and Development, 2023, 60(10): 2405-2417. DOI: 10.7544/issn1000-1239.202220562
Today, scientific research has moved from the era of computational science to the era of data science. Discovering laws from massive data and breaking through bottlenecks in scientific development are the main goals of the data-science paradigm. At the same time, high-performance computers are placing increasing emphasis on intelligent computing power. Integrating AI algorithms into traditional high-performance computing methods (HPC+AI) is better suited to solving practical scientific problems in the era of data science and can make full use of the intelligent computing power of high-performance computers. However, supporting HPC+AI programs on domestic HPC systems, especially systems built on the new generation of domestic heterogeneous many-core processors, poses many challenges. In this paper, we propose a dataflow computing system for domestic heterogeneous many-core processors, called swFLOWpro. The system supports building dataflow programs through the TensorFlow interface, provides many-core parallel acceleration that is transparent to users, and implements a two-level parallel strategy designed from the perspective of the whole processor. Tested on the sw26010pro processor, swFLOWpro achieves a many-core speedup of up to 545 times within a single core group (CG) for typical operators and up to 346 times for typical deep learning models. Running the ResNet50 model on all 6 CGs of one processor yields a speedup of up to 4.96 times over a single CG, corresponding to a parallel efficiency of 82.6% (4.96/6). Experiments show that swFLOWpro supports the efficient execution of dataflow programs, represented by deep learning, on domestic heterogeneous many-core processors.
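As an illustration of the programming interface the abstract describes, the sketch below shows the kind of ordinary TensorFlow program a user would write; according to the paper, swFLOWpro takes such a dataflow graph and accelerates it across the processor's core groups without user-visible changes. The model choice, input shape, batch size, and training settings here are illustrative assumptions, not the paper's configuration, and no swFLOWpro-specific setup is shown.

```python
# A minimal, hypothetical TensorFlow-style dataflow program of the kind the
# abstract describes. All hyperparameters and the synthetic data are
# illustrative assumptions; swFLOWpro-specific configuration is not shown.
import tensorflow as tf

# Build ResNet50 as a dataflow graph via the standard Keras application API.
model = tf.keras.applications.ResNet50(weights=None, classes=1000,
                                        input_shape=(224, 224, 3))
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy")

# A random synthetic batch stands in for real training data.
images = tf.random.normal([32, 224, 224, 3])
labels = tf.random.uniform([32], maxval=1000, dtype=tf.int32)

# One training step; per the abstract, the same graph would be mapped onto the
# sw26010pro core groups transparently by the runtime.
model.train_on_batch(images, labels)
```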