A Dataflow Computing System for New Generation of Domestic Heterogeneous Many-Core Processors

肖谦; 赵美佳; 李名凡; 沈莉; 陈俊仕; 周文浩; 王飞; 安虹

doi:10.7544/issn1000-1239.202220562

Abstract

Scientific research has moved from the era of computational science to the era of data science. The main goals of the data science paradigm are to discover patterns in massive data and to break through bottlenecks in scientific development. At the same time, high performance computing (HPC) systems are paying increasing attention to intelligent computing power. Integrating artificial intelligence algorithms with traditional HPC methods (HPC+AI) is better suited to solving practical problems in the data science era and makes fuller use of the intelligent computing power of these machines. However, supporting HPC+AI applications on domestic HPC systems, especially systems built from the new generation of domestic heterogeneous many-core processor sw26010pro, faces many challenges. This paper proposes swFLOWpro, a dataflow computing system for domestic heterogeneous many-core processors. The system supports building dataflow programs through the TensorFlow interface, provides many-core acceleration that is transparent to users, and implements a two-level parallel strategy that spans the whole processor. In tests on the sw26010pro processor, swFLOWpro achieves a many-core speedup on a single core group (CG) of up to 545 times for typical core operators (OPs) and up to 346 times for typical deep learning models. Training the ResNet50 model in parallel on all 6 CGs of the processor achieves a speedup of 4.96 times over a single CG, a parallel efficiency of 82.6%. The experiments show that swFLOWpro can support the efficient execution of dataflow programs, represented by deep learning, on domestic heterogeneous many-core processors.
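The parallel efficiency quoted above follows from the usual definition: speedup divided by the number of parallel units. The short sketch below reproduces that arithmetic from the two figures given in the abstract (a speedup of 4.96 on 6 CGs). The function name `parallel_efficiency` is a hypothetical helper introduced only for illustration; it is not part of swFLOWpro or TensorFlow.

```python
def parallel_efficiency(speedup: float, units: int) -> float:
    """Return parallel efficiency, defined as speedup / number of parallel units."""
    if units <= 0:
        raise ValueError("units must be positive")
    return speedup / units


# Figures taken from the abstract: ResNet50 training on all 6 core groups (CGs)
# runs 4.96 times faster than on a single CG.
efficiency = parallel_efficiency(4.96, 6)
print(f"parallel efficiency: {efficiency:.4f}")
```

The result is about 0.827, in line with the reported 82.6%. The small difference presumably comes from rounding of the published speedup figure, which is an assumption rather than something the abstract states.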
