Porting and Parallel Optimization of Common Operators Based on Heterogeneous Programming Models

马兆佳; 邵恩; 狄战元; 马立贤

doi:10.7544/issn1000-1239.202330869


Porting and Parallel Optimization of Common Operators Based on Heterogeneous Programming Models

Abstract

GPUs, the core computing components of large-scale supercomputing systems, are becoming increasingly diverse and heterogeneous in architecture, and accelerators from different chip vendors differ substantially in design. Diversity in both accelerator types and programming models is therefore an important technical trend in building such systems. It requires developers to provide high-performance common algorithm libraries for multiple hardware platforms, which leads to duplicated development effort. The unified programming model SYCL was introduced to reduce this cost and has been adapted to multiple hardware platforms, but on each platform its performance still falls short of the native programming model. To carry mature CUDA (compute unified device architecture) programming practice and high-performance CUDA programs over to SYCL, it is necessary to examine how such programs perform on multiple platforms after porting and how they can be optimized further. Based on software-hardware co-design, we propose paraTRANS, a common-operator optimization tool for migrating code across heterogeneous programming models to SYCL, and give methods for optimizing the migrated SYCL GEMM (general matrix multiplication) in different scenarios. We evaluate the paraTRANS-optimized SYCL GEMM on NVIDIA RTX 3090 and AMD MI100. On the RTX 3090 it reaches 96.95% of the performance (FLOPS) of the native CUDA operator; on the MI100 its percentage of hardware peak performance is 100.47% of the percentage that CUDA achieves on the RTX 3090. Both results are close to the pre-migration level. This work offers an approach for porting high-performance CUDA code to SYCL and optimizing it further.
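
The two reported figures can be read as ratios of different kinds. The sketch below assumes that "percentage of hardware peak" means achieved FLOPS divided by the device's theoretical peak FLOPS; the symbols $\eta$ and $F$ are introduced here for illustration and do not come from the paper.

\[
\eta = \frac{F_{\text{achieved}}}{F_{\text{peak}}}, \qquad
\frac{F_{\text{SYCL, RTX 3090}}}{F_{\text{CUDA, RTX 3090}}} = 96.95\%, \qquad
\frac{\eta_{\text{SYCL, MI100}}}{\eta_{\text{CUDA, RTX 3090}}} = 100.47\%.
\]

Under this reading, the RTX 3090 result is a direct throughput comparison on one device, while the MI100 result is normalized by each device's peak, since native CUDA does not run on the MI100.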
