Porting and Parallel Optimization of Common Operators Based on Heterogeneous Programming Models

Ma Zhaojia; Shao En; Di Zhanyuan; Ma Lixian

doi:10.7544/issn1000-1239.202330869

Ma Zhaojia, Shao En, Di Zhanyuan, Ma Lixian. Porting and Parallel Optimization of Common Operators Based on Heterogeneous Programming Models[J]. Journal of Computer Research and Development, 2025, 62(4): 1017-1032. DOI: 10.7544/issn1000-1239.202330869

Abstract

GPUs, the fundamental computing components of large-scale supercomputing systems, are becoming increasingly diverse and heterogeneous in architecture. GPU accelerators from different chip manufacturers vary significantly in their architectural designs, and the diversity of both accelerators and programming models is an important technical trend in building large-scale supercomputing systems. This diversity requires developers to deliver high-performance software for multiple hardware platforms, which leads to duplicated development effort. The unified programming model SYCL (system-wide compute language) reduces this cost by targeting multiple hardware platforms, but its performance on a given platform falls short of that platform’s native programming model and needs further optimization. To carry the mature and complete programming ideas of CUDA (compute unified device architecture), along with existing high-performance CUDA programs, over to SYCL, it is necessary to study how high-performance CUDA programs perform on multiple platforms after being ported to SYCL and how they can be optimized further. Based on software-hardware co-design, we propose paraTRANS, a common-operator optimization system for the process of migrating code to the cross-platform heterogeneous programming model SYCL, and give optimization methods for the migrated SYCL GEMM (general matrix multiplication) in different scenarios. We evaluate the SYCL GEMM optimized by paraTRANS: on the NVIDIA RTX 3090, the original platform, it achieves 96.95% of CUDA’s FLOPS, and on the AMD MI100, measured as a percentage of hardware peak performance, it achieves 100.47% of the CUDA figure. Both results are close to the level before migration. This paper provides ideas for porting high-performance CUDA code to SYCL and optimizing it further.
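The abstract reports two kinds of ratio: ported throughput relative to native throughput, and ported peak utilization relative to native peak utilization. The abstract does not say how these are computed, so the following is only a minimal sketch of one plausible reading. It assumes the conventional GEMM operation count of 2·M·N·K; all function names are hypothetical and any inputs are placeholders, not measurements from the paper.

```python
def gemm_flop_count(m, n, k):
    """Operations in C = A @ B for A (m x k) and B (k x n).

    Assumes the conventional count: one multiply and one add
    per term of each inner product, i.e. 2 * m * n * k.
    """
    return 2 * m * n * k


def achieved_flops(m, n, k, seconds):
    """Throughput in FLOP/s for one GEMM that took `seconds` to run."""
    return gemm_flop_count(m, n, k) / seconds


def peak_fraction(achieved, hardware_peak):
    """Fraction of the device's theoretical peak that was reached."""
    return achieved / hardware_peak


def relative_percent(ported, native):
    """Ported result expressed as a percentage of the native result.

    Works for raw FLOP/s (same device) or for peak fractions
    (different devices with different theoretical peaks).
    """
    return 100.0 * ported / native
```

Comparing peak fractions rather than raw FLOP/s is what makes a cross-vendor comparison meaningful: the two devices have different theoretical peaks, so normalizing by each device's own peak removes that difference before the ratio is taken.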
