Ma Zhaojia, Shao En, Di Zhanyuan, Ma Lixian. Porting and Parallel Optimization of Common Operators Based on Heterogeneous Programming Models[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202330869

Porting and Parallel Optimization of Common Operators Based on Heterogeneous Programming Models

Funds: This work was supported by the National Key Research and Development Program of China (2021YFB0300202), the National Natural Science Foundation of China (62102396), the Beijing Nova Program (Z211100002121143, 20220484217), the Youth Innovation Promotion Association of Chinese Academy of Sciences (2021099), the CCF-Ant Research Fund (CCF-AFSGRF20230207), the Pilot for Major Scientific Research Facility of Jiangsu Province (BM2021800), and the Innovation Funding of ICT, CAS (E461030).
  • Author Bio:

    Ma Zhaojia: born in 2000. Master's candidate. His main research interest is high-performance computing.

    Shao En: born in 1988. PhD, senior engineer, master's supervisor. Senior member of CCF. His main research interests include computer interconnection networks, SYCL, high-performance computing, and systems software.

    Di Zhanyuan: born in 1999. PhD candidate. His main research interests include parallel computing, compilation optimization, and code generation

    Ma Lixian: born in 1993. PhD candidate, engineer. His main research interests include high-performance computing and machine learning.

  • Received Date: October 30, 2023
  • Revised Date: July 01, 2024
  • Accepted Date: August 08, 2024
  • Available Online: August 13, 2024
  • Abstract: GPUs, the fundamental computing components of large-scale supercomputing systems, are becoming increasingly diverse and heterogeneous in architecture: accelerators from different chip manufacturers vary significantly in their architectural designs. Diversity of accelerators and of programming models is therefore an important technical trend in building large-scale supercomputing systems. It requires developers to provide high-performance software for multiple hardware platforms, which leads to duplicated development effort. The unified programming model SYCL (system-wide compute language) reduces this cost by targeting multiple hardware platforms, but SYCL code does not yet perform as well as each platform's native programming model and needs further optimization. To carry the mature CUDA (compute unified device architecture) programming practices and high-performance programs over to SYCL, it is necessary to study how high-performance CUDA programs perform on multiple platforms after being ported to SYCL and how they can be optimized further. Based on software-hardware co-design, we propose paraTRANS, a system for optimizing common operators during code migration to the cross-platform heterogeneous programming model SYCL, and we give optimization methods for the migrated SYCL GEMM (general matrix multiplication) in different scenarios. In our evaluation, SYCL GEMM optimized by paraTRANS reaches 96.95% of the FLOPS of the CUDA version on the original NVIDIA RTX 3090 platform, and on AMD MI100 it reaches 100.47% of the percentage of hardware peak performance achieved by CUDA; both results are close to the pre-migration level. This paper offers guidance on porting high-performance CUDA code to SYCL and optimizing it further.
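
    The two percentages reported in the abstract compare a ported SYCL GEMM against its CUDA baseline. The Python sketch below shows one plausible way such figures could be computed; it is not taken from the paper. It assumes the usual GEMM operation count of 2·M·N·K, and it assumes that "hardware peak performance percentage" means achieved FLOPS divided by the device's theoretical peak. All function names and the timings in the usage section are hypothetical.

```python
def gemm_flop_count(m: int, n: int, k: int) -> int:
    """Operations for C = A (m x k) times B (k x n): one multiply and one add per term."""
    return 2 * m * n * k


def achieved_flops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in FLOPS for one GEMM call that took `seconds` to run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flop_count(m, n, k) / seconds


def relative_throughput(ported_flops: float, native_flops: float) -> float:
    """Ported throughput as a percentage of the native throughput (same device)."""
    return 100.0 * ported_flops / native_flops


def peak_efficiency(flops: float, peak_flops: float) -> float:
    """Achieved throughput as a percentage of the device's theoretical peak (assumed definition)."""
    return 100.0 * flops / peak_flops


def relative_peak_efficiency(ported_flops: float, ported_peak: float,
                             native_flops: float, native_peak: float) -> float:
    """Ratio of two peak efficiencies, for comparing runs on different devices."""
    return 100.0 * (peak_efficiency(ported_flops, ported_peak)
                    / peak_efficiency(native_flops, native_peak))


if __name__ == "__main__":
    # Hypothetical timings for illustration only; these are not the paper's measurements.
    m = n = k = 4096
    native = achieved_flops(m, n, k, seconds=0.0100)
    ported = achieved_flops(m, n, k, seconds=0.0105)
    print(f"ported / native throughput: {relative_throughput(ported, native):.2f}%")
```

    Comparing peak efficiencies rather than raw FLOPS is what makes a cross-device comparison meaningful, since two GPUs with different theoretical peaks cannot be compared on absolute throughput alone.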
